* [PATCHES][RFC][CFT] mount-related stuff
From: Al Viro @ 2025-08-25  4:40 UTC
To: linux-fsdevel; +Cc: Linus Torvalds, Christian Brauner, Jan Kara

	Most of this pile is basically an attempt to see how well
cleanup.h-style mechanisms apply in mount handling.  That stuff lives in
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git #work.mount
Rebased to -rc3 (used to be a bit past -rc2, branched at mount fixes merge).
Individual patches in followups.

	Please, help with review and testing.  It seems to survive the local
beating and code generation seems to be OK, but more testing would be a good
thing and I would really like to see comments on that stuff.

	This is not all I've got around mount handling, but I'd rather get
that thing out for review before starting to sort out other local
mount-related branches.

	Series overview:

Part 1: guards.

	This part starts with infrastructure, followed by one-by-one
conversions to the guard/scoped_guard in some of the places that fit
that well enough.  Note that one of those places turned out to be taking
mount_lock for no reason whatsoever; I already see places where we do
write_seqlock when read_seqlock_excl would suffice, etc.

	Folks, _please_ don't do any bulk conversions in that area.
IMO one area where RAII becomes dangerous is locking; usually it's not
a big deal to delay freeing some object a bit, but delay dropping a lock
and you risk introducing deadlocks that will be bloody hard to spot.
It _has_ to be done carefully; we had trouble in that area several times
over the last year or so in fs/namespace.c alone.

	Another fun problem is that quite a few comments regarding the
locking in there are stale.  We still have the comments that talk about
mount lock as if it had been an rwlock-like thing.  It hadn't been that
for more than a decade now.  It needs to be documented sanely; so do the
access rules to the data structures involved.  I hope to get some of that
into the tree this cycle, but it's still in progress.

1/52) fs/namespace.c: fix the namespace_sem guard mess
	New guards: namespace_excl and namespace_shared.  The former implies
the latter, as for anything rwsem-like.  No inode locks, no dropping the
final references, no opening files, etc. in scope of those.
2/52) introduced guards for mount_lock
	New guards: mount_writer, mount_locked_reader.  That's write_seqlock
and read_seqlock_excl on mount_lock; obviously, nothing blocking should be
done in scope of those.
3/52) fs/namespace.c: allow to drop vfsmount references via __free(mntput)
	Missing DEFINE_FREE (for mntput()); local in fs/namespace.c, to be
used only for keeping shit out of namespace_... and mount_... scopes.
4/52) __detach_mounts(): use guards
5/52) __is_local_mountpoint(): use guards
6/52) do_change_type(): use guards
7/52) do_set_group(): use guards
8/52) mark_mounts_for_expiry(): use guards
9/52) put_mnt_ns(): use guards
10/52) mnt_already_visible(): use guards
	a bunch of clear-cut conversions, with explanations of the reasons
why this or that guard is needed.
11/52) check_for_nsfs_mounts(): no need to take locks
	... and here we have one where it turns out that locking had been
excessive.  Iterating through a subtree in mount_locked_reader scope is
safe, all right, but (1) mount_writer is not needed here at all and
(2) namespace_shared + a reference held to the root of subtree is also
enough.  All callers had (2) already.  Documented the locking requirements
for function, removed {,un}lock_mount_hash() in it...
12/52) propagate_mnt(): use scoped_guard(mount_locked_reader) for mnt_set_mountpoint()
	This one is interesting - existing code had been equivalent to
scoped_guard(mount_locked_reader), and it's right for that call.  However,
mnt_set_mountpoint() generally requires mount_writer - the only reason we
get away with that here is that the mount in question never had been
reachable from the mounts visible to other threads.
13/52) has_locked_children(): use guards
14/52) mnt_set_expiry(): use guards
15/52) path_is_under(): use guards
	more clear-cut conversions with explanations.
16/52) current_chrooted(): don't bother with follow_down_one()
17/52) current_chrooted(): use guards
	this pair might be better off with #16 taken to the beginning of
the series (or to a separate branch merged into this one); no better reason
to do as I had than wanting to keep the guard infrastructure in the very
beginning.

Part 2: turning unlock_mount() into __cleanup.

	Environment for mounting something on given location consists of:
1) namespace_excl scope
2) parent mount - the one we'll be attaching things to.
3) mountpoint to be, protected from disappearing under us.
4) inode of that mountpoint's dentry held exclusive.

	Unfortunately, we can't take inode locks in namespace_excl scopes.
And we want to cope with the possibility that somebody has managed to mount
something on that place while we'd been taking locks.  "Cope" part is
simple for finish_automount() ("drop our mount and go away quietly;
somebody triggered it before we did"), but for everything else it's
trickier - "use whatever's overmounting that place now (with the right
locks, please)".

	lock_mount() does all of that (do_lock_mount(), actually), with
unlock_mount() closing the scope.  And it's definitely a good candidate for
__cleanup()-based approach, except that
	* the damn thing can return an error and conditional variants of
that infrastructure are too revolting.
	* parent mount is returned in a fucking awful way - we modify the
struct path passed to us as location to mount on and then its ->mnt is the
parent to be... except for the "beneath" variant where we play convoluted
games with "no, here we want the parent of that".  Implementation is also
vulnerable to umount propagation races.
	* the structure we set up (everything except the parent) is
inserted into a linked list by lock_mount().  That excludes DEFINE_CLASS() -
it wants the value formed and then copied to the variable we are defining.
	* it contains an implicit namespace_excl scope, so path_put() and
its ilk *must* be done after the unlock_mount().  And most of the users
have gotos past that.

	The first two problems are solved by adding an explicit pointer to
parent mount into struct pinned_mountpoint.  Having lock_mount() failure
reported by setting it to ERR_PTR(-E...) allows us to avoid the problem with
expressing the constructor failure.

	The third one is dealt with by defining local macros to be used
instead of CLASS - I went with LOCK_MOUNT(mp, path) which defines
struct pinned_mountpoint mp with __cleanup(unlock_mount) and sets it up.
If anybody has better suggestions, I'll be glad to hear those.

	The last one is dealt with by massaging the users to a form that
would have all post-unlock_mount() stuff done by __free().

First, several trivial cleanups:
18/52) do_move_mount(): trim local variables
19/52) do_move_mount(): deal with the checks on old_path early
20/52) move_mount(2): take sanity checks in 'beneath' case into do_lock_mount()
21/52) finish_automount(): simplify the ELOOP check

Getting rid of post-unlock_mount() stuff:
22/52) do_loopback(): use __free(path_put) to deal with old_path
23/52) pivot_root(2): use __free() to deal with struct path in it
24/52) finish_automount(): take the lock_mount() analogue into a helper
	this one turns the open-coded logics into lock_mount_exact() with
the same kind of calling conventions as lock_mount() and do_lock_mount()
25/52) do_new_mount_rc(): use __free() to deal with dropping mnt on failure
26/52) finish_automount(): use __free() to deal with dropping mnt on failure

This is the main part:
27/52) change calling conventions for lock_mount() et.al.

Followups, cleaning up the games with parent mount in the user:
28/52) do_move_mount(): use the parent mount returned by do_lock_mount()
29/52) do_add_mount(): switch to passing pinned_mountpoint instead of mountpoint + path
30/52) graft_tree(), attach_recursive_mnt() - pass pinned_mountpoint

Part 3: getting rid of mutating struct path there.

	do_lock_mount() is still playing silly buggers with struct path it
had been given - the logics in that thing hadn't changed.  It's not a
pretty function and it's racy as well; the thing is, by this point its
users have almost no use for the changed contents of struct path - dentry
can be derived from struct mountpoint, parent mount to use is provided
directly and we want that a lot more than modified path->mnt.  There's only
one place (in can_move_mount_beneath()) where we still want that and it's
not hard to reconstruct the value by *original* path->mnt value + parent
mount to be used.

Getting rid of ->dentry uses.
31/52) pivot_root(2): use old_mp.mp->m_dentry instead of old.dentry
32/52) don't bother passing new_path->dentry to can_move_mount_beneath()

A helper, already open-coded in a couple of places; carved out of the next
patch to keep it reasonably small
33/52) new helper: topmost_overmount()

Rewrite of do_lock_mount() to keep path constant + trivial change in
do_move_mount() to adjust the argument it passes to can_move_mount_beneath():
34/52) do_lock_mount(): don't modify path.

Part 4: a bunch of trivial cleanups (mostly constifications)

35/52) constify check_mnt()
36/52) do_mount_setattr(): constify path argument
37/52) do_set_group(): constify path arguments
38/52) drop_collected_paths(): constify arguments
39/52) collect_paths(): constify the return value
40/52) do_move_mount(), vfs_move_mount(), do_move_mount_old(): constify struct path argument(s)
41/52) mnt_warn_timestamp_expiry(): constify struct path argument
42/52) do_new_mount{,_fc}(): constify struct path argument
43/52) do_{loopback,change_type,remount,reconfigure_mnt}(): constify struct path argument
44/52) path_mount(): constify struct path argument
45/52) may_copy_tree(), __do_loopback(): constify struct path argument
46/52) path_umount(): constify struct path argument
47/52) constify can_move_mount_beneath() arguments
48/52) do_move_mount_old(): use __free(path_put)
49/52) do_mount(): use __free(path_put)

Part 5: assorted stuff, will grow.

50/52) umount_tree(): take all victims out of propagation graph at once
	[had been earlier]
	For each removed mount we need to calculate where the slaves will
end up.  To avoid duplicating that work, do it for all mounts to be removed
at once, taking the mounts themselves out of propagation graph as we go,
then do all transfers; the duplicate work on finding destinations is
avoided since if we run into a mount that already had destination found,
we don't need to trace the rest of the way.  That's guaranteed
O(removed mounts) for finding destinations and removing from propagation
graph and O(surviving mounts that have master removed) for transfers.
51/52) ecryptfs: get rid of pointless mount references in ecryptfs dentries
	->lower_path.mnt has the same value for all dentries on given
ecryptfs instance and if somebody goes for mountpoint-crossing variant
where that would not be true, we can deal with that when it happens (and
_not_ with duplicating these references into each dentry).
	As it is, we are better off just sticking a reference into
ecryptfs-private part of superblock and keeping it pinned until
->kill_sb().
	That way we can stick a reference to underlying dentry right into
->d_fsdata of ecryptfs one, getting rid of indirection through
struct ecryptfs_dentry_info, along with the entire
struct ecryptfs_dentry_info machinery.
52/52) fs/namespace.c: sanitize descriptions for {__,}lookup_mnt()
	Comments regarding "shadow mounts" were stale - no such thing
anymore.  Document the locking requirements for __lookup_mnt()...

FWIW, the current diffstat:
 fs/ecryptfs/dentry.c          |  14 +-
 fs/ecryptfs/ecryptfs_kernel.h |  27 +-
 fs/ecryptfs/file.c            |  15 +-
 fs/ecryptfs/inode.c           |  19 +-
 fs/ecryptfs/main.c            |  24 +-
 fs/internal.h                 |   4 +-
 fs/mount.h                    |  12 +
 fs/namespace.c                | 775 +++++++++++++++++++-----------------------
 fs/pnode.c                    |  75 ++--
 fs/pnode.h                    |   1 +
 include/linux/mount.h         |   4 +-
 kernel/audit_tree.c           |  12 +-
 12 files changed, 464 insertions(+), 518 deletions(-)
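
Editorial sketch for Part 2 of the overview: a userspace illustration of the
"report constructor failure through an ERR_PTR in the structure itself" idea
behind LOCK_MOUNT(mp, path).  Everything prefixed demo_/DEMO_ is hypothetical
and is not taken from the series; the error-pointer helpers are simplified
reimplementations of the ERR_PTR/IS_ERR convention, and the real
struct pinned_mountpoint and unlock_mount() may well look different.  It
assumes GCC or Clang for __attribute__((cleanup)).

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define DEMO_MAX_ERRNO 4095

/* Simplified userspace stand-ins for ERR_PTR()/PTR_ERR()/IS_ERR(). */
static inline void *demo_err_ptr(long err) { return (void *)(intptr_t)err; }
static inline long demo_ptr_err(const void *p) { return (long)(intptr_t)p; }
static inline int demo_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)(-DEMO_MAX_ERRNO);
}

struct demo_mount { int id; };

/* Hypothetical analogue of a pinned mountpoint: parent doubles as status. */
struct demo_pinned { struct demo_mount *parent; };

static int demo_locks_held;
static struct demo_mount demo_root = { 1 };

/* "Constructor": fills the structure in place, so it can never be copied. */
static void demo_do_lock(struct demo_pinned *mp, int path_ok)
{
	if (!path_ok) {
		mp->parent = demo_err_ptr(-ENOENT);	/* failure: nothing is held */
		return;
	}
	demo_locks_held++;
	mp->parent = &demo_root;
}

/* "Destructor": runs on every scope exit, does nothing after a failure. */
static void demo_unlock(struct demo_pinned *mp)
{
	if (mp->parent && !demo_is_err(mp->parent))
		demo_locks_held--;
}

/* Local macro used instead of a CLASS-style definition. */
#define DEMO_LOCK_MOUNT(mp, path_ok) \
	struct demo_pinned mp __attribute__((cleanup(demo_unlock))) = { 0 }; \
	demo_do_lock(&mp, (path_ok))

static long demo_mount_on(int path_ok)
{
	DEMO_LOCK_MOUNT(mp, path_ok);

	if (demo_is_err(mp.parent))
		return demo_ptr_err(mp.parent);	/* cleanup still runs, as a no-op */
	assert(demo_locks_held == 1);
	return mp.parent->id;			/* lock is dropped after this */
}
```

The structure is set up through a pointer rather than returned by value,
which is the property that rules out a copy-based class definition once the
object has been linked into a list.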
* [PATCH 01/52] fs/namespace.c: fix the namespace_sem guard mess
From: Al Viro @ 2025-08-25  4:43 UTC
To: linux-fsdevel; +Cc: brauner, jack, torvalds

If anything, namespace_lock should be DEFINE_LOCK_GUARD_0, not DEFINE_GUARD.
That way we
	* do not need to feed it a bogus argument
	* do not get gcc trying to compare an address of static in file
variable with -4097 - and, if we are unlucky, trying to keep it in a register,
with spills and all such.

The same problems apply to grabbing namespace_sem shared.

Rename it to namespace_excl, add namespace_shared, convert the existing users:

    guard(namespace_lock, &namespace_sem) => guard(namespace_excl)()
    guard(rwsem_read, &namespace_sem) => guard(namespace_shared)()
    scoped_guard(namespace_lock, &namespace_sem) => scoped_guard(namespace_excl)
    scoped_guard(rwsem_read, &namespace_sem) => scoped_guard(namespace_shared)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/namespace.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index ae6d1312b184..fcea65587ff9 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -82,6 +82,12 @@ static LIST_HEAD(ex_mountpoints); /* protected by namespace_sem */
 static struct mnt_namespace *emptied_ns; /* protected by namespace_sem */
 static DEFINE_SEQLOCK(mnt_ns_tree_lock);
 
+static inline void namespace_lock(void);
+static void namespace_unlock(void);
+DEFINE_LOCK_GUARD_0(namespace_excl, namespace_lock(), namespace_unlock())
+DEFINE_LOCK_GUARD_0(namespace_shared, down_read(&namespace_sem),
+				      up_read(&namespace_sem))
+
 #ifdef CONFIG_FSNOTIFY
 LIST_HEAD(notify_list); /* protected by namespace_sem */
 #endif
@@ -1776,8 +1782,6 @@ static inline void namespace_lock(void)
 	down_write(&namespace_sem);
 }
 
-DEFINE_GUARD(namespace_lock, struct rw_semaphore *, namespace_lock(), namespace_unlock())
-
 enum umount_tree_flags {
 	UMOUNT_SYNC = 1,
 	UMOUNT_PROPAGATE = 2,
@@ -2306,7 +2310,7 @@ struct path *collect_paths(const struct path *path,
 	struct path *res = prealloc, *to_free = NULL;
 	unsigned n = 0;
 
-	guard(rwsem_read)(&namespace_sem);
+	guard(namespace_shared)();
 
 	if (!check_mnt(root))
 		return ERR_PTR(-EINVAL);
@@ -2361,7 +2365,7 @@ void dissolve_on_fput(struct vfsmount *mnt)
 		return;
 	}
 
-	scoped_guard(namespace_lock, &namespace_sem) {
+	scoped_guard(namespace_excl) {
 		if (!anon_ns_root(m))
 			return;
 
@@ -2435,7 +2439,7 @@ struct vfsmount *clone_private_mount(const struct path *path)
 	struct mount *old_mnt = real_mount(path->mnt);
 	struct mount *new_mnt;
 
-	guard(rwsem_read)(&namespace_sem);
+	guard(namespace_shared)();
 
 	if (IS_MNT_UNBINDABLE(old_mnt))
 		return ERR_PTR(-EINVAL);
@@ -5957,7 +5961,7 @@ SYSCALL_DEFINE4(statmount, const struct mnt_id_req __user *, req,
 	if (ret)
 		return ret;
 
-	scoped_guard(rwsem_read, &namespace_sem)
+	scoped_guard(namespace_shared)
 		ret = do_statmount(ks, kreq.mnt_id, kreq.mnt_ns_id, ns);
 
 	if (!ret)
@@ -6079,7 +6083,7 @@ SYSCALL_DEFINE4(listmount, const struct mnt_id_req __user *, req,
 	 * We only need to guard against mount topology changes as
 	 * listmount() doesn't care about any mount properties.
 	 */
-	scoped_guard(rwsem_read, &namespace_sem)
+	scoped_guard(namespace_shared)
 		ret = do_listmount(ns, kreq.mnt_id, last_mnt_id, kmnt_ids,
 				   nr_mnt_ids, (flags & LISTMOUNT_REVERSE));
 	if (ret <= 0)
-- 
2.47.2
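
Editorial sketch for patch 01/52: a userspace illustration of what a
zero-argument scope guard amounts to.  The guard object names the lock by
its type, so the caller passes nothing and there is no pointer to test on
scope exit.  All demo_/DEMO_ names are hypothetical; this is not how
DEFINE_LOCK_GUARD_0 is implemented in cleanup.h, only the same general
technique built on the GCC/Clang cleanup attribute, with a plain counter
standing in for namespace_sem.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for namespace_sem held exclusive. */
static int demo_sem_held;
static int demo_lock_count;

static void demo_namespace_lock(void)
{
	assert(demo_sem_held == 0);	/* exclusive: nobody else may hold it */
	demo_sem_held = 1;
	demo_lock_count++;
}

static void demo_namespace_unlock(void)
{
	assert(demo_sem_held == 1);
	demo_sem_held = 0;
}

/* The guard carries no lock pointer; the lock is implied by the type. */
struct demo_excl_guard {
	int active;
};

static inline struct demo_excl_guard demo_excl_enter(void)
{
	demo_namespace_lock();
	return (struct demo_excl_guard){ .active = 1 };
}

static inline void demo_excl_exit(struct demo_excl_guard *g)
{
	if (g->active)
		demo_namespace_unlock();
}

/* Declares a guard variable whose cleanup handler drops the lock. */
#define DEMO_GUARD_EXCL() \
	struct demo_excl_guard demo_excl_scope \
		__attribute__((cleanup(demo_excl_exit), unused)) = demo_excl_enter()

static int demo_check(int value)
{
	DEMO_GUARD_EXCL();

	if (value < 0)
		return -EINVAL;		/* lock dropped on this path... */
	assert(demo_sem_held == 1);
	return value + 1;		/* ...and on this one */
}
```

Both return paths release the lock without an explicit unlock call, which
is also why anything that must not run under the lock has to be kept out of
the guarded scope.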
* [PATCH 02/52] introduced guards for mount_lock
From: Al Viro @ 2025-08-25  4:43 UTC
To: linux-fsdevel; +Cc: brauner, jack, torvalds

mount_writer: write_seqlock; that's an equivalent of {un,}lock_mount_hash()
mount_locked_reader: read_seqlock_excl; these tend to be open-coded.

No bulk conversions, please - if nothing else, quite a few places use
the mount_writer form when mount_locked_reader is sufficient.  It needs
to be dealt with carefully.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/mount.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/fs/mount.h b/fs/mount.h
index 97737051a8b9..ed8c83ba836a 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -154,6 +154,11 @@ static inline void get_mnt_ns(struct mnt_namespace *ns)
 
 extern seqlock_t mount_lock;
 
+DEFINE_LOCK_GUARD_0(mount_writer, write_seqlock(&mount_lock),
+		    write_sequnlock(&mount_lock))
+DEFINE_LOCK_GUARD_0(mount_locked_reader, read_seqlock_excl(&mount_lock),
+		    read_sequnlock_excl(&mount_lock))
+
 struct proc_mounts {
 	struct mnt_namespace *ns;
 	struct path root;
-- 
2.47.2
* Re: [PATCH 02/52] introduced guards for mount_lock
From: Christian Brauner @ 2025-08-25 12:32 UTC
To: Al Viro; +Cc: linux-fsdevel, jack, torvalds

On Mon, Aug 25, 2025 at 05:43:05AM +0100, Al Viro wrote:
> mount_writer: write_seqlock; that's an equivalent of {un,}lock_mount_hash()
> mount_locked_reader: read_seqlock_excl; these tend to be open-coded.

Do we really need the "locked" midfix in there? Doesn't seem to buy any
clarity. I'd drop it so the naming is nicely consistent.

>
> No bulk conversions, please - if nothing else, quite a few places use
> the mount_writer form when mount_locked_reader is sufficient.  It needs
> to be dealt with carefully.
>
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  fs/mount.h | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/fs/mount.h b/fs/mount.h
> index 97737051a8b9..ed8c83ba836a 100644
> --- a/fs/mount.h
> +++ b/fs/mount.h
> @@ -154,6 +154,11 @@ static inline void get_mnt_ns(struct mnt_namespace *ns)
>
>  extern seqlock_t mount_lock;
>
> +DEFINE_LOCK_GUARD_0(mount_writer, write_seqlock(&mount_lock),
> +		    write_sequnlock(&mount_lock))
> +DEFINE_LOCK_GUARD_0(mount_locked_reader, read_seqlock_excl(&mount_lock),
> +		    read_sequnlock_excl(&mount_lock))
> +
>  struct proc_mounts {
>  	struct mnt_namespace *ns;
>  	struct path root;
> --
> 2.47.2
>
* Re: [PATCH 02/52] introduced guards for mount_lock
From: Al Viro @ 2025-08-25 13:46 UTC
To: Christian Brauner; +Cc: linux-fsdevel, jack, torvalds

On Mon, Aug 25, 2025 at 02:32:38PM +0200, Christian Brauner wrote:
> On Mon, Aug 25, 2025 at 05:43:05AM +0100, Al Viro wrote:
> > mount_writer: write_seqlock; that's an equivalent of {un,}lock_mount_hash()
> > mount_locked_reader: read_seqlock_excl; these tend to be open-coded.
>
> Do we really need the "locked" midfix in there? Doesn't seem to buy any
> clarity. I'd drop it so the naming is nicely consistent.

It's a seqlock.  "Readers" in this context are lockless ones - sample/retry
under rcu_read_lock() kind.  The only difference between writer and locked
reader is that locked reader does not disrupt those sample/retry loops.

Note that for something that is never traversed locklessly (expiry lists,
lists of children, etc.) locked reader is fine for all accesses, including
modifications.

If you have better suggestions re terminology, I'd love to hear those, but
simply "writer"/"reader" is misleadingly similar to rw-semaphores/locks/whatnot.

Basically, there are 3 kinds of contexts here:
	1) lockless, must be under RCU, fairly limited in which pointers they
can traverse, read-only access to structures in question.  Must sample
the seqcount side of mount_lock first, then verify that it has not changed
after everything.

	2) hold the spinlock side of mount_lock, _without_ bumping the seqcount
one.  Can be used for reads and writes, as long as the stuff being modified
is not among the things that are traversed locklessly.  Do not disrupt the
previous class, have full exclusion with classes 2 and 3.

	3) hold the spinlock side of mount_lock, and bump the seqcount one on
entry and leave.  Any reads and writes.  Full exclusion with classes 2 and 3,
invalidates the checks for class 1 (i.e. will push it into retries/fallbacks/
whatnot).

I'm used to "lockless reader" for 1, "writer" for 3.  "locked reader" kinda
works for 2 - that's what it is wrt things that can be accessed by lockless
readers, but for the things that are *not* traversed without a lock it can be
actually used as a less disruptive form of 3.  Is used that way in mount
locking for some of the data structures.
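
Editorial sketch of the three kinds of contexts described above, as a toy
userspace seqlock built from C11 atomics.  It is single-threaded on purpose
and is not the kernel's seqlock_t: the demo_ names are hypothetical, default
sequentially consistent atomics are used instead of the kernel's barriers,
and there is no RCU.  It only shows which of the two "sides" each class
touches.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct demo_seqlock {
	atomic_uint seq;	/* seqcount side: odd while a writer is inside */
	atomic_flag spin;	/* spinlock side */
};

static struct demo_seqlock demo_sl = { 0, ATOMIC_FLAG_INIT };
static int demo_value;

static void demo_spin_lock(struct demo_seqlock *sl)
{
	while (atomic_flag_test_and_set(&sl->spin))
		;	/* busy-wait until the flag is clear */
}

static void demo_spin_unlock(struct demo_seqlock *sl)
{
	atomic_flag_clear(&sl->spin);
}

/* Class 3, "writer": takes the spinlock and bumps the seqcount twice. */
static void demo_write_lock(struct demo_seqlock *sl)
{
	demo_spin_lock(sl);
	atomic_fetch_add(&sl->seq, 1);
}

static void demo_write_unlock(struct demo_seqlock *sl)
{
	atomic_fetch_add(&sl->seq, 1);
	demo_spin_unlock(sl);
}

/* Class 2, "locked reader": spinlock only, seqcount left alone. */
static void demo_read_lock_excl(struct demo_seqlock *sl) { demo_spin_lock(sl); }
static void demo_read_unlock_excl(struct demo_seqlock *sl) { demo_spin_unlock(sl); }

/* Class 1, "lockless reader": sample first, validate afterwards. */
static unsigned demo_read_begin(struct demo_seqlock *sl)
{
	unsigned s;

	while ((s = atomic_load(&sl->seq)) & 1)
		;	/* writer in progress, sample again */
	return s;
}

static bool demo_read_retry(struct demo_seqlock *sl, unsigned start)
{
	return atomic_load(&sl->seq) != start;
}

/* Returns 0 if the classes interact as described, nonzero otherwise. */
static int demo_run(void)
{
	unsigned s = demo_read_begin(&demo_sl);

	demo_read_lock_excl(&demo_sl);
	demo_read_unlock_excl(&demo_sl);
	if (demo_read_retry(&demo_sl, s))
		return 1;	/* a locked reader must not force a retry */

	demo_write_lock(&demo_sl);
	demo_value = 42;
	demo_write_unlock(&demo_sl);
	if (!demo_read_retry(&demo_sl, s))
		return 2;	/* a writer must invalidate the earlier sample */
	return 0;
}
```

Classes 2 and 3 exclude each other through the spinlock side; only class 3
is visible to class 1, which is the reason a locked reader is the less
disruptive choice where nothing it modifies is traversed locklessly.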
* Re: [PATCH 02/52] introduced guards for mount_lock
From: Al Viro @ 2025-08-25 20:21 UTC
To: Christian Brauner; +Cc: linux-fsdevel, jack, torvalds

On Mon, Aug 25, 2025 at 02:46:04PM +0100, Al Viro wrote:

> Basically, there are 3 kinds of contexts here:
> 	1) lockless, must be under RCU, fairly limited in which pointers they
> can traverse, read-only access to structures in question.  Must sample
> the seqcount side of mount_lock first, then verify that it has not changed
> after everything.
>
> 	2) hold the spinlock side of mount_lock, _without_ bumping the seqcount
> one.  Can be used for reads and writes, as long as the stuff being modified
> is not among the things that are traversed locklessly.  Do not disrupt the
> previous class, have full exclusion with classes 2 and 3.
>
> 	3) hold the spinlock side of mount_lock, and bump the seqcount one on
> entry and leave.  Any reads and writes.  Full exclusion with classes 2 and 3,
> invalidates the checks for class 1 (i.e. will push it into retries/fallbacks/
> whatnot).

FWIW, partial dump from what I hope to push out as docs:

* all modifications of mount hash chains must be mount_writer.

* only one function is allowed to traverse hash chains - __lookup_mnt().
Important part here is reachability - hash is a shared data structure,
but a struct mount instance can be reached that way only if it has parent
equal to the argument you've been able to pass to __lookup_mnt().

* callers of __lookup_mnt() must either be at least mount_locked_reader
OR hold rcu_read_lock through the entire thing, sample the seqcount side
of mount_lock before the call, validate it afterwards and discard the
attempt entirely if validation fails.  Note that __legitimize_mnt()
contains validation.

* being hashed contributes 1 to refcount.

* (sub)tree topology (encoded in ->mnt_parent, ->mnt_mounts/->mnt_child,
->mnt_mp, ->mnt_mountpoint and ->overmount) is stabilized by either
mount_locked_reader OR by namespace_shared + positive refcount for root of
subtree.

namespace_shared by itself is *NOT* enough.  When the last reference to
mount past the umount_tree() (i.e. already with NULL ->mnt_ns) goes away,
any subtree stuck to it will be detached from it and have its root
unhashed and dropped.  In other words, such tree (e.g. result of umount -l)
decays from root to leaves - once all references to root are gone, it's cut
off and all pieces are left to decay.  That is done with mount_writer (has
to be - there are mount hash changes and for those mount_writer is a hard
requirement) and only after the final reference to root has been dropped.

All other topology changes happen with namespace_excl and, at least,
mount_locked_reader.  Normally - with mount_writer; the only exception is
that setting parent for a newly allocated subtree is fine with
mount_locked_reader; we are not hashing it yet (that's done only in
commit_tree()), so there's no need to disrupt the lockless readers; note
that RCU pathwalk *is* such, so blind use of mount_writer has an effect on
performance.

->mnt_mounts/->mnt_child is never traversed unless the tree is stabilized
by either lock (note that list modifications there are not with ..._rcu()
primitives).

->overmount, ->mnt_parent and ->mnt_mountpoint can be; those need
sample/validate on the seqcount side; it *would* require mount_writer from
those who modify them, except that for the ones that had never been
reachable yet we don't need to bother.  In practice, ->overmount is changed
along with the mount hash, so we need mount_writer anyway;
->mnt_parent/->mnt_mountpoint/->mnt_mp need it only for reachable mounts.

[[
	FWIW, I'm considering the possibility of having copy_tree() delay
hashing all nodes in the copy and having them hashed all at once; fewer
disruptions for lockless readers that way.  All nodes in the copy are
reachable only for the caller; we do need mount_locked_reader for attaching
a new node to copy (it has to be inserted into the per-mountpoint lists of
mounts), but we don't need to bump the seqcount every time - and we can't
hold a spinlock over allocations.  It's not even that hard; all we'd need
is a bit of a change in commit_tree() and in a couple of places where we
create a namespace with more than one node - we have the loops in those
places already where we insert the mounts into per-namespace rbtrees; same
loops could handle hashing them.
]]

* propagation graph (->mnt_share, ->mnt_slave/->mnt_slave_list,
->mnt_master, ->mnt_group_id, IS_MNT_SHARED()) is modified only under
namespace_excl; all accesses are under at least namespace_shared.  Only
mounts that belong to a namespace may be reached via those; umount_tree()
removes all victims from the graph before it returns and it's impossible to
include something that isn't a part of some namespace into the graph
afterwards.

* ->mnt_expire is accessed (both traversals and modifications) under
mount_locked_reader.  No lockless traversals there.

* per-namespace rbtree (->mnt_node linkage) is modified only under
namespace_excl and all traversals are at least namespace_shared.  Mount
leaving a namespace is removed from that before the end of namespace_excl
scope.

* ->mnt_root and ->mnt_sb are assign-once; never changed.  So's
->mnt_devname, ->mnt_id and ->mnt_id_unique.

* per-mountpoint mount lists (->mnt_mp_list) are mount_locked_reader for
all accesses (modification and traversal alike).

* ->prev_ns is a fucking mess.

* ->mnt_umount has only transient uses; umount_tree() uses it to link the
victims to be dropped at namespace_unlock(), final mntput links the stuck
children into a list stashed into ->mnt_stuck_children, also for eventual
dropping (by cleanup_mnt()).  mount_writer for gathering them into those,
nothing for "dissolve and drop everything on the list" - in both cases the
lists are visible only to a single thread by that point.
* Re: [PATCH 02/52] introduced guards for mount_lock 2025-08-25 20:21 ` Al Viro @ 2025-08-25 23:44 ` Al Viro 2025-08-26 1:44 ` Al Viro 2025-08-26 15:17 ` Askar Safin 1 sibling, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-25 23:44 UTC (permalink / raw) To: Christian Brauner; +Cc: linux-fsdevel, jack, torvalds On Mon, Aug 25, 2025 at 09:21:41PM +0100, Al Viro wrote: > FWIW, I'm considering the possibility of having copy_tree() delay > hashing all nodes in the copy and having them hashed all at once; fewer disruptions > for lockless readers that way. All nodes in the copy are reachable only for the > caller; we do need mount_locked_reader for attaching a new node to copy (it has > to be inserted into the per-mountpoint lists of mounts), but we don't need to > bump the seqcount every time - and we can't hold a spinlock over allocations. > It's not even that hard; all we'd need is a bit of a change in commit_tree() > and in a couple of places where we create a namespace with more than one node - > we have the loops in those places already where we insert the mounts into > per-namespace rbtrees; same loops could handle hashing them. The main issue I'm having with that is that currently "in list of children" implies "hashed"; equivalent, even, except for a transient state seen only in mount_writer. OTOH, having that not true for unreachable mounts... I'm trying to find anything that might care, but I don't see any candidates. It would be nice to have regardless of doing fewer mount_lock seqcount bumps - better isolation from shared data structures until we glue them in place would make for simpler correctness proofs... Anyway, copy_tree() call chains: 1. copy_tree() <- propagate_mnt() <- attach_recursive_mnt(), with the call chain prior to that point being one the <- graft_tree() <- do_loopback() <- graft_tree() <- do_add_mount() <- do_new_mount_fc() <- graft_tree() <- do_add_mount() <- finish_automount() <- do_move_mount(). 
All of those start inside a lock_mount scope. Result gets passed (prior to return from attach_recursive_mnt(), within an mnt_writer scope there) either to commit_tree() or to umount_tree(), without having been visible to others prior to that. That's creation of secondary copies from mount propagation, for various pathways to mounting stuff. 2. copy_tree() <- __do_loopback() <- do_loopback(). Inside a lock_mount scope. Result gets passed into graft_tree() -> attach_recursive_mnt(). In the latter either it gets passed to commit_tree() (within mount_writer scope, without having been visible to others prior to that), in which case success is reported, or it is left alone and error gets reported; in that case back in do_loopback() it gets passed to umount_tree(), again in mount_writer scope and without having been visible to others prior to that. That's MS_BIND|MS_REC mount(2). 3. copy_tree() <- __do_loopback() <- open_detached_copy(). In namespace_excl scope. Result is fed through a loop that inserts those mounts into rbtree of new namespace (in mount_writer scope) and its root is stored as ->root of that new namespace. Once out of namespace_excl scope, the tree becomes visible (and an extra reference is attached to the file we are opening). That's open_tree(2)/open_tree_attr(2) with OPEN_TREE_CLONE. BTW, a bit of mystery there: insertions into rbtree don't need to be in mount_writer - we do have places where it's done without that, all readers are in namespace_shared scopes *and* the namespace, along with its rbtree, is not visible to anyone yet to start with. If we delay hashing until there it will need mount_writer, though. 4. copy_tree() <- copy_mnt_ns(). In namespace_excl scope. 
Somewhat similar to the previous, but the namespace is not an anonymous one and we have a couple of extra passes - one might do lock_mnt_tree() (under mount_writer, almost certainly excessive - mount_locked_reader would do just fine) and another (combined with rbtree insertions) finds the counterparts of root and pwd of the caller and flips over to those. Old ones get dropped after we leave the scope. Looks like we should be able to unify quite a bit of logics in populating a new namespace and yes, delaying hash insertions past copy_tree() looks plausible... Incidentally, destruction of new namespace on copy_tree() failure is another mystery: here we do ns_free_inum(&new_ns->ns); dec_mnt_namespaces(new_ns->ucounts); mnt_ns_release(new_ns); and in open_detached_copy() it's free_mnt_ns(ns); They are similar - free_mnt_ns() is if (!is_anon_ns(ns)) ns_free_inum(&ns->ns); dec_mnt_namespaces(ns->ucounts); mnt_ns_tree_remove(ns); and mnt_ns_tree_remove() is a bunch of !is_anon_ns() code, followed by an rcu-delayed mnt_ns_release(). So in case of open_detached_copy(), where the namespace is anonymous, it boils down to an RCU-delayed call of mnt_ns_release()... AFAICS the only possible reasons not to use free_mnt_ns() here are 1) avoiding an RCU-delayed call and 2) conditional removal of ns from mnt_ns_tree. As for the second, couldn't we simply use !list_empty(&ns->mnt_ns_list) as a condition? And avoiding an RCU delay... nice, in principle, but the case when that would've saved us anything is CLONE_NEWNS clone(2) or unshare(2) failing due to severe OOM. Do we give a damn about one extra call_rcu() for each of such failures? 
mnt_ns_tree handling is your code; do you see any problems with

static void mnt_ns_tree_remove(struct mnt_namespace *ns)
{
	/* remove from global mount namespace list */
	if (!list_empty(&ns->mnt_ns_list)) {
		mnt_ns_tree_write_lock();
		rb_erase(&ns->mnt_ns_tree_node, &mnt_ns_tree);
		list_bidir_del_rcu(&ns->mnt_ns_list);
		mnt_ns_tree_write_unlock();
	}

	call_rcu(&ns->mnt_ns_rcu, mnt_ns_release_rcu);
}

and
	mnt = __do_loopback(path, recursive);
	if (IS_ERR(mnt)) {
		emptied_ns = ns;
		namespace_unlock();
		return ERR_CAST(mnt);
	}
in open_detached_copy() and
	new = copy_tree(old, old->mnt.mnt_root, copy_flags);
	if (IS_ERR(new)) {
		emptied_ns = new_ns;
		namespace_unlock();
		return ERR_CAST(new);
	}
in copy_mnt_ns()?

^ permalink raw reply [flat|nested] 321+ messages in thread
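A rough userspace sketch of the list_empty()-based condition proposed above, for anyone who wants to poke at the idea outside the kernel. Everything named fake_* is hypothetical; the list helpers only imitate the shape of the kernel's circular lists, rcu_queued merely stands in for call_rcu(), and re-initialising the node on removal is a choice made for this sketch rather than a statement about list_bidir_del_rcu().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical circular doubly-linked list, shaped like struct list_head. */
struct fake_list {
	struct fake_list *next, *prev;
};

static void fake_list_init(struct fake_list *n)
{
	n->next = n->prev = n;
}

/* A node that was initialised but never added points at itself. */
static bool fake_list_empty(const struct fake_list *n)
{
	return n->next == n;
}

static void fake_list_add(struct fake_list *n, struct fake_list *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

static void fake_list_del_init(struct fake_list *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	fake_list_init(n);	/* sketch-only choice: leave the node self-linked */
}

/* Hypothetical stand-in for struct mnt_namespace. */
struct fake_ns {
	struct fake_list link;	/* plays the role of ->mnt_ns_list */
	int rcu_queued;		/* counts the simulated call_rcu() */
};

static int fake_tree_removals;

/* Unlink only if the namespace ever made it onto the global list. */
static void fake_ns_tree_remove(struct fake_ns *ns)
{
	if (!fake_list_empty(&ns->link)) {
		fake_list_del_init(&ns->link);
		fake_tree_removals++;
	}
	ns->rcu_queued++;	/* unconditional, as in the proposal */
}
```

With that condition the same helper can be called on a namespace that never reached the global list (the copy_tree() failure case) and on one that did; only the latter touches the list.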
* Re: [PATCH 02/52] introduced guards for mount_lock 2025-08-25 23:44 ` Al Viro @ 2025-08-26 1:44 ` Al Viro 0 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-08-26 1:44 UTC (permalink / raw) To: Christian Brauner; +Cc: linux-fsdevel, jack, torvalds On Tue, Aug 26, 2025 at 12:44:13AM +0100, Al Viro wrote: > As for the second, couldn't we simply use !list_empty(&ns->mnt_ns_list) > as a condition? And avoiding an RCU delay... nice, in principle, but > the case when that would've saved us anything is CLONE_NEWNS clone(2) or > unshare(2) failing due to severe OOM. Do we give a damn about one extra > call_rcu() for each of such failures? > > mnt_ns_tree handling is your code; do you see any problems with ... this (on top of the posted series, needs to be carved into several parts - dropping pointless lock_mount_hash() in open_detached_copy(), making mnt_ns_tree_remove() and thus free_mnt_ns() safe to use on ns not in mnt_ns_tree yet, then dealing with open_detached_copy() and copy_mnt_ns() separately): diff --git a/fs/namespace.c b/fs/namespace.c index 63b74d7384fd..b77469789f82 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -195,7 +195,7 @@ static void mnt_ns_release_rcu(struct rcu_head *rcu) static void mnt_ns_tree_remove(struct mnt_namespace *ns) { /* remove from global mount namespace list */ - if (!is_anon_ns(ns)) { + if (!list_empty(&ns->mnt_ns_list)) { mnt_ns_tree_write_lock(); rb_erase(&ns->mnt_ns_tree_node, &mnt_ns_tree); list_bidir_del_rcu(&ns->mnt_ns_list); @@ -3053,18 +3053,17 @@ static int do_loopback(const struct path *path, const char *old_name, return err; } -static struct file *open_detached_copy(struct path *path, bool recursive) +static struct mnt_namespace *get_detached_copy(const struct path *path, bool recursive) { struct mnt_namespace *ns, *mnt_ns = current->nsproxy->mnt_ns, *src_mnt_ns; struct user_namespace *user_ns = mnt_ns->user_ns; struct mount *mnt, *p; - struct file *file; ns = alloc_mnt_ns(user_ns, true); if 
(IS_ERR(ns)) - return ERR_CAST(ns); + return ns; - namespace_lock(); + guard(namespace_excl)(); /* * Record the sequence number of the source mount namespace. @@ -3081,23 +3080,28 @@ static struct file *open_detached_copy(struct path *path, bool recursive) mnt = __do_loopback(path, recursive); if (IS_ERR(mnt)) { - namespace_unlock(); - free_mnt_ns(ns); + emptied_ns = ns; return ERR_CAST(mnt); } - lock_mount_hash(); for (p = mnt; p; p = next_mnt(p, mnt)) { mnt_add_to_ns(ns, p); ns->nr_mounts++; } ns->root = mnt; - mntget(&mnt->mnt); - unlock_mount_hash(); - namespace_unlock(); + return ns; +} + +static struct file *open_detached_copy(struct path *path, bool recursive) +{ + struct mnt_namespace *ns = get_detached_copy(path, recursive); + struct file *file; + + if (IS_ERR(ns)) + return ERR_CAST(ns); mntput(path->mnt); - path->mnt = &mnt->mnt; + path->mnt = mntget(&ns->root->mnt); file = dentry_open(path, O_PATH, current_cred()); if (IS_ERR(file)) dissolve_on_fput(path->mnt); @@ -4165,7 +4169,8 @@ struct mnt_namespace *copy_mnt_ns(unsigned long flags, struct mnt_namespace *ns, struct user_namespace *user_ns, struct fs_struct *new_fs) { struct mnt_namespace *new_ns; - struct vfsmount *rootmnt = NULL, *pwdmnt = NULL; + struct vfsmount *rootmnt __free(mntput)= NULL; + struct vfsmount *pwdmnt __free(mntput) = NULL; struct mount *p, *q; struct mount *old; struct mount *new; @@ -4184,23 +4189,20 @@ struct mnt_namespace *copy_mnt_ns(unsigned long flags, struct mnt_namespace *ns, if (IS_ERR(new_ns)) return new_ns; - namespace_lock(); + guard(namespace_excl)(); /* First pass: copy the tree topology */ copy_flags = CL_COPY_UNBINDABLE | CL_EXPIRE; if (user_ns != ns->user_ns) copy_flags |= CL_SLAVE; new = copy_tree(old, old->mnt.mnt_root, copy_flags); if (IS_ERR(new)) { - namespace_unlock(); - ns_free_inum(&new_ns->ns); - dec_mnt_namespaces(new_ns->ucounts); - mnt_ns_release(new_ns); + emptied_ns = new_ns; return ERR_CAST(new); } + if (user_ns != ns->user_ns) { - 
lock_mount_hash(); - lock_mnt_tree(new); - unlock_mount_hash(); + scoped_guard(mount_writer) + lock_mnt_tree(new); } new_ns->root = new; @@ -4232,12 +4234,6 @@ struct mnt_namespace *copy_mnt_ns(unsigned long flags, struct mnt_namespace *ns, while (p->mnt.mnt_root != q->mnt.mnt_root) p = next_mnt(skip_mnt_tree(p), old); } - namespace_unlock(); - - if (rootmnt) - mntput(rootmnt); - if (pwdmnt) - mntput(pwdmnt); mnt_ns_tree_add(new_ns); return new_ns; ^ permalink raw reply related [flat|nested] 321+ messages in thread
* Re: [PATCH 02/52] introduced guards for mount_lock 2025-08-25 20:21 ` Al Viro 2025-08-25 23:44 ` Al Viro @ 2025-08-26 15:17 ` Askar Safin 2025-08-26 15:45 ` Al Viro 1 sibling, 1 reply; 321+ messages in thread
From: Askar Safin @ 2025-08-26 15:17 UTC (permalink / raw)
To: viro; +Cc: brauner, jack, linux-fsdevel, torvalds

Al Viro <viro@zeniv.linux.org.uk>:
> When the last reference to
> mount past the umount_tree() (i.e. already with NULL ->mnt_ns) goes away, anything
> subtree stuck to it will be detached from it and have its root unhashed and dropped.
> In other words, such tree (e.g. result of umount -l) decays from root to leaves -
> once all references to root are gone, it's cut off and all pieces are left
> to decay. That is done with mount_writer (has to be - there are mount hash changes
> and for those mount_writer is a hard requirement) and only after the final reference
> to root has been dropped.

I'm unable to understand this.

As far as I understand your text, when you unmount some directory /a using "umount -l /a", then /a and
all its children will stay as long as there are references to /a. This contradicts reality.

Consider this:

# mount -t tmpfs tmpfs /a
# mkdir /a/b
# mount -t tmpfs tmpfs /a/b
# mkdir /a/b/c
# cd /a
# umount -l /a

According to your text, both /a and /a/b will stay, because we have reference to /a (via our cwd).

But in reality /a/b disappears immediately (i.e. "ls b" shows nothing, as opposed to "c").

This happens even if I test with your patches applied.

So, your explanation seems to be wrong.

--
Askar Safin

^ permalink raw reply [flat|nested] 321+ messages in thread
* Re: [PATCH 02/52] introduced guards for mount_lock 2025-08-26 15:17 ` Askar Safin @ 2025-08-26 15:45 ` Al Viro 0 siblings, 0 replies; 321+ messages in thread
From: Al Viro @ 2025-08-26 15:45 UTC (permalink / raw)
To: Askar Safin; +Cc: brauner, jack, linux-fsdevel, torvalds

On Tue, Aug 26, 2025 at 06:17:45PM +0300, Askar Safin wrote:
> Al Viro <viro@zeniv.linux.org.uk>:
> > When the last reference to
> > mount past the umount_tree() (i.e. already with NULL ->mnt_ns) goes away, anything
> > subtree stuck to it will be detached from it and have its root unhashed and dropped.
> > In other words, such tree (e.g. result of umount -l) decays from root to leaves -
> > once all references to root are gone, it's cut off and all pieces are left
> > to decay. That is done with mount_writer (has to be - there are mount hash changes
> > and for those mount_writer is a hard requirement) and only after the final reference
> > to root has been dropped.
>
> I'm unable to understand this.
>
> As far as I understand your text, when you unmount some directory /a using "umount -l /a", then /a and
> all its children will stay as long as there are references to /a. This contradicts reality.
>
> Consider this:
>
> # mount -t tmpfs tmpfs /a
> # mkdir /a/b
> # mount -t tmpfs tmpfs /a/b
> # mkdir /a/b/c
> # cd /a
> # umount -l /a
>
> According to your text, both /a and /a/b will stay, because we have reference to /a (via our cwd).
>
> But in reality /a/b disappears immediately (i.e. "ls b" shows nothing, as opposed to "c").
>
> This happens even if I test with your patches applied.
>
> So, your explanation seems to be wrong.

Take a look at disconnect_mount(). For example, if mount is locked (== propagated across the userns boundary), it will remain stuck to its parent.

^ permalink raw reply [flat|nested] 321+ messages in thread
* [PATCH 03/52] fs/namespace.c: allow to drop vfsmount references via __free(mntput) 2025-08-25 4:43 ` [PATCH 01/52] fs/namespace.c: fix the namespace_sem guard mess Al Viro 2025-08-25 4:43 ` [PATCH 02/52] introduced guards for mount_lock Al Viro @ 2025-08-25 4:43 ` Al Viro 2025-08-25 12:33 ` Christian Brauner 2025-08-25 4:43 ` [PATCH 04/52] __detach_mounts(): use guards Al Viro ` (49 subsequent siblings) 51 siblings, 1 reply; 321+ messages in thread
From: Al Viro @ 2025-08-25 4:43 UTC (permalink / raw)
To: linux-fsdevel; +Cc: brauner, jack, torvalds

Note that just as path_put, it should never be done in scope of
namespace_sem, be it shared or exclusive.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/namespace.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/namespace.c b/fs/namespace.c
index fcea65587ff9..767ab751ee2a 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -88,6 +88,8 @@ DEFINE_LOCK_GUARD_0(namespace_excl, namespace_lock(), namespace_unlock())
 DEFINE_LOCK_GUARD_0(namespace_shared, down_read(&namespace_sem),
 				    up_read(&namespace_sem))

+DEFINE_FREE(mntput, struct vfsmount *, if (!IS_ERR(_T)) mntput(_T))
+
 #ifdef CONFIG_FSNOTIFY
 LIST_HEAD(notify_list); /* protected by namespace_sem */
 #endif
--
2.47.2

^ permalink raw reply related [flat|nested] 321+ messages in thread
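For readers not yet used to the cleanup.h machinery behind __free(), here is a minimal userspace sketch of the idea; it is not the kernel code. It assumes GCC or Clang (for __attribute__((cleanup))), and every fake_* name as well as use_mount() is hypothetical. It shows a reference being dropped automatically at scope exit, with error pointers skipped in the same way the DEFINE_FREE() in this patch skips IS_ERR() values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct vfsmount: nothing but a refcount. */
struct fake_mnt {
	int refs;
};

/* Userspace imitation of the kernel's error-pointer convention. */
#define FAKE_MAX_ERRNO 4095

static inline int fake_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-FAKE_MAX_ERRNO;
}

static inline void *fake_err_ptr(intptr_t err)
{
	return (void *)err;
}

/* Like mntput(), tolerates NULL. */
static void fake_mntput(struct fake_mnt *m)
{
	if (m)
		m->refs--;
}

/* Cleanup handlers get a pointer to the annotated variable. */
static void fake_mntput_cleanup(struct fake_mnt **p)
{
	if (!fake_is_err(*p))
		fake_mntput(*p);
}

#define FAKE_FREE_MNTPUT __attribute__((cleanup(fake_mntput_cleanup)))

/*
 * Holds a reference for the duration of the scope.  Returns the
 * refcount observed while the reference is held, or -12 on the
 * simulated failure path.
 */
static int use_mount(struct fake_mnt *m, int fail)
{
	struct fake_mnt *held FAKE_FREE_MNTPUT = NULL;

	if (fail) {
		held = fake_err_ptr(-12);	/* cleanup must skip this */
		return -12;
	}
	m->refs++;
	held = m;
	return held->refs;	/* value computed first, then the ref is dropped */
}
```

The point of keeping this separate from the lock guards is ordering: the variable carrying __free(mntput) can be declared before the guard, so the reference is dropped only after the lock scope has been left.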
* Re: [PATCH 03/52] fs/namespace.c: allow to drop vfsmount references via __free(mntput) 2025-08-25 4:43 ` [PATCH 03/52] fs/namespace.c: allow to drop vfsmount references via __free(mntput) Al Viro @ 2025-08-25 12:33 ` Christian Brauner 0 siblings, 0 replies; 321+ messages in thread From: Christian Brauner @ 2025-08-25 12:33 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, jack, torvalds On Mon, Aug 25, 2025 at 05:43:06AM +0100, Al Viro wrote: > Note that just as path_put, it should never be done in scope of > namespace_sem, be it shared or exclusive. > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > --- Reviewed-by: Christian Brauner <brauner@kernel.org> ^ permalink raw reply [flat|nested] 321+ messages in thread
* [PATCH 04/52] __detach_mounts(): use guards 2025-08-25 4:43 ` [PATCH 01/52] fs/namespace.c: fix the namespace_sem guard mess Al Viro 2025-08-25 4:43 ` [PATCH 02/52] introduced guards for mount_lock Al Viro 2025-08-25 4:43 ` [PATCH 03/52] fs/namespace.c: allow to drop vfsmount references via __free(mntput) Al Viro @ 2025-08-25 4:43 ` Al Viro 2025-08-25 12:33 ` Christian Brauner 2025-08-25 4:43 ` [PATCH 05/52] __is_local_mountpoint(): " Al Viro ` (48 subsequent siblings) 51 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-25 4:43 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Clean fit for guards use; guards can't be weaker due to umount_tree() calls. --- fs/namespace.c | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 767ab751ee2a..1ae1ab8815c9 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2032,10 +2032,11 @@ void __detach_mounts(struct dentry *dentry) struct pinned_mountpoint mp = {}; struct mount *mnt; - namespace_lock(); - lock_mount_hash(); + guard(namespace_excl)(); + guard(mount_writer)(); + if (!lookup_mountpoint(dentry, &mp)) - goto out_unlock; + return; event++; while (mp.node.next) { @@ -2047,9 +2048,6 @@ void __detach_mounts(struct dentry *dentry) else umount_tree(mnt, UMOUNT_CONNECTED); } unpin_mountpoint(&mp); -out_unlock: - unlock_mount_hash(); - namespace_unlock(); } /* -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* Re: [PATCH 04/52] __detach_mounts(): use guards 2025-08-25 4:43 ` [PATCH 04/52] __detach_mounts(): use guards Al Viro @ 2025-08-25 12:33 ` Christian Brauner 0 siblings, 0 replies; 321+ messages in thread From: Christian Brauner @ 2025-08-25 12:33 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, jack, torvalds On Mon, Aug 25, 2025 at 05:43:07AM +0100, Al Viro wrote: > Clean fit for guards use; guards can't be weaker due to umount_tree() calls. > --- Reviewed-by: Christian Brauner <brauner@kernel.org> ^ permalink raw reply [flat|nested] 321+ messages in thread
* [PATCH 05/52] __is_local_mountpoint(): use guards 2025-08-25 4:43 ` [PATCH 01/52] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (2 preceding siblings ...) 2025-08-25 4:43 ` [PATCH 04/52] __detach_mounts(): use guards Al Viro @ 2025-08-25 4:43 ` Al Viro 2025-08-25 12:33 ` Christian Brauner 2025-08-25 4:43 ` [PATCH 06/52] do_change_type(): " Al Viro ` (47 subsequent siblings) 51 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-25 4:43 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds clean fit; namespace_shared due to iterating through ns->mounts. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 15 ++++++--------- 1 file changed, 6 insertions(+), 9 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 1ae1ab8815c9..f1460ddd1486 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -906,17 +906,14 @@ bool __is_local_mountpoint(const struct dentry *dentry) { struct mnt_namespace *ns = current->nsproxy->mnt_ns; struct mount *mnt, *n; - bool is_covered = false; - down_read(&namespace_sem); - rbtree_postorder_for_each_entry_safe(mnt, n, &ns->mounts, mnt_node) { - is_covered = (mnt->mnt_mountpoint == dentry); - if (is_covered) - break; - } - up_read(&namespace_sem); + guard(namespace_shared)(); + + rbtree_postorder_for_each_entry_safe(mnt, n, &ns->mounts, mnt_node) + if (mnt->mnt_mountpoint == dentry) + return true; - return is_covered; + return false; } struct pinned_mountpoint { -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
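The shape of this conversion (take the lock via a guard, then simply return from inside the loop) can be tried out in userspace. The sketch below is an imitation, not the kernel's guard(): it assumes GCC or Clang for __attribute__((cleanup)), a plain counter plays the role of namespace_sem, and all fake_* names and is_local_mountpoint() are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for namespace_sem: number of shared holders. */
static int fake_sem_readers;

/* Guard object; exists only so that a cleanup handler can be attached. */
struct fake_shared_guard {
	int unused;
};

static struct fake_shared_guard fake_shared_guard_enter(void)
{
	fake_sem_readers++;		/* "down_read()" */
	return (struct fake_shared_guard){ 0 };
}

static void fake_shared_guard_exit(struct fake_shared_guard *g)
{
	(void)g;
	fake_sem_readers--;		/* "up_read()" */
}

/* Declares a guard that lives until the end of the enclosing scope. */
#define FAKE_GUARD_SHARED()						\
	struct fake_shared_guard fake_guard_				\
		__attribute__((cleanup(fake_shared_guard_exit), unused)) = \
		fake_shared_guard_enter()

/* Hypothetical, heavily reduced struct mount. */
struct fake_mount {
	const void *mountpoint;
};

static bool is_local_mountpoint(const struct fake_mount *mounts, size_t n,
				const void *dentry)
{
	FAKE_GUARD_SHARED();

	for (size_t i = 0; i < n; i++)
		if (mounts[i].mountpoint == dentry)
			return true;	/* lock dropped by the guard */
	return false;			/* ... on this path as well */
}
```

Both exits leave the counter at zero, which is what lets the is_covered variable and the explicit unlock go away.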
* Re: [PATCH 05/52] __is_local_mountpoint(): use guards 2025-08-25 4:43 ` [PATCH 05/52] __is_local_mountpoint(): " Al Viro @ 2025-08-25 12:33 ` Christian Brauner 0 siblings, 0 replies; 321+ messages in thread From: Christian Brauner @ 2025-08-25 12:33 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, jack, torvalds On Mon, Aug 25, 2025 at 05:43:08AM +0100, Al Viro wrote: > clean fit; namespace_shared due to iterating through ns->mounts. > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > --- Reviewed-by: Christian Brauner <brauner@kernel.org> ^ permalink raw reply [flat|nested] 321+ messages in thread
* [PATCH 06/52] do_change_type(): use guards 2025-08-25 4:43 ` [PATCH 01/52] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (3 preceding siblings ...) 2025-08-25 4:43 ` [PATCH 05/52] __is_local_mountpoint(): " Al Viro @ 2025-08-25 4:43 ` Al Viro 2025-08-25 12:34 ` Christian Brauner 2025-08-25 4:43 ` [PATCH 07/52] do_set_group(): " Al Viro ` (46 subsequent siblings) 51 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-25 4:43 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds clean fit; namespace_excl to modify propagation graph Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index f1460ddd1486..a6a7b068770a 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2899,7 +2899,7 @@ static int do_change_type(struct path *path, int ms_flags) struct mount *mnt = real_mount(path->mnt); int recurse = ms_flags & MS_REC; int type; - int err = 0; + int err; if (!path_mounted(path)) return -EINVAL; @@ -2908,23 +2908,22 @@ static int do_change_type(struct path *path, int ms_flags) if (!type) return -EINVAL; - namespace_lock(); + guard(namespace_excl)(); + err = may_change_propagation(mnt); if (err) - goto out_unlock; + return err; if (type == MS_SHARED) { err = invent_group_ids(mnt, recurse); if (err) - goto out_unlock; + return err; } for (m = mnt; m; m = (recurse ? next_mnt(m, mnt) : NULL)) change_mnt_propagation(m, type); - out_unlock: - namespace_unlock(); - return err; + return 0; } /* may_copy_tree() - check if a mount tree can be copied -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* Re: [PATCH 06/52] do_change_type(): use guards 2025-08-25 4:43 ` [PATCH 06/52] do_change_type(): " Al Viro @ 2025-08-25 12:34 ` Christian Brauner 0 siblings, 0 replies; 321+ messages in thread From: Christian Brauner @ 2025-08-25 12:34 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, jack, torvalds On Mon, Aug 25, 2025 at 05:43:09AM +0100, Al Viro wrote: > clean fit; namespace_excl to modify propagation graph > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > --- Reviewed-by: Christian Brauner <brauner@kernel.org> ^ permalink raw reply [flat|nested] 321+ messages in thread
* [PATCH 07/52] do_set_group(): use guards 2025-08-25 4:43 ` [PATCH 01/52] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (4 preceding siblings ...) 2025-08-25 4:43 ` [PATCH 06/52] do_change_type(): " Al Viro @ 2025-08-25 4:43 ` Al Viro 2025-08-25 12:35 ` Christian Brauner 2025-08-25 4:43 ` [PATCH 08/52] mark_mounts_for_expiry(): " Al Viro ` (45 subsequent siblings) 51 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-25 4:43 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds clean fit; namespace_excl to modify propagation graph Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 33 +++++++++++++-------------------- 1 file changed, 13 insertions(+), 20 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index a6a7b068770a..13e2f3837a26 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3349,47 +3349,44 @@ static inline int tree_contains_unbindable(struct mount *mnt) static int do_set_group(struct path *from_path, struct path *to_path) { - struct mount *from, *to; + struct mount *from = real_mount(from_path->mnt); + struct mount *to = real_mount(to_path->mnt); int err; - from = real_mount(from_path->mnt); - to = real_mount(to_path->mnt); - - namespace_lock(); + guard(namespace_excl)(); err = may_change_propagation(from); if (err) - goto out; + return err; err = may_change_propagation(to); if (err) - goto out; + return err; - err = -EINVAL; /* To and From paths should be mount roots */ if (!path_mounted(from_path)) - goto out; + return -EINVAL; if (!path_mounted(to_path)) - goto out; + return -EINVAL; /* Setting sharing groups is only allowed across same superblock */ if (from->mnt.mnt_sb != to->mnt.mnt_sb) - goto out; + return -EINVAL; /* From mount root should be wider than To mount root */ if (!is_subdir(to->mnt.mnt_root, from->mnt.mnt_root)) - goto out; + return -EINVAL; /* From mount should not have locked children in place of To's root */ if (__has_locked_children(from, to->mnt.mnt_root)) 
- goto out; + return -EINVAL; /* Setting sharing groups is only allowed on private mounts */ if (IS_MNT_SHARED(to) || IS_MNT_SLAVE(to)) - goto out; + return -EINVAL; /* From should not be private */ if (!IS_MNT_SHARED(from) && !IS_MNT_SLAVE(from)) - goto out; + return -EINVAL; if (IS_MNT_SLAVE(from)) { hlist_add_behind(&to->mnt_slave, &from->mnt_slave); @@ -3401,11 +3398,7 @@ static int do_set_group(struct path *from_path, struct path *to_path) list_add(&to->mnt_share, &from->mnt_share); set_mnt_shared(to); } - - err = 0; -out: - namespace_unlock(); - return err; + return 0; } /** -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* Re: [PATCH 07/52] do_set_group(): use guards 2025-08-25 4:43 ` [PATCH 07/52] do_set_group(): " Al Viro @ 2025-08-25 12:35 ` Christian Brauner 0 siblings, 0 replies; 321+ messages in thread From: Christian Brauner @ 2025-08-25 12:35 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, jack, torvalds On Mon, Aug 25, 2025 at 05:43:10AM +0100, Al Viro wrote: > clean fit; namespace_excl to modify propagation graph > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > --- Reviewed-by: Christian Brauner <brauner@kernel.org> ^ permalink raw reply [flat|nested] 321+ messages in thread
* [PATCH 08/52] mark_mounts_for_expiry(): use guards 2025-08-25 4:43 ` [PATCH 01/52] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (5 preceding siblings ...) 2025-08-25 4:43 ` [PATCH 07/52] do_set_group(): " Al Viro @ 2025-08-25 4:43 ` Al Viro 2025-08-25 12:37 ` Christian Brauner 2025-08-25 4:43 ` [PATCH 09/52] put_mnt_ns(): " Al Viro ` (44 subsequent siblings) 51 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-25 4:43 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Clean fit; guards can't be weaker due to umount_tree() calls. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 13e2f3837a26..898a6b7307e4 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3886,8 +3886,8 @@ void mark_mounts_for_expiry(struct list_head *mounts) if (list_empty(mounts)) return; - namespace_lock(); - lock_mount_hash(); + guard(namespace_excl)(); + guard(mount_writer)(); /* extract from the expiration list every vfsmount that matches the * following criteria: @@ -3909,8 +3909,6 @@ void mark_mounts_for_expiry(struct list_head *mounts) touch_mnt_namespace(mnt->mnt_ns); umount_tree(mnt, UMOUNT_PROPAGATE|UMOUNT_SYNC); } - unlock_mount_hash(); - namespace_unlock(); } EXPORT_SYMBOL_GPL(mark_mounts_for_expiry); -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* Re: [PATCH 08/52] mark_mounts_for_expiry(): use guards 2025-08-25 4:43 ` [PATCH 08/52] mark_mounts_for_expiry(): " Al Viro @ 2025-08-25 12:37 ` Christian Brauner 0 siblings, 0 replies; 321+ messages in thread From: Christian Brauner @ 2025-08-25 12:37 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, jack, torvalds On Mon, Aug 25, 2025 at 05:43:11AM +0100, Al Viro wrote: > Clean fit; guards can't be weaker due to umount_tree() calls. > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > --- Reviewed-by: Christian Brauner <brauner@kernel.org> ^ permalink raw reply [flat|nested] 321+ messages in thread
* [PATCH 09/52] put_mnt_ns(): use guards 2025-08-25 4:43 ` [PATCH 01/52] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (6 preceding siblings ...) 2025-08-25 4:43 ` [PATCH 08/52] mark_mounts_for_expiry(): " Al Viro @ 2025-08-25 4:43 ` Al Viro 2025-08-25 12:37 ` Christian Brauner 2025-08-25 12:40 ` Christian Brauner 2025-08-25 4:43 ` [PATCH 10/52] mnt_already_visible(): " Al Viro ` (43 subsequent siblings) 51 siblings, 2 replies; 321+ messages in thread From: Al Viro @ 2025-08-25 4:43 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds clean fit; guards can't be weaker due to umount_tree() call. Setting emptied_ns requires namespace_excl, but not anything mount_lock-related. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 898a6b7307e4..86a86be2b0ef 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -6153,12 +6153,10 @@ void put_mnt_ns(struct mnt_namespace *ns) { if (!refcount_dec_and_test(&ns->ns.count)) return; - namespace_lock(); + guard(namespace_excl)(); emptied_ns = ns; - lock_mount_hash(); + guard(mount_writer)(); umount_tree(ns->root, 0); - unlock_mount_hash(); - namespace_unlock(); } struct vfsmount *kern_mount(struct file_system_type *type) -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* Re: [PATCH 09/52] put_mnt_ns(): use guards 2025-08-25 4:43 ` [PATCH 09/52] put_mnt_ns(): " Al Viro @ 2025-08-25 12:37 ` Christian Brauner 2025-08-25 12:40 ` Christian Brauner 1 sibling, 0 replies; 321+ messages in thread From: Christian Brauner @ 2025-08-25 12:37 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, jack, torvalds On Mon, Aug 25, 2025 at 05:43:12AM +0100, Al Viro wrote: > clean fit; guards can't be weaker due to umount_tree() call. > Setting emptied_ns requires namespace_excl, but not anything > mount_lock-related. > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > --- Reviewed-by: Christian Brauner <brauner@kernel.org> ^ permalink raw reply [flat|nested] 321+ messages in thread
* Re: [PATCH 09/52] put_mnt_ns(): use guards 2025-08-25 4:43 ` [PATCH 09/52] put_mnt_ns(): " Al Viro 2025-08-25 12:37 ` Christian Brauner @ 2025-08-25 12:40 ` Christian Brauner 2025-08-25 16:21 ` Al Viro 1 sibling, 1 reply; 321+ messages in thread From: Christian Brauner @ 2025-08-25 12:40 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, jack, torvalds On Mon, Aug 25, 2025 at 05:43:12AM +0100, Al Viro wrote: > clean fit; guards can't be weaker due to umount_tree() call. > Setting emptied_ns requires namespace_excl, but not anything > mount_lock-related. > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > --- > fs/namespace.c | 6 ++---- > 1 file changed, 2 insertions(+), 4 deletions(-) > > diff --git a/fs/namespace.c b/fs/namespace.c > index 898a6b7307e4..86a86be2b0ef 100644 > --- a/fs/namespace.c > +++ b/fs/namespace.c > @@ -6153,12 +6153,10 @@ void put_mnt_ns(struct mnt_namespace *ns) > { > if (!refcount_dec_and_test(&ns->ns.count)) > return; > - namespace_lock(); > + guard(namespace_excl)(); > emptied_ns = ns; Another thing, did I miss commit aab771f34e63ef89e195b63d121abcb55eebfde6 Author: Al Viro <viro@zeniv.linux.org.uk> AuthorDate: Wed Jun 18 18:23:41 2025 -0400 Commit: Al Viro <viro@zeniv.linux.org.uk> CommitDate: Sun Jun 29 19:03:46 2025 -0400 take freeing of emptied mnt_namespace to namespace_unlock() on the list somehow? I just saw that "emptied_ns" thing for the first time and was very confused where that came from. I don't see any lore link attached to the commit message. ^ permalink raw reply [flat|nested] 321+ messages in thread
* Re: [PATCH 09/52] put_mnt_ns(): use guards 2025-08-25 12:40 ` Christian Brauner @ 2025-08-25 16:21 ` Al Viro 0 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-08-25 16:21 UTC (permalink / raw) To: Christian Brauner; +Cc: linux-fsdevel, jack, torvalds On Mon, Aug 25, 2025 at 02:40:53PM +0200, Christian Brauner wrote: > Another thing, did I miss > > commit aab771f34e63ef89e195b63d121abcb55eebfde6 > Author: Al Viro <viro@zeniv.linux.org.uk> > AuthorDate: Wed Jun 18 18:23:41 2025 -0400 > Commit: Al Viro <viro@zeniv.linux.org.uk> > CommitDate: Sun Jun 29 19:03:46 2025 -0400 > > take freeing of emptied mnt_namespace to namespace_unlock() > > on the list somehow? I just saw that "emptied_ns" thing for the first > time and was very confused where that came from. I don't see any lore > link attached to the commit message. https://lore.kernel.org/all/20250623045428.1271612-35-viro@zeniv.linux.org.uk/ and https://lore.kernel.org/all/20250630025255.1387419-45-viro@zeniv.linux.org.uk/ in the next iteration of the same patchset, both Cc'd to you. As for the reasons, there are nasty hidden constraints caused by mount notifications; even though all mounts are out of that namespace, we can't free it until the calls of mnt_notify(), which come from notify_mnt_list(), from namespace_unlock(). Better handle it that way than have a recurring headache; besides, it helps with cleaning post-unlock_mount() stuff. ^ permalink raw reply [flat|nested] 321+ messages in thread
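To make the ordering in the put_mnt_ns() conversion easier to see, here is a small userspace sketch; it assumes GCC or Clang for __attribute__((cleanup)), and all fake_* names are hypothetical. It models two things only: guards declared in one scope are released in reverse order of declaration, and the emptied namespace is disposed of by the unlock helper rather than by the caller. The real namespace_unlock() does considerably more (notifications among other things); none of that is modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Records the order of events as a short string. */
static char fake_trace[32];
static size_t fake_trace_len;

static void fake_note(char c)
{
	if (fake_trace_len + 1 < sizeof(fake_trace))
		fake_trace[fake_trace_len++] = c;
}

static int fake_trace_is(const char *expected)
{
	return strcmp(fake_trace, expected) == 0;
}

/* Hypothetical stand-in for struct mnt_namespace. */
struct fake_ns {
	int freed;
};

/* Plays the role of emptied_ns: set under the lock, consumed at unlock. */
static struct fake_ns *fake_emptied_ns;

struct fake_guard {
	int unused;
};

static struct fake_guard fake_ns_lock(void)
{
	fake_note('N');
	return (struct fake_guard){ 0 };
}

static void fake_ns_unlock(struct fake_guard *g)
{
	struct fake_ns *ns = fake_emptied_ns;

	(void)g;
	fake_emptied_ns = NULL;
	fake_note('n');
	if (ns) {			/* deferred disposal, done by unlock */
		ns->freed = 1;
		fake_note('f');
	}
}

static struct fake_guard fake_mount_lock(void)
{
	fake_note('M');
	return (struct fake_guard){ 0 };
}

static void fake_mount_unlock(struct fake_guard *g)
{
	(void)g;
	fake_note('m');
}

static void fake_put_ns(struct fake_ns *ns)
{
	struct fake_guard nsg
		__attribute__((cleanup(fake_ns_unlock), unused)) = fake_ns_lock();

	fake_emptied_ns = ns;

	struct fake_guard mg
		__attribute__((cleanup(fake_mount_unlock), unused)) = fake_mount_lock();

	fake_note('u');			/* stands in for umount_tree() */
}
```

Calling fake_put_ns() once leaves "NMumnf" in the trace: the inner guard goes first, the outer one last, and the namespace is marked freed from within the outer unlock.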
* [PATCH 10/52] mnt_already_visible(): use guards 2025-08-25 4:43 ` [PATCH 01/52] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (7 preceding siblings ...) 2025-08-25 4:43 ` [PATCH 09/52] put_mnt_ns(): " Al Viro @ 2025-08-25 4:43 ` Al Viro 2025-08-25 12:39 ` Christian Brauner 2025-08-25 4:43 ` [PATCH 11/52] check_for_nsfs_mounts(): no need to take locks Al Viro ` (42 subsequent siblings) 51 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-25 4:43 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds clean fit; namespace_shared due to iterating through ns->mounts. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 10 +++------- 1 file changed, 3 insertions(+), 7 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 86a86be2b0ef..a5d37b97088f 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -6232,9 +6232,8 @@ static bool mnt_already_visible(struct mnt_namespace *ns, { int new_flags = *new_mnt_flags; struct mount *mnt, *n; - bool visible = false; - down_read(&namespace_sem); + guard(namespace_shared)(); rbtree_postorder_for_each_entry_safe(mnt, n, &ns->mounts, mnt_node) { struct mount *child; int mnt_flags; @@ -6281,13 +6280,10 @@ static bool mnt_already_visible(struct mnt_namespace *ns, /* Preserve the locked attributes */ *new_mnt_flags |= mnt_flags & (MNT_LOCK_READONLY | \ MNT_LOCK_ATIME); - visible = true; - goto found; + return true; next: ; } -found: - up_read(&namespace_sem); - return visible; + return false; } static bool mount_too_revealing(const struct super_block *sb, int *new_mnt_flags) -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* Re: [PATCH 10/52] mnt_already_visible(): use guards 2025-08-25 4:43 ` [PATCH 10/52] mnt_already_visible(): " Al Viro @ 2025-08-25 12:39 ` Christian Brauner 0 siblings, 0 replies; 321+ messages in thread From: Christian Brauner @ 2025-08-25 12:39 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, jack, torvalds On Mon, Aug 25, 2025 at 05:43:13AM +0100, Al Viro wrote: > clean fit; namespace_shared due to iterating through ns->mounts. > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > --- Reviewed-by: Christian Brauner <brauner@kernel.org> ^ permalink raw reply [flat|nested] 321+ messages in thread
* [PATCH 11/52] check_for_nsfs_mounts(): no need to take locks 2025-08-25 4:43 ` [PATCH 01/52] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (8 preceding siblings ...) 2025-08-25 4:43 ` [PATCH 10/52] mnt_already_visible(): " Al Viro @ 2025-08-25 4:43 ` Al Viro 2025-08-25 12:48 ` Christian Brauner 2025-08-25 4:43 ` [PATCH 12/52] propagate_mnt(): use scoped_guard(mount_locked_reader) for mnt_set_mountpoint() Al Viro ` (41 subsequent siblings) 51 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-25 4:43 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Currently we are taking mount_writer; what that function needs is either mount_locked_reader (we are not changing anything, we just want to iterate through the subtree) or namespace_shared and a reference held by caller on the root of subtree - that's also enough to stabilize the topology. The thing is, all callers are already holding at least namespace_shared as well as a reference to the root of subtree. Let's make the callers provide locking warranties - don't mess with mount_lock in check_for_nsfs_mounts() itself and document the locking requirements. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 16 +++++----------- 1 file changed, 5 insertions(+), 11 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index a5d37b97088f..59948cbf9c47 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2402,21 +2402,15 @@ bool has_locked_children(struct mount *mnt, struct dentry *dentry) * specified subtree. Such references can act as pins for mount namespaces * that aren't checked by the mount-cycle checking code, thereby allowing * cycles to be made. 
+ * + * locks: mount_locked_reader || namespace_shared && pinned(subtree) */ static bool check_for_nsfs_mounts(struct mount *subtree) { - struct mount *p; - bool ret = false; - - lock_mount_hash(); - for (p = subtree; p; p = next_mnt(p, subtree)) + for (struct mount *p = subtree; p; p = next_mnt(p, subtree)) if (mnt_ns_loop(p->mnt.mnt_root)) - goto out; - - ret = true; -out: - unlock_mount_hash(); - return ret; + return false; + return true; } /** -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
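The loop in check_for_nsfs_mounts() is a preorder walk of a subtree that stops at the first offending mount. A simplified userspace sketch of such a walk follows; the fake_* names are hypothetical, the child/sibling pointers replace the kernel's list_head-based layout, and pins_ns merely stands in for whatever mnt_ns_loop() would report. Locking is deliberately absent, mirroring the patch's decision to leave it to the callers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced struct mount with explicit tree links. */
struct fake_mount {
	struct fake_mount *parent;
	struct fake_mount *first_child;
	struct fake_mount *next_sibling;
	bool pins_ns;		/* stand-in for mnt_ns_loop() being true */
};

/* Preorder successor of p inside the subtree rooted at root, or NULL. */
static struct fake_mount *fake_next_mnt(struct fake_mount *p,
					struct fake_mount *root)
{
	if (p->first_child)
		return p->first_child;
	while (p != root) {
		if (p->next_sibling)
			return p->next_sibling;
		p = p->parent;	/* climb until a sibling is found */
	}
	return NULL;		/* back at the root: walk is over */
}

/* True if no mount in the subtree pins a namespace. */
static bool fake_check_subtree(struct fake_mount *subtree)
{
	for (struct fake_mount *p = subtree; p; p = fake_next_mnt(p, subtree))
		if (p->pins_ns)
			return false;
	return true;
}
```

The walk never leaves the subtree it was given, so a flagged mount elsewhere in the larger tree does not affect the result.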
* Re: [PATCH 11/52] check_for_nsfs_mounts(): no need to take locks 2025-08-25 4:43 ` [PATCH 11/52] check_for_nsfs_mounts(): no need to take locks Al Viro @ 2025-08-25 12:48 ` Christian Brauner 0 siblings, 0 replies; 321+ messages in thread From: Christian Brauner @ 2025-08-25 12:48 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, jack, torvalds On Mon, Aug 25, 2025 at 05:43:14AM +0100, Al Viro wrote: > Currently we are taking mount_writer; what that function needs is > either mount_locked_reader (we are not changing anything, we just > want to iterate through the subtree) or namespace_shared and > a reference held by caller on the root of subtree - that's also > enough to stabilize the topology. > > The thing is, all callers are already holding at least namespace_shared > as well as a reference to the root of subtree. > > Let's make the callers provide locking warranties - don't mess with > mount_lock in check_for_nsfs_mounts() itself and document the locking > requirements. > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > --- Reviewed-by: Christian Brauner <brauner@kernel.org> ^ permalink raw reply [flat|nested] 321+ messages in thread
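For readers following along outside the kernel tree: after this patch, check_for_nsfs_mounts() is a plain pre-order walk of the subtree with an early return, and the caller is responsible for keeping the topology stable. Below is a minimal userspace sketch of that shape. It assumes a toy `struct node` with parent/first_child/next_sibling pointers; `next_node()` and `subtree_is_clean()` are hypothetical names for illustration, not the kernel's next_mnt() or struct mount.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy tree node; hypothetical stand-in, not the kernel's struct mount. */
struct node {
	struct node *parent, *first_child, *next_sibling;
	bool bad;	/* plays the role of mnt_ns_loop() returning true */
};

/* Pre-order successor of p within the subtree rooted at root, or NULL. */
static struct node *next_node(struct node *p, struct node *root)
{
	if (p->first_child)
		return p->first_child;
	while (p != root) {
		if (p->next_sibling)
			return p->next_sibling;
		p = p->parent;
	}
	return NULL;
}

/* Caller is assumed to keep the tree stable for the duration of the walk. */
static bool subtree_is_clean(struct node *subtree)
{
	for (struct node *p = subtree; p; p = next_node(p, subtree))
		if (p->bad)
			return false;
	return true;
}
```

With no lock taken inside the function there is nothing to undo on the way out, so the goto/out label of the old version has no reason to exist.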
* [PATCH 12/52] propagate_mnt(): use scoped_guard(mount_locked_reader) for mnt_set_mountpoint() 2025-08-25 4:43 ` [PATCH 01/52] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (9 preceding siblings ...) 2025-08-25 4:43 ` [PATCH 11/52] check_for_nsfs_mounts(): no need to take locks Al Viro @ 2025-08-25 4:43 ` Al Viro 2025-08-25 12:49 ` Christian Brauner 2025-08-25 4:43 ` [PATCH 13/52] has_locked_children(): use guards Al Viro ` (40 subsequent siblings) 51 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-25 4:43 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/pnode.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/fs/pnode.c b/fs/pnode.c index 6f7d02f3fa98..0702d45d856d 100644 --- a/fs/pnode.c +++ b/fs/pnode.c @@ -304,9 +304,8 @@ int propagate_mnt(struct mount *dest_mnt, struct mountpoint *dest_mp, err = PTR_ERR(this); break; } - read_seqlock_excl(&mount_lock); - mnt_set_mountpoint(n, dest_mp, this); - read_sequnlock_excl(&mount_lock); + scoped_guard(mount_locked_reader) + mnt_set_mountpoint(n, dest_mp, this); if (n->mnt_master) SET_MNT_MARK(n->mnt_master); copy = this; -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* Re: [PATCH 12/52] propagate_mnt(): use scoped_guard(mount_locked_reader) for mnt_set_mountpoint() 2025-08-25 4:43 ` [PATCH 12/52] propagate_mnt(): use scoped_guard(mount_locked_reader) for mnt_set_mountpoint() Al Viro @ 2025-08-25 12:49 ` Christian Brauner 0 siblings, 0 replies; 321+ messages in thread From: Christian Brauner @ 2025-08-25 12:49 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, jack, torvalds On Mon, Aug 25, 2025 at 05:43:15AM +0100, Al Viro wrote: > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > --- Reviewed-by: Christian Brauner <brauner@kernel.org> ^ permalink raw reply [flat|nested] 321+ messages in thread
* [PATCH 13/52] has_locked_children(): use guards 2025-08-25 4:43 ` [PATCH 01/52] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (10 preceding siblings ...) 2025-08-25 4:43 ` [PATCH 12/52] propagate_mnt(): use scoped_guard(mount_locked_reader) for mnt_set_mountpoint() Al Viro @ 2025-08-25 4:43 ` Al Viro 2025-08-25 11:54 ` Linus Torvalds 2025-08-25 12:49 ` Christian Brauner 2025-08-25 4:43 ` [PATCH 14/52] mnt_set_expiry(): " Al Viro ` (39 subsequent siblings) 51 siblings, 2 replies; 321+ messages in thread From: Al Viro @ 2025-08-25 4:43 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds ... and document the locking requirements of __has_locked_children() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 59948cbf9c47..eabb0d996c6a 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2373,6 +2373,7 @@ void dissolve_on_fput(struct vfsmount *mnt) } } +/* locks: namespace_shared && pinned(mnt) || mount_locked_reader */ static bool __has_locked_children(struct mount *mnt, struct dentry *dentry) { struct mount *child; @@ -2389,12 +2390,8 @@ static bool __has_locked_children(struct mount *mnt, struct dentry *dentry) bool has_locked_children(struct mount *mnt, struct dentry *dentry) { - bool res; - - read_seqlock_excl(&mount_lock); - res = __has_locked_children(mnt, dentry); - read_sequnlock_excl(&mount_lock); - return res; + scoped_guard(mount_locked_reader) + return __has_locked_children(mnt, dentry); } /* -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* Re: [PATCH 13/52] has_locked_children(): use guards 2025-08-25 4:43 ` [PATCH 13/52] has_locked_children(): use guards Al Viro @ 2025-08-25 11:54 ` Linus Torvalds 2025-08-25 17:33 ` Al Viro 2025-08-25 12:49 ` Christian Brauner 1 sibling, 1 reply; 321+ messages in thread From: Linus Torvalds @ 2025-08-25 11:54 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, brauner, jack [ diff edited to be just the end result ] On Mon, 25 Aug 2025 at 00:44, Al Viro <viro@zeniv.linux.org.uk> wrote: > > bool has_locked_children(struct mount *mnt, struct dentry *dentry) > { > + scoped_guard(mount_locked_reader) > + return __has_locked_children(mnt, dentry); > } So the use of scoped_guard() looks a bit odd to me. Why create a new scope for when the existing scope is identical? It would seem to be more straightforward to just do guard(mount_locked_reader); return __has_locked_children(mnt, dentry); instead. Was there some code generation issue or other thing that made you go the 'scoped' way? There was at least one other patch that did the same pattern (but I haven't gone through the whole series, maybe there are explanations later). Linus ^ permalink raw reply [flat|nested] 321+ messages in thread
* Re: [PATCH 13/52] has_locked_children(): use guards 2025-08-25 11:54 ` Linus Torvalds @ 2025-08-25 17:33 ` Al Viro 0 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-08-25 17:33 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-fsdevel, brauner, jack On Mon, Aug 25, 2025 at 07:54:45AM -0400, Linus Torvalds wrote: > [ diff edited to be just the end result ] > > On Mon, 25 Aug 2025 at 00:44, Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > bool has_locked_children(struct mount *mnt, struct dentry *dentry) > > { > > + scoped_guard(mount_locked_reader) > > + return __has_locked_children(mnt, dentry); > > } > > So the use of scoped_guard() looks a bit odd to me. Why create a new > scope for when the existing scope is identical? It would seem to be > more straightforward to just do > > guard(mount_locked_reader); > return __has_locked_children(mnt, dentry); > > instead. Was there some code generation issue or other thing that made > you go the 'scoped' way? > > There was at least one other patch that did the same pattern (but I > haven't gone through the whole series, maybe there are explanations > later). TBH, the main reason is that my mental model for that is with_lock: lock -> m X -> m X pardon the pseudo-Haskell. IOW, "wrap that sequence of operations into this lock". Oh, well - I can live with open-ended scope in a function that small and that unlikely to grow more stuff in it... ^ permalink raw reply [flat|nested] 321+ messages in thread
* Re: [PATCH 13/52] has_locked_children(): use guards 2025-08-25 4:43 ` [PATCH 13/52] has_locked_children(): use guards Al Viro 2025-08-25 11:54 ` Linus Torvalds @ 2025-08-25 12:49 ` Christian Brauner 1 sibling, 0 replies; 321+ messages in thread From: Christian Brauner @ 2025-08-25 12:49 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, jack, torvalds On Mon, Aug 25, 2025 at 05:43:16AM +0100, Al Viro wrote: > ... and document the locking requirements of __has_locked_children() > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > --- > fs/namespace.c | 9 +++------ > 1 file changed, 3 insertions(+), 6 deletions(-) > > diff --git a/fs/namespace.c b/fs/namespace.c > index 59948cbf9c47..eabb0d996c6a 100644 > --- a/fs/namespace.c > +++ b/fs/namespace.c > @@ -2373,6 +2373,7 @@ void dissolve_on_fput(struct vfsmount *mnt) > } > } > > +/* locks: namespace_shared && pinned(mnt) || mount_locked_reader */ > static bool __has_locked_children(struct mount *mnt, struct dentry *dentry) > { > struct mount *child; > @@ -2389,12 +2390,8 @@ static bool __has_locked_children(struct mount *mnt, struct dentry *dentry) > > bool has_locked_children(struct mount *mnt, struct dentry *dentry) > { > - bool res; > - > - read_seqlock_excl(&mount_lock); > - res = __has_locked_children(mnt, dentry); > - read_sequnlock_excl(&mount_lock); > - return res; > + scoped_guard(mount_locked_reader) > + return __has_locked_children(mnt, dentry); Agree with Linus, this should just use a plain guard(). ^ permalink raw reply [flat|nested] 321+ messages in thread
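The guard() versus scoped_guard() question above can be tried out in userspace. The sketch below is an approximation built on the GCC/Clang `cleanup` attribute, not the kernel's cleanup.h; `demo_lock`, `demo_guard()` and `demo_scoped_guard()` are hypothetical names, and the "lock" merely counts holders so that the example needs no threading library.

```c
#include <stddef.h>

/* Toy lock that only counts holders; hypothetical, not mount_lock. */
struct demo_lock { int held; };
static struct demo_lock demo_mount_lock;

static struct demo_lock *demo_acquire(struct demo_lock *l)
{
	l->held++;
	return l;
}

static void demo_release(struct demo_lock **l)
{
	(*l)->held--;
}

/* Open-ended: released when the enclosing scope ends (one per scope). */
#define demo_guard(l) \
	struct demo_lock *demo_g_ \
		__attribute__((cleanup(demo_release), unused)) = demo_acquire(l)

/* Scoped: released when the single statement or block that follows ends. */
#define demo_scoped_guard(l) \
	for (struct demo_lock *demo_s_ \
		__attribute__((cleanup(demo_release))) = demo_acquire(l), \
		*demo_once_ = demo_s_; demo_once_; demo_once_ = NULL)

static int held_with_guard(void)
{
	demo_guard(&demo_mount_lock);
	return demo_mount_lock.held;	/* read while the lock is held */
}

static int held_with_scoped_guard(void)
{
	demo_scoped_guard(&demo_mount_lock)
		return demo_mount_lock.held;
	return -1;	/* not reached in practice */
}
```

Both functions evaluate the return expression with the lock held and drop it on the way out. In a function this small the two forms behave the same, which is the point made in the review; the scoped form only pays off when the lock must cover part of a longer function.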
* [PATCH 14/52] mnt_set_expiry(): use guards 2025-08-25 4:43 ` [PATCH 01/52] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (11 preceding siblings ...) 2025-08-25 4:43 ` [PATCH 13/52] has_locked_children(): use guards Al Viro @ 2025-08-25 4:43 ` Al Viro 2025-08-25 12:51 ` Christian Brauner 2025-08-25 4:43 ` [PATCH 15/52] path_is_under(): " Al Viro ` (38 subsequent siblings) 51 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-25 4:43 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds The reason why it needs only mount_locked_reader is that there's no lockless accesses of expiry lists. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index eabb0d996c6a..acacfe767a7c 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3858,9 +3858,8 @@ int finish_automount(struct vfsmount *m, const struct path *path) */ void mnt_set_expiry(struct vfsmount *mnt, struct list_head *expiry_list) { - read_seqlock_excl(&mount_lock); - list_add_tail(&real_mount(mnt)->mnt_expire, expiry_list); - read_sequnlock_excl(&mount_lock); + scoped_guard(mount_locked_reader) + list_add_tail(&real_mount(mnt)->mnt_expire, expiry_list); } EXPORT_SYMBOL(mnt_set_expiry); -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* Re: [PATCH 14/52] mnt_set_expiry(): use guards 2025-08-25 4:43 ` [PATCH 14/52] mnt_set_expiry(): " Al Viro @ 2025-08-25 12:51 ` Christian Brauner 0 siblings, 0 replies; 321+ messages in thread From: Christian Brauner @ 2025-08-25 12:51 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, jack, torvalds On Mon, Aug 25, 2025 at 05:43:17AM +0100, Al Viro wrote: > The reason why it needs only mount_locked_reader is that there's no lockless > accesses of expiry lists. > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > --- > fs/namespace.c | 5 ++--- > 1 file changed, 2 insertions(+), 3 deletions(-) > > diff --git a/fs/namespace.c b/fs/namespace.c > index eabb0d996c6a..acacfe767a7c 100644 > --- a/fs/namespace.c > +++ b/fs/namespace.c > @@ -3858,9 +3858,8 @@ int finish_automount(struct vfsmount *m, const struct path *path) > */ > void mnt_set_expiry(struct vfsmount *mnt, struct list_head *expiry_list) > { > - read_seqlock_excl(&mount_lock); > - list_add_tail(&real_mount(mnt)->mnt_expire, expiry_list); > - read_sequnlock_excl(&mount_lock); > + scoped_guard(mount_locked_reader) > + list_add_tail(&real_mount(mnt)->mnt_expire, expiry_list); Should also just use a guard(). I don't think religiously sticking to scoped_guard() out of conceptual aversion to guard() buys us anything. A plain guard() is clearly cleaner to read in such short functions. ^ permalink raw reply [flat|nested] 321+ messages in thread
* [PATCH 15/52] path_is_under(): use guards 2025-08-25 4:43 ` [PATCH 01/52] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (12 preceding siblings ...) 2025-08-25 4:43 ` [PATCH 14/52] mnt_set_expiry(): " Al Viro @ 2025-08-25 4:43 ` Al Viro 2025-08-25 12:56 ` Christian Brauner 2025-08-25 4:43 ` [PATCH 16/52] current_chrooted(): don't bother with follow_down_one() Al Viro ` (37 subsequent siblings) 51 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-25 4:43 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds ... and document the locking requirements for is_path_reachable(). There is one questionable caller in do_listmount() where we are not holding mount_lock *and* might not have the first argument mounted. However, in that case it will immediately return true without having to look at the ancestors. Might be cleaner to move the check into non-LSMT_ROOT case which it really belongs in - there the check is not always true and is_mounted() is guaranteed. 
Document the locking environments for is_path_reachable() callers: get_peer_under_root() get_dominating_id() do_statmount() do_listmount() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 12 ++++++------ fs/pnode.c | 3 ++- 2 files changed, 8 insertions(+), 7 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index acacfe767a7c..bf9a3a644faa 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -4592,7 +4592,7 @@ SYSCALL_DEFINE5(move_mount, /* * Return true if path is reachable from root * - * namespace_sem or mount_lock is held + * locks: mount_locked_reader || namespace_shared && is_mounted(mnt) */ bool is_path_reachable(struct mount *mnt, struct dentry *dentry, const struct path *root) @@ -4606,11 +4606,9 @@ bool is_path_reachable(struct mount *mnt, struct dentry *dentry, bool path_is_under(const struct path *path1, const struct path *path2) { - bool res; - read_seqlock_excl(&mount_lock); - res = is_path_reachable(real_mount(path1->mnt), path1->dentry, path2); - read_sequnlock_excl(&mount_lock); - return res; + scoped_guard(mount_locked_reader) + return is_path_reachable(real_mount(path1->mnt), path1->dentry, + path2); } EXPORT_SYMBOL(path_is_under); @@ -5689,6 +5687,7 @@ static int grab_requested_root(struct mnt_namespace *ns, struct path *root) STATMOUNT_MNT_UIDMAP | \ STATMOUNT_MNT_GIDMAP) +/* locks: namespace_shared */ static int do_statmount(struct kstatmount *s, u64 mnt_id, u64 mnt_ns_id, struct mnt_namespace *ns) { @@ -5949,6 +5948,7 @@ SYSCALL_DEFINE4(statmount, const struct mnt_id_req __user *, req, return ret; } +/* locks: namespace_shared */ static ssize_t do_listmount(struct mnt_namespace *ns, u64 mnt_parent_id, u64 last_mnt_id, u64 *mnt_ids, size_t nr_mnt_ids, bool reverse) diff --git a/fs/pnode.c b/fs/pnode.c index 0702d45d856d..edaf9d9d0eaf 100644 --- a/fs/pnode.c +++ b/fs/pnode.c @@ -29,6 +29,7 @@ static inline struct mount *next_slave(struct mount *p) return hlist_entry(p->mnt_slave.next, struct mount, mnt_slave); } +/* 
locks: namespace_shared && is_mounted(mnt) */ static struct mount *get_peer_under_root(struct mount *mnt, struct mnt_namespace *ns, const struct path *root) @@ -50,7 +51,7 @@ static struct mount *get_peer_under_root(struct mount *mnt, * Get ID of closest dominating peer group having a representative * under the given root. * - * Caller must hold namespace_sem + * locks: namespace_shared */ int get_dominating_id(struct mount *mnt, const struct path *root) { -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* Re: [PATCH 15/52] path_is_under(): use guards 2025-08-25 4:43 ` [PATCH 15/52] path_is_under(): " Al Viro @ 2025-08-25 12:56 ` Christian Brauner 0 siblings, 0 replies; 321+ messages in thread From: Christian Brauner @ 2025-08-25 12:56 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, jack, torvalds On Mon, Aug 25, 2025 at 05:43:18AM +0100, Al Viro wrote: > ... and document that locking requirements for is_path_reachable(). > There is one questionable caller in do_listmount() where we are not > holding mount_lock *and* might not have the first argument mounted. > However, in that case it will immediately return true without having > to look at the ancestors. Might be cleaner to move the check into > non-LSTM_ROOT case which it really belongs in - there the check is > not always true and is_mounted() is guaranteed. > > Document the locking environments for is_path_reachable() callers: > get_peer_under_root() > get_dominating_id() > do_statmount() > do_listmount() > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > --- Reviewed-by: Christian Brauner <brauner@kernel.org> > fs/namespace.c | 12 ++++++------ > fs/pnode.c | 3 ++- > 2 files changed, 8 insertions(+), 7 deletions(-) > > diff --git a/fs/namespace.c b/fs/namespace.c > index acacfe767a7c..bf9a3a644faa 100644 > --- a/fs/namespace.c > +++ b/fs/namespace.c > @@ -4592,7 +4592,7 @@ SYSCALL_DEFINE5(move_mount, > /* > * Return true if path is reachable from root > * > - * namespace_sem or mount_lock is held > + * locks: mount_locked_reader || namespace_shared && is_mounted(mnt) > */ > bool is_path_reachable(struct mount *mnt, struct dentry *dentry, > const struct path *root) > @@ -4606,11 +4606,9 @@ bool is_path_reachable(struct mount *mnt, struct dentry *dentry, > > bool path_is_under(const struct path *path1, const struct path *path2) > { > - bool res; > - read_seqlock_excl(&mount_lock); > - res = is_path_reachable(real_mount(path1->mnt), path1->dentry, path2); > - read_sequnlock_excl(&mount_lock); > - 
return res; > + scoped_guard(mount_locked_reader) > + return is_path_reachable(real_mount(path1->mnt), path1->dentry, > + path2); Same thing, no need for this scoped guard eyesore. ^ permalink raw reply [flat|nested] 321+ messages in thread
* [PATCH 16/52] current_chrooted(): don't bother with follow_down_one() 2025-08-25 4:43 ` [PATCH 01/52] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (13 preceding siblings ...) 2025-08-25 4:43 ` [PATCH 15/52] path_is_under(): " Al Viro @ 2025-08-25 4:43 ` Al Viro 2025-08-25 12:57 ` Christian Brauner 2025-08-25 4:43 ` [PATCH 17/52] current_chrooted(): use guards Al Viro ` (36 subsequent siblings) 51 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-25 4:43 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds All we need here is to follow ->overmount on root mount of namespace... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 18 ++++++++---------- 1 file changed, 8 insertions(+), 10 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index bf9a3a644faa..107da30b408c 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -6195,24 +6195,22 @@ bool our_mnt(struct vfsmount *mnt) bool current_chrooted(void) { /* Does the current process have a non-standard root */ - struct path ns_root; + struct mount *root = current->nsproxy->mnt_ns->root; struct path fs_root; bool chrooted; + get_fs_root(current->fs, &fs_root); + /* Find the namespace root */ - ns_root.mnt = &current->nsproxy->mnt_ns->root->mnt; - ns_root.dentry = ns_root.mnt->mnt_root; - path_get(&ns_root); - while (d_mountpoint(ns_root.dentry) && follow_down_one(&ns_root)) - ; + read_seqlock_excl(&mount_lock); - get_fs_root(current->fs, &fs_root); + while (unlikely(root->overmount)) + root = root->overmount; - chrooted = !path_equal(&fs_root, &ns_root); + chrooted = fs_root.mnt != &root->mnt || !path_mounted(&fs_root); + read_sequnlock_excl(&mount_lock); path_put(&fs_root); - path_put(&ns_root); - return chrooted; } -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* Re: [PATCH 16/52] current_chrooted(): don't bother with follow_down_one() 2025-08-25 4:43 ` [PATCH 16/52] current_chrooted(): don't bother with follow_down_one() Al Viro @ 2025-08-25 12:57 ` Christian Brauner 0 siblings, 0 replies; 321+ messages in thread From: Christian Brauner @ 2025-08-25 12:57 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, jack, torvalds On Mon, Aug 25, 2025 at 05:43:19AM +0100, Al Viro wrote: > All we need here is to follow ->overmount on root mount of namespace... > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > --- Reviewed-by: Christian Brauner <brauner@kernel.org> ^ permalink raw reply [flat|nested] 321+ messages in thread
* [PATCH 17/52] current_chrooted(): use guards 2025-08-25 4:43 ` [PATCH 01/52] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (14 preceding siblings ...) 2025-08-25 4:43 ` [PATCH 16/52] current_chrooted(): don't bother with follow_down_one() Al Viro @ 2025-08-25 4:43 ` Al Viro 2025-08-25 12:57 ` Christian Brauner 2025-08-25 4:43 ` [PATCH 18/52] do_move_mount(): trim local variables Al Viro ` (35 subsequent siblings) 51 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-25 4:43 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds here a use of __free(path_put) for dropping fs_root is enough to make guard(mount_locked_reader) fit... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 15 ++++++--------- 1 file changed, 6 insertions(+), 9 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 107da30b408c..a8b586e635d8 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -6195,23 +6195,20 @@ bool our_mnt(struct vfsmount *mnt) bool current_chrooted(void) { /* Does the current process have a non-standard root */ - struct mount *root = current->nsproxy->mnt_ns->root; - struct path fs_root; - bool chrooted; + struct path fs_root __free(path_put) = {}; + struct mount *root; get_fs_root(current->fs, &fs_root); /* Find the namespace root */ - read_seqlock_excl(&mount_lock); + guard(mount_locked_reader)(); + + root = current->nsproxy->mnt_ns->root; while (unlikely(root->overmount)) root = root->overmount; - chrooted = fs_root.mnt != &root->mnt || !path_mounted(&fs_root); - - read_sequnlock_excl(&mount_lock); - path_put(&fs_root); - return chrooted; + return fs_root.mnt != &root->mnt || !path_mounted(&fs_root); } static bool mnt_already_visible(struct mnt_namespace *ns, -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* Re: [PATCH 17/52] current_chrooted(): use guards 2025-08-25 4:43 ` [PATCH 17/52] current_chrooted(): use guards Al Viro @ 2025-08-25 12:57 ` Christian Brauner 0 siblings, 0 replies; 321+ messages in thread From: Christian Brauner @ 2025-08-25 12:57 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, jack, torvalds On Mon, Aug 25, 2025 at 05:43:20AM +0100, Al Viro wrote: > here a use of __free(path_put) for dropping fs_root is enough to > make guard(mount_locked_reader) fit... > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > --- Reviewed-by: Christian Brauner <brauner@kernel.org> ^ permalink raw reply [flat|nested] 321+ messages in thread
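The reason __free(path_put) makes the guard fit is declaration order: cleanups run in reverse order of declaration, so a reference declared before the guard is dropped only after the lock has been released. The following userspace sketch illustrates that ordering with the GCC/Clang `cleanup` attribute. It is an illustration under stated assumptions, not kernel code; `demo_ref`, `demo_put()`, `demo_unlock()` and `demo_check()` are hypothetical, and the "lock" is just a token whose cleanup is logged.

```c
#include <stddef.h>
#include <string.h>

static char demo_log[64];	/* order of events, for inspection */

static void demo_note(const char *s)
{
	strcat(demo_log, s);
}

/* Hypothetical refcounted object standing in for a struct path. */
struct demo_ref { int count; };

static void demo_put(struct demo_ref **r)
{
	if (*r) {
		(*r)->count--;
		demo_note("put;");
	}
}

static void demo_unlock(int *token)
{
	(void)token;
	demo_note("unlock;");
}

static int demo_check(struct demo_ref *obj)
{
	/* Declared first, so cleaned up last: put runs after unlock. */
	struct demo_ref *ref __attribute__((cleanup(demo_put))) = NULL;

	obj->count++;
	ref = obj;
	demo_note("get;");

	/* Declared second, so cleaned up first. */
	int token __attribute__((cleanup(demo_unlock), unused)) = 0;
	demo_note("lock;");

	return ref->count;	/* evaluated before either cleanup runs */
}
```

After one call the log reads get, lock, unlock, put, in that order: the final put never happens inside the locked region, which is the property the series wants for anything that may block.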
* [PATCH 18/52] do_move_mount(): trim local variables 2025-08-25 4:43 ` [PATCH 01/52] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (15 preceding siblings ...) 2025-08-25 4:43 ` [PATCH 17/52] current_chrooted(): use guards Al Viro @ 2025-08-25 4:43 ` Al Viro 2025-08-25 12:57 ` Christian Brauner 2025-08-25 4:43 ` [PATCH 19/52] do_move_mount(): deal with the checks on old_path early Al Viro ` (34 subsequent siblings) 51 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-25 4:43 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Both 'parent' and 'ns' are used at most once, no point precalculating those... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 12 ++++-------- 1 file changed, 4 insertions(+), 8 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index a8b586e635d8..1a076aac5d73 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3564,10 +3564,8 @@ static inline bool may_use_mount(struct mount *mnt) static int do_move_mount(struct path *old_path, struct path *new_path, enum mnt_tree_flags_t flags) { - struct mnt_namespace *ns; struct mount *p; struct mount *old; - struct mount *parent; struct pinned_mountpoint mp; int err; bool beneath = flags & MNT_TREE_BENEATH; @@ -3578,8 +3576,6 @@ static int do_move_mount(struct path *old_path, old = real_mount(old_path->mnt); p = real_mount(new_path->mnt); - parent = old->mnt_parent; - ns = old->mnt_ns; err = -EINVAL; @@ -3588,12 +3584,12 @@ static int do_move_mount(struct path *old_path, /* ... it should be detachable from parent */ if (!mnt_has_parent(old) || IS_MNT_LOCKED(old)) goto out; + /* ... which should not be shared */ + if (IS_MNT_SHARED(old->mnt_parent)) + goto out; /* ... and the target should be in our namespace */ if (!check_mnt(p)) goto out; - /* parent of the source should not be shared */ - if (IS_MNT_SHARED(parent)) - goto out; } else { /* * otherwise the source must be the root of some anon namespace. 
@@ -3605,7 +3601,7 @@ static int do_move_mount(struct path *old_path, * subsequent checks would've rejected that, but they lose * some corner cases if we check it early. */ - if (ns == p->mnt_ns) + if (old->mnt_ns == p->mnt_ns) goto out; /* * Target should be either in our namespace or in an acceptable -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* Re: [PATCH 18/52] do_move_mount(): trim local variables 2025-08-25 4:43 ` [PATCH 18/52] do_move_mount(): trim local variables Al Viro @ 2025-08-25 12:57 ` Christian Brauner 0 siblings, 0 replies; 321+ messages in thread From: Christian Brauner @ 2025-08-25 12:57 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, jack, torvalds On Mon, Aug 25, 2025 at 05:43:21AM +0100, Al Viro wrote: > Both 'parent' and 'ns' are used at most once, no point precalculating those... > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > --- Reviewed-by: Christian Brauner <brauner@kernel.org> > fs/namespace.c | 12 ++++-------- > 1 file changed, 4 insertions(+), 8 deletions(-) > > diff --git a/fs/namespace.c b/fs/namespace.c > index a8b586e635d8..1a076aac5d73 100644 > --- a/fs/namespace.c > +++ b/fs/namespace.c > @@ -3564,10 +3564,8 @@ static inline bool may_use_mount(struct mount *mnt) > static int do_move_mount(struct path *old_path, > struct path *new_path, enum mnt_tree_flags_t flags) > { > - struct mnt_namespace *ns; > struct mount *p; > struct mount *old; > - struct mount *parent; > struct pinned_mountpoint mp; > int err; > bool beneath = flags & MNT_TREE_BENEATH; > @@ -3578,8 +3576,6 @@ static int do_move_mount(struct path *old_path, > > old = real_mount(old_path->mnt); > p = real_mount(new_path->mnt); > - parent = old->mnt_parent; > - ns = old->mnt_ns; > > err = -EINVAL; > > @@ -3588,12 +3584,12 @@ static int do_move_mount(struct path *old_path, > /* ... it should be detachable from parent */ > if (!mnt_has_parent(old) || IS_MNT_LOCKED(old)) > goto out; > + /* ... which should not be shared */ > + if (IS_MNT_SHARED(old->mnt_parent)) > + goto out; > /* ... and the target should be in our namespace */ > if (!check_mnt(p)) > goto out; > - /* parent of the source should not be shared */ > - if (IS_MNT_SHARED(parent)) > - goto out; > } else { > /* > * otherwise the source must be the root of some anon namespace. 
> @@ -3605,7 +3601,7 @@ static int do_move_mount(struct path *old_path, > * subsequent checks would've rejected that, but they lose > * some corner cases if we check it early. > */ > - if (ns == p->mnt_ns) > + if (old->mnt_ns == p->mnt_ns) > goto out; > /* > * Target should be either in our namespace or in an acceptable > -- > 2.47.2 > ^ permalink raw reply [flat|nested] 321+ messages in thread
* [PATCH 19/52] do_move_mount(): deal with the checks on old_path early 2025-08-25 4:43 ` [PATCH 01/52] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (16 preceding siblings ...) 2025-08-25 4:43 ` [PATCH 18/52] do_move_mount(): trim local variables Al Viro @ 2025-08-25 4:43 ` Al Viro 2025-08-25 13:00 ` Christian Brauner 2025-08-25 4:43 ` [PATCH 20/52] move_mount(2): take sanity checks in 'beneath' case into do_lock_mount() Al Viro ` (33 subsequent siblings) 51 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-25 4:43 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds 1) checking that location we want to move does point to root of some mount can be done before anything else; that property is not going to change and having it already verified simplifies the analysis. 2) checking the type agreement between what we are trying to move and what we are trying to move it onto also belongs in the very beginning - do_lock_mount() might end up switching new_path to something that overmounts the original location, but... the same type agreement applies to overmounts, so we could just as well check against the original location. 3) since we know that old_path->dentry is the root of old_path->mnt, there's no point bothering with path_overmounted() in can_move_mount_beneath(); it's simply a check for the mount we are trying to move having non-NULL ->overmount. And with that, we can switch can_move_mount_beneath() to taking old instead of old_path, leaving no uses of old_path past the original checks. 
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 29 +++++++++++++---------------- 1 file changed, 13 insertions(+), 16 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 1a076aac5d73..42ef0d0c3d40 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3433,7 +3433,7 @@ static bool mount_is_ancestor(const struct mount *p1, const struct mount *p2) /** * can_move_mount_beneath - check that we can mount beneath the top mount - * @from: mount to mount beneath + * @mnt_from: mount we are trying to move * @to: mount under which to mount * @mp: mountpoint of @to * @@ -3443,7 +3443,7 @@ static bool mount_is_ancestor(const struct mount *p1, const struct mount *p2) * root or the rootfs of the namespace. * - Make sure that the caller can unmount the topmost mount ensuring * that the caller could reveal the underlying mountpoint. - * - Ensure that nothing has been mounted on top of @from before we + * - Ensure that nothing has been mounted on top of @mnt_from before we * grabbed @namespace_sem to avoid creating pointless shadow mounts. * - Prevent mounting beneath a mount if the propagation relationship * between the source mount, parent mount, and top mount would lead to @@ -3452,12 +3452,11 @@ static bool mount_is_ancestor(const struct mount *p1, const struct mount *p2) * Context: This function expects namespace_lock() to be held. * Return: On success 0, and on error a negative error code is returned. */ -static int can_move_mount_beneath(const struct path *from, +static int can_move_mount_beneath(struct mount *mnt_from, const struct path *to, const struct mountpoint *mp) { - struct mount *mnt_from = real_mount(from->mnt), - *mnt_to = real_mount(to->mnt), + struct mount *mnt_to = real_mount(to->mnt), *parent_mnt_to = mnt_to->mnt_parent; if (!mnt_has_parent(mnt_to)) @@ -3470,7 +3469,7 @@ static int can_move_mount_beneath(const struct path *from, return -EINVAL; /* Avoid creating shadow mounts during mount propagation. 
*/ - if (path_overmounted(from)) + if (mnt_from->overmount) return -EINVAL; /* @@ -3565,16 +3564,21 @@ static int do_move_mount(struct path *old_path, struct path *new_path, enum mnt_tree_flags_t flags) { struct mount *p; - struct mount *old; + struct mount *old = real_mount(old_path->mnt); struct pinned_mountpoint mp; int err; bool beneath = flags & MNT_TREE_BENEATH; + if (!path_mounted(old_path)) + return -EINVAL; + + if (d_is_dir(new_path->dentry) != d_is_dir(old_path->dentry)) + return -EINVAL; + err = do_lock_mount(new_path, &mp, beneath); if (err) return err; - old = real_mount(old_path->mnt); p = real_mount(new_path->mnt); err = -EINVAL; @@ -3611,15 +3615,8 @@ static int do_move_mount(struct path *old_path, goto out; } - if (!path_mounted(old_path)) - goto out; - - if (d_is_dir(new_path->dentry) != - d_is_dir(old_path->dentry)) - goto out; - if (beneath) { - err = can_move_mount_beneath(old_path, new_path, mp.mp); + err = can_move_mount_beneath(old, new_path, mp.mp); if (err) goto out; -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* Re: [PATCH 19/52] do_move_mount(): deal with the checks on old_path early 2025-08-25 4:43 ` [PATCH 19/52] do_move_mount(): deal with the checks on old_path early Al Viro @ 2025-08-25 13:00 ` Christian Brauner 0 siblings, 0 replies; 321+ messages in thread From: Christian Brauner @ 2025-08-25 13:00 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, jack, torvalds On Mon, Aug 25, 2025 at 05:43:22AM +0100, Al Viro wrote: > 1) checking that location we want to move does point to root of some mount > can be done before anything else; that property is not going to change > and having it already verified simplifies the analysis. > > 2) checking the type agreement between what we are trying to move and what > we are trying to move it onto also belongs in the very beginning - > do_lock_mount() might end up switching new_path to something that overmounts > the original location, but... the same type agreement applies to overmounts, > so we could just as well check against the original location. > > 3) since we know that old_path->dentry is the root of old_path->mnt, there's > no point bothering with path_overmounted() in can_move_mount_beneath(); > it's simply a check for the mount we are trying to move having non-NULL > ->overmount. And with that, we can switch can_move_mount_beneath() to > taking old instead of old_path, leaving no uses of old_path past the original > checks. > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > --- Reviewed-by: Christian Brauner <brauner@kernel.org> ^ permalink raw reply [flat|nested] 321+ messages in thread
* [PATCH 20/52] move_mount(2): take sanity checks in 'beneath' case into do_lock_mount() 2025-08-25 4:43 ` [PATCH 01/52] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (17 preceding siblings ...) 2025-08-25 4:43 ` [PATCH 19/52] do_move_mount(): deal with the checks on old_path early Al Viro @ 2025-08-25 4:43 ` Al Viro 2025-08-25 12:10 ` Linus Torvalds 2025-08-25 13:02 ` Christian Brauner 2025-08-25 4:43 ` [PATCH 21/52] finish_automount(): simplify the ELOOP check Al Viro ` (32 subsequent siblings) 51 siblings, 2 replies; 321+ messages in thread From: Al Viro @ 2025-08-25 4:43 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds We want to mount beneath the given location. For that operation to make sense, location must be the root of some mount that has something under it. Currently we let it proceed if those requirements are not met, with rather meaningless results, and have that bogosity caught further down the road; let's fail early instead - do_lock_mount() doesn't make sense unless those conditions hold, and checking them there makes things simpler. 
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 15 +++++++-------- 1 file changed, 7 insertions(+), 8 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 42ef0d0c3d40..9e04133d81dd 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2768,12 +2768,19 @@ static int do_lock_mount(struct path *path, struct pinned_mountpoint *pinned, bo struct path under = {}; int err = -ENOENT; + if (unlikely(beneath) && !path_mounted(path)) + return -EINVAL; + for (;;) { struct mount *m = real_mount(mnt); if (beneath) { path_put(&under); read_seqlock_excl(&mount_lock); + if (unlikely(!mnt_has_parent(m))) { + read_sequnlock_excl(&mount_lock); + return -EINVAL; + } under.mnt = mntget(&m->mnt_parent->mnt); under.dentry = dget(m->mnt_mountpoint); read_sequnlock_excl(&mount_lock); @@ -3437,8 +3444,6 @@ static bool mount_is_ancestor(const struct mount *p1, const struct mount *p2) * @to: mount under which to mount * @mp: mountpoint of @to * - * - Make sure that @to->dentry is actually the root of a mount under - * which we can mount another mount. * - Make sure that nothing can be mounted beneath the caller's current * root or the rootfs of the namespace. * - Make sure that the caller can unmount the topmost mount ensuring @@ -3459,12 +3464,6 @@ static int can_move_mount_beneath(struct mount *mnt_from, struct mount *mnt_to = real_mount(to->mnt), *parent_mnt_to = mnt_to->mnt_parent; - if (!mnt_has_parent(mnt_to)) - return -EINVAL; - - if (!path_mounted(to)) - return -EINVAL; - if (IS_MNT_LOCKED(mnt_to)) return -EINVAL; -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* Re: [PATCH 20/52] move_mount(2): take sanity checks in 'beneath' case into do_lock_mount() 2025-08-25 4:43 ` [PATCH 20/52] move_mount(2): take sanity checks in 'beneath' case into do_lock_mount() Al Viro @ 2025-08-25 12:10 ` Linus Torvalds 2025-08-25 12:17 ` Linus Torvalds 2025-08-25 13:02 ` Christian Brauner 1 sibling, 1 reply; 321+ messages in thread From: Linus Torvalds @ 2025-08-25 12:10 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, brauner, jack On Mon, 25 Aug 2025 at 00:45, Al Viro <viro@zeniv.linux.org.uk> wrote: > > if (beneath) { > path_put(&under); > read_seqlock_excl(&mount_lock); > + if (unlikely(!mnt_has_parent(m))) { > + read_sequnlock_excl(&mount_lock); > + return -EINVAL; > + } > under.mnt = mntget(&m->mnt_parent->mnt); > under.dentry = dget(m->mnt_mountpoint); > read_sequnlock_excl(&mount_lock); Well, *this* would look a lot cleaner with a "scoped_guard(mount_locked_reader)", but you didn't do that for some reason. Am I missing something? Linus ^ permalink raw reply [flat|nested] 321+ messages in thread
* Re: [PATCH 20/52] move_mount(2): take sanity checks in 'beneath' case into do_lock_mount() 2025-08-25 12:10 ` Linus Torvalds @ 2025-08-25 12:17 ` Linus Torvalds 0 siblings, 0 replies; 321+ messages in thread From: Linus Torvalds @ 2025-08-25 12:17 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, brauner, jack On Mon, 25 Aug 2025 at 08:10, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Well, *this* would look a lot cleaner with a > "scoped_guard(mount_locked_reader)", but you didn't do that for some > reason. Am I missing something? Ahh. You rewrite it to look very different in 34/52. Linus ^ permalink raw reply [flat|nested] 321+ messages in thread
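For readers without the kernel tree at hand, a rough userspace sketch of what a scoped guard would buy in the hunk Linus quotes: the early "no parent" return no longer needs its own unlock call. Everything named demo_* is hypothetical, a pthread mutex stands in for mount_lock, and the macro only approximates the shape of scoped_guard() from cleanup.h via the GCC/Clang cleanup attribute. It is not what the series ends up doing (34/52 restructures the function instead), and it skips the mntget()/dget() step of the real code.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these names are kernel APIs. */
static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;
static int demo_lock_depth;	/* 1 while demo_lock is held, 0 otherwise */

struct demo_guard { pthread_mutex_t *lock; };

static inline struct demo_guard demo_guard_enter(pthread_mutex_t *lock)
{
	pthread_mutex_lock(lock);
	demo_lock_depth++;
	return (struct demo_guard){ .lock = lock };
}

/* Cleanup handler: runs whenever the guard variable leaves its scope. */
static inline void demo_guard_exit(struct demo_guard *g)
{
	demo_lock_depth--;
	pthread_mutex_unlock(g->lock);
}

/* Run the following block once with the lock held; unlock on any exit. */
#define demo_scoped_guard(lockp)					\
	for (struct demo_guard demo_g_					\
			__attribute__((cleanup(demo_guard_exit))) =	\
			demo_guard_enter(lockp), *demo_once_ = &demo_g_; \
	     demo_once_; demo_once_ = NULL)

/* Simplified mount: in this sketch a NULL parent means "no parent". */
struct demo_mount { struct demo_mount *parent; };

static int demo_parent_of(struct demo_mount *m, struct demo_mount **res)
{
	demo_scoped_guard(&demo_lock) {
		assert(demo_lock_depth == 1);
		if (!m->parent)
			return -EINVAL;	/* handler drops the lock for us */
		*res = m->parent;
	}
	return 0;
}
```

The flip side is the one raised in the cover letter: the unlock becomes invisible, so whatever else sits inside the braces has to be safe to do under that lock.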
* Re: [PATCH 20/52] move_mount(2): take sanity checks in 'beneath' case into do_lock_mount() 2025-08-25 4:43 ` [PATCH 20/52] move_mount(2): take sanity checks in 'beneath' case into do_lock_mount() Al Viro 2025-08-25 12:10 ` Linus Torvalds @ 2025-08-25 13:02 ` Christian Brauner 2025-08-25 16:05 ` Al Viro 1 sibling, 1 reply; 321+ messages in thread From: Christian Brauner @ 2025-08-25 13:02 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, jack, torvalds On Mon, Aug 25, 2025 at 05:43:23AM +0100, Al Viro wrote: > We want to mount beneath the given location. For that operation to > make sense, location must be the root of some mount that has something > under it. Currently we let it proceed if those requirements are not met, > with rather meaningless results, and have that bogosity caught further > down the road; let's fail early instead - do_lock_mount() doesn't make > sense unless those conditions hold, and checking them there makes > things simpler. > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > --- Well, do_lock_mount() was already convoluted enough that I didn't want that in there as well. But I don't care, Reviewed-by: Christian Brauner <brauner@kernel.org> ^ permalink raw reply [flat|nested] 321+ messages in thread
* Re: [PATCH 20/52] move_mount(2): take sanity checks in 'beneath' case into do_lock_mount() 2025-08-25 13:02 ` Christian Brauner @ 2025-08-25 16:05 ` Al Viro 0 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-08-25 16:05 UTC (permalink / raw) To: Christian Brauner; +Cc: linux-fsdevel, jack, torvalds On Mon, Aug 25, 2025 at 03:02:09PM +0200, Christian Brauner wrote: > On Mon, Aug 25, 2025 at 05:43:23AM +0100, Al Viro wrote: > > We want to mount beneath the given location. For that operation to > > make sense, location must be the root of some mount that has something > > under it. Currently we let it proceed if those requirements are not met, > > with rather meaningless results, and have that bogosity caught further > > down the road; let's fail early instead - do_lock_mount() doesn't make > > sense unless those conditions hold, and checking them there makes > > things simpler. > > > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > > --- > > Well, do_lock_mount() was already convoluted enough that didn't want > that in there as well. But I don't care, It helps when it comes to cleaning it up - look at the condition it's in after 34/52... ^ permalink raw reply [flat|nested] 321+ messages in thread
* [PATCH 21/52] finish_automount(): simplify the ELOOP check 2025-08-25 4:43 ` [PATCH 01/52] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (18 preceding siblings ...) 2025-08-25 4:43 ` [PATCH 20/52] move_mount(2): take sanity checks in 'beneath' case into do_lock_mount() Al Viro @ 2025-08-25 4:43 ` Al Viro 2025-08-25 13:02 ` Christian Brauner 2025-08-25 4:43 ` [PATCH 22/52] do_loopback(): use __free(path_put) to deal with old_path Al Viro ` (31 subsequent siblings) 51 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-25 4:43 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds It's enough to check that dentries match; if path->dentry is equal to m->mnt_root, superblocks will match as well. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 9e04133d81dd..5c4b4f25b5f8 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3803,8 +3803,7 @@ int finish_automount(struct vfsmount *m, const struct path *path) mnt = real_mount(m); - if (m->mnt_sb == path->mnt->mnt_sb && - m->mnt_root == dentry) { + if (m->mnt_root == path->dentry) { err = -ELOOP; goto discard; } -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* Re: [PATCH 21/52] finish_automount(): simplify the ELOOP check 2025-08-25 4:43 ` [PATCH 21/52] finish_automount(): simplify the ELOOP check Al Viro @ 2025-08-25 13:02 ` Christian Brauner 0 siblings, 0 replies; 321+ messages in thread From: Christian Brauner @ 2025-08-25 13:02 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, jack, torvalds On Mon, Aug 25, 2025 at 05:43:24AM +0100, Al Viro wrote: > It's enough to check that dentries match; if path->dentry is equal to > m->mnt_root, superblocks will match as well. > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > --- Reviewed-by: Christian Brauner <brauner@kernel.org> ^ permalink raw reply [flat|nested] 321+ messages in thread
* [PATCH 22/52] do_loopback(): use __free(path_put) to deal with old_path 2025-08-25 4:43 ` [PATCH 01/52] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (19 preceding siblings ...) 2025-08-25 4:43 ` [PATCH 21/52] finish_automount(): simplify the ELOOP check Al Viro @ 2025-08-25 4:43 ` Al Viro 2025-08-25 13:02 ` Christian Brauner 2025-08-25 4:43 ` [PATCH 23/52] pivot_root(2): use __free() to deal with struct path in it Al Viro ` (30 subsequent siblings) 51 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-25 4:43 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds preparations for making unlock_mount() a __cleanup(); can't have path_put() inside mount_lock scope. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 5c4b4f25b5f8..602612cbd095 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3014,7 +3014,7 @@ static struct mount *__do_loopback(struct path *old_path, int recurse) static int do_loopback(struct path *path, const char *old_name, int recurse) { - struct path old_path; + struct path old_path __free(path_put) = {}; struct mount *mnt = NULL, *parent; struct pinned_mountpoint mp = {}; int err; @@ -3024,13 +3024,12 @@ static int do_loopback(struct path *path, const char *old_name, if (err) return err; - err = -EINVAL; if (mnt_ns_loop(old_path.dentry)) - goto out; + return -EINVAL; err = lock_mount(path, &mp); if (err) - goto out; + return err; parent = real_mount(path->mnt); if (!check_mnt(parent)) @@ -3050,8 +3049,6 @@ static int do_loopback(struct path *path, const char *old_name, } out2: unlock_mount(&mp); -out: - path_put(&old_path); return err; } -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* Re: [PATCH 22/52] do_loopback(): use __free(path_put) to deal with old_path 2025-08-25 4:43 ` [PATCH 22/52] do_loopback(): use __free(path_put) to deal with old_path Al Viro @ 2025-08-25 13:02 ` Christian Brauner 0 siblings, 0 replies; 321+ messages in thread From: Christian Brauner @ 2025-08-25 13:02 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, jack, torvalds On Mon, Aug 25, 2025 at 05:43:25AM +0100, Al Viro wrote: > preparations for making unlock_mount() a __cleanup(); > can't have path_put() inside mount_lock scope. > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > --- Reviewed-by: Christian Brauner <brauner@kernel.org> ^ permalink raw reply [flat|nested] 321+ messages in thread
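A userspace sketch of why the conversion in 22/52 helps with the stated goal. With the GCC/Clang cleanup attribute, handlers run in reverse order of declaration, so a reference declared before a lock-like object is dropped after that lock is released, on every exit path. All demo_* names are hypothetical; 'P' and 'U' in the log stand for "put" and "unlock".

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical event log: 'P' = reference put, 'U' = lock dropped. */
static char demo_log[8];
static int demo_n;

static void demo_note(char c)
{
	assert(demo_n < (int)sizeof(demo_log) - 1);
	demo_log[demo_n++] = c;
}

static void demo_reset(void)
{
	memset(demo_log, 0, sizeof(demo_log));
	demo_n = 0;
}

struct demo_ref  { int held; };	/* stand-in for struct path */
struct demo_lock { int held; };	/* stand-in for a locked mountpoint */

/* Both handlers are no-ops on a zero-initialized object. */
static void demo_put(struct demo_ref *r)
{
	if (r->held)
		demo_note('P');
	r->held = 0;
}

static void demo_unlock(struct demo_lock *l)
{
	if (l->held)
		demo_note('U');
	l->held = 0;
}

static int demo_loopback(int lookup_fails, int ns_loop)
{
	struct demo_ref old __attribute__((cleanup(demo_put))) = { 0 };

	if (lookup_fails)
		return -ENOENT;		/* nothing held, handler does nothing */
	old.held = 1;

	if (ns_loop)
		return -EINVAL;		/* was: err = -EINVAL; goto out; */

	struct demo_lock mp __attribute__((cleanup(demo_unlock))) = { 1 };

	return 0;			/* 'U' is logged first, then 'P' */
}
```

After demo_reset() and demo_loopback(0, 0) the log reads "UP": the lock is gone before the reference is dropped, which is the ordering the commit message asks for.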
* [PATCH 23/52] pivot_root(2): use __free() to deal with struct path in it 2025-08-25 4:43 ` [PATCH 01/52] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (20 preceding siblings ...) 2025-08-25 4:43 ` [PATCH 22/52] do_loopback(): use __free(path_put) to deal with old_path Al Viro @ 2025-08-25 4:43 ` Al Viro 2025-08-25 13:03 ` Christian Brauner 2025-08-25 4:43 ` [PATCH 24/52] finish_automount(): take the lock_mount() analogue into a helper Al Viro ` (29 subsequent siblings) 51 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-25 4:43 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds preparations for making unlock_mount() a __cleanup(); can't have path_put() inside mount_lock scope. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 19 +++++++------------ 1 file changed, 7 insertions(+), 12 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 602612cbd095..892251663419 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -4628,7 +4628,9 @@ EXPORT_SYMBOL(path_is_under); SYSCALL_DEFINE2(pivot_root, const char __user *, new_root, const char __user *, put_old) { - struct path new, old, root; + struct path new __free(path_put) = {}; + struct path old __free(path_put) = {}; + struct path root __free(path_put) = {}; struct mount *new_mnt, *root_mnt, *old_mnt, *root_parent, *ex_parent; struct pinned_mountpoint old_mp = {}; int error; @@ -4639,21 +4641,21 @@ SYSCALL_DEFINE2(pivot_root, const char __user *, new_root, error = user_path_at(AT_FDCWD, new_root, LOOKUP_FOLLOW | LOOKUP_DIRECTORY, &new); if (error) - goto out0; + return error; error = user_path_at(AT_FDCWD, put_old, LOOKUP_FOLLOW | LOOKUP_DIRECTORY, &old); if (error) - goto out1; + return error; error = security_sb_pivotroot(&old, &new); if (error) - goto out2; + return error; get_fs_root(current->fs, &root); error = lock_mount(&old, &old_mp); if (error) - goto out3; + return error; error = -EINVAL; new_mnt = real_mount(new.mnt); @@ -4711,13 
+4713,6 @@ SYSCALL_DEFINE2(pivot_root, const char __user *, new_root, error = 0; out4: unlock_mount(&old_mp); -out3: - path_put(&root); -out2: - path_put(&old); -out1: - path_put(&new); -out0: return error; } -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* Re: [PATCH 23/52] pivot_root(2): use __free() to deal with struct path in it 2025-08-25 4:43 ` [PATCH 23/52] pivot_root(2): use __free() to deal with struct path in it Al Viro @ 2025-08-25 13:03 ` Christian Brauner 0 siblings, 0 replies; 321+ messages in thread From: Christian Brauner @ 2025-08-25 13:03 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, jack, torvalds On Mon, Aug 25, 2025 at 05:43:26AM +0100, Al Viro wrote: > preparations for making unlock_mount() a __cleanup(); > can't have path_put() inside mount_lock scope. > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > --- Reviewed-by: Christian Brauner <brauner@kernel.org> ^ permalink raw reply [flat|nested] 321+ messages in thread
* [PATCH 24/52] finish_automount(): take the lock_mount() analogue into a helper 2025-08-25 4:43 ` [PATCH 01/52] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (21 preceding siblings ...) 2025-08-25 4:43 ` [PATCH 23/52] pivot_root(2): use __free() to deal with struct path in it Al Viro @ 2025-08-25 4:43 ` Al Viro 2025-08-25 13:08 ` Christian Brauner 2025-08-25 4:43 ` [PATCH 25/52] do_new_mount_rc(): use __free() to deal with dropping mnt on failure Al Viro ` (28 subsequent siblings) 51 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-25 4:43 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds finish_automount() can't use lock_mount() - it treats finding something already mounted as "quietly drop our mount and return 0", not as "mount on top of whatever is mounted there". It's been open-coded; let's take it into a helper similar to lock_mount(). "something's already mounted" => -EBUSY, finish_automount() needs to distinguish it from the normal case and it can't happen in other failure cases. 
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 42 +++++++++++++++++++++++++----------------- 1 file changed, 25 insertions(+), 17 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 892251663419..99757040a39a 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3786,9 +3786,29 @@ static int do_new_mount(struct path *path, const char *fstype, int sb_flags, return err; } -int finish_automount(struct vfsmount *m, const struct path *path) +static int lock_mount_exact(const struct path *path, + struct pinned_mountpoint *mp) { struct dentry *dentry = path->dentry; + int err; + + inode_lock(dentry->d_inode); + namespace_lock(); + if (unlikely(cant_mount(dentry))) + err = -ENOENT; + else if (path_overmounted(path)) + err = -EBUSY; + else + err = get_mountpoint(dentry, mp); + if (unlikely(err)) { + namespace_unlock(); + inode_unlock(dentry->d_inode); + } + return err; +} + +int finish_automount(struct vfsmount *m, const struct path *path) +{ struct pinned_mountpoint mp = {}; struct mount *mnt; int err; @@ -3810,20 +3830,11 @@ int finish_automount(struct vfsmount *m, const struct path *path) * that overmounts our mountpoint to be means "quitely drop what we've * got", not "try to mount it on top". */ - inode_lock(dentry->d_inode); - namespace_lock(); - if (unlikely(cant_mount(dentry))) { - err = -ENOENT; - goto discard_locked; - } - if (path_overmounted(path)) { - err = 0; - goto discard_locked; + err = lock_mount_exact(path, &mp); + if (unlikely(err)) { + mntput(m); + return err == -EBUSY ? 
0 : err; } - err = get_mountpoint(dentry, &mp); - if (err) - goto discard_locked; - err = do_add_mount(mnt, mp.mp, path, path->mnt->mnt_flags | MNT_SHRINKABLE); unlock_mount(&mp); @@ -3831,9 +3842,6 @@ int finish_automount(struct vfsmount *m, const struct path *path) goto discard; return 0; -discard_locked: - namespace_unlock(); - inode_unlock(dentry->d_inode); discard: mntput(m); return err; -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* Re: [PATCH 24/52] finish_automount(): take the lock_mount() analogue into a helper 2025-08-25 4:43 ` [PATCH 24/52] finish_automount(): take the lock_mount() analogue into a helper Al Viro @ 2025-08-25 13:08 ` Christian Brauner 0 siblings, 0 replies; 321+ messages in thread From: Christian Brauner @ 2025-08-25 13:08 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, jack, torvalds On Mon, Aug 25, 2025 at 05:43:27AM +0100, Al Viro wrote: > finish_automount() can't use lock_mount() - it treats finding something > already mounted as "quitely drop our mount and return 0", not as > "mount on top of whatever mounted there". It's been open-coded; > let's take it into a helper similar to lock_mount(). "something's > already mounted" => -EBUSY, finish_automount() needs to distinguish > it from the normal case and it can't happen in other failure cases. > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > --- Reviewed-by: Christian Brauner <brauner@kernel.org> > fs/namespace.c | 42 +++++++++++++++++++++++++----------------- > 1 file changed, 25 insertions(+), 17 deletions(-) > > diff --git a/fs/namespace.c b/fs/namespace.c > index 892251663419..99757040a39a 100644 > --- a/fs/namespace.c > +++ b/fs/namespace.c > @@ -3786,9 +3786,29 @@ static int do_new_mount(struct path *path, const char *fstype, int sb_flags, > return err; > } > > -int finish_automount(struct vfsmount *m, const struct path *path) > +static int lock_mount_exact(const struct path *path, > + struct pinned_mountpoint *mp) > { > struct dentry *dentry = path->dentry; > + int err; > + > + inode_lock(dentry->d_inode); > + namespace_lock(); > + if (unlikely(cant_mount(dentry))) > + err = -ENOENT; > + else if (path_overmounted(path)) > + err = -EBUSY; > + else > + err = get_mountpoint(dentry, mp); > + if (unlikely(err)) { > + namespace_unlock(); > + inode_unlock(dentry->d_inode); > + } > + return err; > +} > + > +int finish_automount(struct vfsmount *m, const struct path *path) > +{ > struct pinned_mountpoint 
mp = {}; > struct mount *mnt; > int err; > @@ -3810,20 +3830,11 @@ int finish_automount(struct vfsmount *m, const struct path *path) > * that overmounts our mountpoint to be means "quitely drop what we've > * got", not "try to mount it on top". > */ > - inode_lock(dentry->d_inode); > - namespace_lock(); > - if (unlikely(cant_mount(dentry))) { > - err = -ENOENT; > - goto discard_locked; > - } > - if (path_overmounted(path)) { > - err = 0; > - goto discard_locked; > + err = lock_mount_exact(path, &mp); > + if (unlikely(err)) { > + mntput(m); > + return err == -EBUSY ? 0 : err; > } > - err = get_mountpoint(dentry, &mp); > - if (err) > - goto discard_locked; > - > err = do_add_mount(mnt, mp.mp, path, > path->mnt->mnt_flags | MNT_SHRINKABLE); > unlock_mount(&mp); > @@ -3831,9 +3842,6 @@ int finish_automount(struct vfsmount *m, const struct path *path) > goto discard; > return 0; > > -discard_locked: > - namespace_unlock(); > - inode_unlock(dentry->d_inode); > discard: > mntput(m); Can use direct returns if you do: struct mount *mnt __free(mntput) = NULL; and then in the success condition: retain_and_null_ptr(mnt); ^ permalink raw reply [flat|nested] 321+ messages in thread
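A userspace sketch of the pattern Christian suggests here: hold the object in a cleanup-annotated pointer so every failure path can return directly, and NULL the pointer once ownership has passed on. The demo_* names are hypothetical; demo_retain_and_null() only imitates what retain_and_null_ptr() is used for in this series.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical refcounted object standing in for a vfsmount. */
struct demo_obj { int refs; };

/* Cleanup handler: drop our reference unless the pointer was NULLed. */
static void demo_obj_put(struct demo_obj **pp)
{
	if (*pp) {
		assert((*pp)->refs > 0);
		(*pp)->refs--;
	}
}

#define demo_free_obj		__attribute__((cleanup(demo_obj_put)))
#define demo_retain_and_null(p)	((void)((p) = NULL))

/* Where a successful demo_add() stores the object it has consumed. */
static struct demo_obj *demo_tree;

static int demo_add(struct demo_obj *o, int fail)
{
	if (fail)
		return -EBUSY;
	demo_tree = o;		/* the reference now belongs to the tree */
	return 0;
}

static int demo_finish(struct demo_obj *o, int fail)
{
	struct demo_obj *mnt demo_free_obj = o;	/* we own one reference */
	int err;

	err = demo_add(mnt, fail);
	if (err)
		return err;		/* handler drops our reference */

	demo_retain_and_null(mnt);	/* consumed on success */
	return 0;
}
```

The NULLing has to follow the success check directly; anything that can fail between the two would bring back the question of who owns the reference.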
* [PATCH 25/52] do_new_mount_rc(): use __free() to deal with dropping mnt on failure 2025-08-25 4:43 ` [PATCH 01/52] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (22 preceding siblings ...) 2025-08-25 4:43 ` [PATCH 24/52] finish_automount(): take the lock_mount() analogue into a helper Al Viro @ 2025-08-25 4:43 ` Al Viro 2025-08-25 13:29 ` Christian Brauner 2025-08-25 4:43 ` [PATCH 26/52] finish_automount(): use __free() to deal with dropping mnt on failure Al Viro ` (27 subsequent siblings) 51 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-25 4:43 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds do_add_mount() consumes vfsmount on success; just follow it with conditional retain_and_null_ptr() on success and we can switch to __free() for mnt and be done with that - unlock_mount() is in the very end. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 99757040a39a..79c87937a7dd 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3694,7 +3694,6 @@ static bool mount_too_revealing(const struct super_block *sb, int *new_mnt_flags static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint, unsigned int mnt_flags) { - struct vfsmount *mnt; struct pinned_mountpoint mp = {}; struct super_block *sb = fc->root->d_sb; int error; @@ -3710,7 +3709,7 @@ static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint, up_write(&sb->s_umount); - mnt = vfs_create_mount(fc); + struct vfsmount *mnt __free(mntput) = vfs_create_mount(fc); if (IS_ERR(mnt)) return PTR_ERR(mnt); @@ -3720,10 +3719,10 @@ static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint, if (!error) { error = do_add_mount(real_mount(mnt), mp.mp, mountpoint, mnt_flags); + if (!error) + retain_and_null_ptr(mnt); // consumed on success unlock_mount(&mp); } - if (error < 0) - mntput(mnt); return 
error; } -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* Re: [PATCH 25/52] do_new_mount_rc(): use __free() to deal with dropping mnt on failure 2025-08-25 4:43 ` [PATCH 25/52] do_new_mount_rc(): use __free() to deal with dropping mnt on failure Al Viro @ 2025-08-25 13:29 ` Christian Brauner 2025-08-25 16:09 ` Al Viro 0 siblings, 1 reply; 321+ messages in thread From: Christian Brauner @ 2025-08-25 13:29 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, jack, torvalds On Mon, Aug 25, 2025 at 05:43:28AM +0100, Al Viro wrote: > do_add_mount() consumes vfsmount on success; just follow it with > conditional retain_and_null_ptr() on success and we can switch > to __free() for mnt and be done with that - unlock_mount() is > in the very end. > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > --- > fs/namespace.c | 7 +++---- > 1 file changed, 3 insertions(+), 4 deletions(-) > > diff --git a/fs/namespace.c b/fs/namespace.c > index 99757040a39a..79c87937a7dd 100644 > --- a/fs/namespace.c > +++ b/fs/namespace.c > @@ -3694,7 +3694,6 @@ static bool mount_too_revealing(const struct super_block *sb, int *new_mnt_flags > static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint, > unsigned int mnt_flags) > { > - struct vfsmount *mnt; > struct pinned_mountpoint mp = {}; > struct super_block *sb = fc->root->d_sb; > int error; > @@ -3710,7 +3709,7 @@ static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint, > > up_write(&sb->s_umount); > > - mnt = vfs_create_mount(fc); > + struct vfsmount *mnt __free(mntput) = vfs_create_mount(fc); Ugh, can we please not start declaring variables in the middle of a scope. 
> if (IS_ERR(mnt)) > return PTR_ERR(mnt); > > @@ -3720,10 +3719,10 @@ static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint, > if (!error) { > error = do_add_mount(real_mount(mnt), mp.mp, > mountpoint, mnt_flags); > + if (!error) > + retain_and_null_ptr(mnt); // consumed on success > unlock_mount(&mp); > } > - if (error < 0) > - mntput(mnt); > return error; > } > > -- > 2.47.2 > ^ permalink raw reply [flat|nested] 321+ messages in thread
* Re: [PATCH 25/52] do_new_mount_rc(): use __free() to deal with dropping mnt on failure 2025-08-25 13:29 ` Christian Brauner @ 2025-08-25 16:09 ` Al Viro 2025-08-26 8:27 ` Christian Brauner 0 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-25 16:09 UTC (permalink / raw) To: Christian Brauner; +Cc: linux-fsdevel, jack, torvalds On Mon, Aug 25, 2025 at 03:29:33PM +0200, Christian Brauner wrote: > > - mnt = vfs_create_mount(fc); > > + struct vfsmount *mnt __free(mntput) = vfs_create_mount(fc); > > Ugh, can we please not start declaring variables in the middle of a > scope. Seeing that it *is* the beginning of its scope, what do you suggest? Declaring it above, initializing with NULL and reassigning here? That's actually just as wrong, if not more so - any assignment added to it at earlier point and you've got a silent leak, so verifying correctness would be harder that way. ^ permalink raw reply [flat|nested] 321+ messages in thread
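A userspace sketch of the leak scenario Al describes, as one reading of his argument. If the cleanup-annotated pointer is declared up front with NULL, an assignment added at an earlier point compiles cleanly and the object it produced is never released, because the handler only sees the final value. Declaring the variable where it is initialized leaves no earlier point at which the name exists. All demo_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical object pool; demo_live counts objects not yet released. */
struct demo_obj { int live; };

static struct demo_obj demo_pool[4];
static int demo_next;
static int demo_live;

static struct demo_obj *demo_get(void)
{
	assert(demo_next < 4);
	struct demo_obj *o = &demo_pool[demo_next++];

	o->live = 1;
	demo_live++;
	return o;
}

/* Cleanup handler: releases whatever the pointer holds at scope exit. */
static void demo_put(struct demo_obj **pp)
{
	if (*pp) {
		(*pp)->live = 0;
		demo_live--;
	}
}

#define demo_free __attribute__((cleanup(demo_put)))

/* Declared at the top with NULL, assigned later. */
static void demo_null_then_assign(void)
{
	struct demo_obj *o demo_free = NULL;

	o = demo_get();	/* assignment added at an earlier point */
	o = demo_get();	/* intended initialization; the first object leaks */
}

/* Declared where it is initialized: no earlier assignment is possible. */
static void demo_declare_at_init(void)
{
	struct demo_obj *o demo_free = demo_get();

	(void)o;
}
```

Calling demo_declare_at_init() leaves demo_live at 0, while demo_null_then_assign() leaves it at 1 with no compiler diagnostic.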
* Re: [PATCH 25/52] do_new_mount_rc(): use __free() to deal with dropping mnt on failure 2025-08-25 16:09 ` Al Viro @ 2025-08-26 8:27 ` Christian Brauner 2025-08-26 17:00 ` Al Viro 0 siblings, 1 reply; 321+ messages in thread From: Christian Brauner @ 2025-08-26 8:27 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, jack, torvalds On Mon, Aug 25, 2025 at 05:09:39PM +0100, Al Viro wrote: > On Mon, Aug 25, 2025 at 03:29:33PM +0200, Christian Brauner wrote: > > > - mnt = vfs_create_mount(fc); > > > + struct vfsmount *mnt __free(mntput) = vfs_create_mount(fc); > > > > Ugh, can we please not start declaring variables in the middle of a > > scope. > > Seeing that it *is* the beginning of its scope, what do you suggest? What? Did I miss earlier or later changes because: static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint, unsigned int mnt_flags) { struct vfsmount *mnt; struct pinned_mountpoint mp = {}; struct super_block *sb = fc->root->d_sb; int error; error = security_sb_kern_mount(sb); if (!error && mount_too_revealing(sb, &mnt_flags)) error = -EPERM; if (unlikely(error)) { fc_drop_locked(fc); return error; } up_write(&sb->s_umount); mnt = vfs_create_mount(fc); if (IS_ERR(mnt)) return PTR_ERR(mnt); How does up_write() create a new scope? mnt_warn_timestamp_expiry(mountpoint, mnt); error = lock_mount(mountpoint, &mp); if (!error) { error = do_add_mount(real_mount(mnt), mp.mp, mountpoint, mnt_flags); unlock_mount(&mp); } if (error < 0) mntput(mnt); return error; } > Declaring it above, initializing with NULL and reassigning here? > That's actually just as wrong, if not more so - any assignment added I disagree. I do very much prefer having cleanups at the top of the function or e.g.,: if (foo) { struct vfsmount *mnt __free(mntput) = vfs_create_mount(fc); } Because it is really easy to figure out visually. But just doing it somewhere in the middle is just confusing. 
static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint,
			   unsigned int mnt_flags)
{
	struct pinned_mountpoint mp = {};
	struct super_block *sb = fc->root->d_sb;
	int error;

	error = security_sb_kern_mount(sb);
	if (!error && mount_too_revealing(sb, &mnt_flags))
		error = -EPERM;

	if (unlikely(error)) {
		fc_drop_locked(fc);
		return error;
	}

	up_write(&sb->s_umount);

	struct vfsmount *mnt __free(mntput) = vfs_create_mount(fc);
	if (IS_ERR(mnt))
		return PTR_ERR(mnt);

^ permalink raw reply [flat|nested] 321+ messages in thread
* Re: [PATCH 25/52] do_new_mount_rc(): use __free() to deal with dropping mnt on failure 2025-08-26 8:27 ` Christian Brauner @ 2025-08-26 17:00 ` Al Viro 2025-08-26 17:55 ` Al Viro 0 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-26 17:00 UTC (permalink / raw) To: Christian Brauner; +Cc: linux-fsdevel, jack, torvalds On Tue, Aug 26, 2025 at 10:27:56AM +0200, Christian Brauner wrote: > > Declaring it above, initializing with NULL and reassigning here? > > That's actually just as wrong, if not more so - any assignment added > > I disagree. I do very much prefer having cleanups at the top of the > function or e.g.,: > > if (foo) { > struct vfsmount *mnt __free(mntput) = vfs_create_mount(fc); > } > > Because it is really easy to figure out visually. But just doing it > somewhere in the middle is just confusing. So basically you treat __free() simply as a syntax sugar for "call this on exits from this block", rather than an approximation for "here's an auto object we've created, this should be called to destroy it at the end of its scope/lifetime"? IMO it's a bad practice - it makes life much harder when you are tracing callchains, etc. FWIW, I wonder if the things would be cleaner if we did security_sb_kern_mount() and mount_too_revealing() *after* unlocking the superblock and getting a vfsmount. The latter definitely doesn't give a damn about superblock being locked and AFAICS neither does the only in-tree instance of ->sb_kern_mount(). That way we have the real initialization reasonably close to __free() and control flow is easier to follow... Folks, how about something like the delta below (on top of the posted queue)? 
diff --git a/fs/namespace.c b/fs/namespace.c index 63b74d7384fd..191e7f776de5 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3689,24 +3689,22 @@ static bool mount_too_revealing(const struct super_block *sb, int *new_mnt_flags static int do_new_mount_fc(struct fs_context *fc, const struct path *mountpoint, unsigned int mnt_flags) { + struct vfsmount *mnt __free(mntput) = NULL; struct super_block *sb = fc->root->d_sb; int error; - error = security_sb_kern_mount(sb); - if (!error && mount_too_revealing(sb, &mnt_flags)) - error = -EPERM; - - if (unlikely(error)) { - fc_drop_locked(fc); - return error; - } - up_write(&sb->s_umount); - - struct vfsmount *mnt __free(mntput) = vfs_create_mount(fc); + mnt = vfs_create_mount(fc); if (IS_ERR(mnt)) return PTR_ERR(mnt); + error = security_sb_kern_mount(sb); + if (unlikely(error)) + return error; + + if (mount_too_revealing(sb, &mnt_flags)) + return -EPERM; + mnt_warn_timestamp_expiry(mountpoint, mnt); LOCK_MOUNT(mp, mountpoint); ^ permalink raw reply related [flat|nested] 321+ messages in thread
* Re: [PATCH 25/52] do_new_mount_rc(): use __free() to deal with dropping mnt on failure 2025-08-26 17:00 ` Al Viro @ 2025-08-26 17:55 ` Al Viro 2025-08-26 18:21 ` [RFC][PATCH] switch do_new_mount_fc() to using fc_mount() Al Viro 0 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-26 17:55 UTC (permalink / raw) To: Christian Brauner; +Cc: linux-fsdevel, jack, torvalds On Tue, Aug 26, 2025 at 06:00:44PM +0100, Al Viro wrote: > FWIW, I wonder if the things would be cleaner if we did security_sb_kern_mount() > and mount_too_revealing() *after* unlocking the superblock and getting a vfsmount. > The latter definitely doesn't give a damn about superblock being locked and > AFAICS neither does the only in-tree instance of ->sb_kern_mount(). > That way we have the real initialization reasonably close to __free() and > control flow is easier to follow... Or, better yet, take vfs_get_tree() from do_new_mount() to do_new_mount_fc() and collapse it with "unlock ->s_umount and call vfs_create_mount()" into a call of fc_mount(), like the delta below (on top of posted queue, would get reordered earlier in it and pick the bits of #25 along the way). Does anyone have objections here? The only real change is that security_sb_kern_mount() gets called outside of ->s_umount exclusive scope; no in-tree instances care, but I'd Cc that to LSM list... 
diff --git a/fs/namespace.c b/fs/namespace.c index 63b74d7384fd..6f062dc7f9bf 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3690,22 +3690,18 @@ static int do_new_mount_fc(struct fs_context *fc, const struct path *mountpoint, unsigned int mnt_flags) { struct super_block *sb = fc->root->d_sb; + struct vfsmount *mnt __free(mntput) = fc_mount(fc); int error; - error = security_sb_kern_mount(sb); - if (!error && mount_too_revealing(sb, &mnt_flags)) - error = -EPERM; + if (IS_ERR(mnt)) + return PTR_ERR(mnt); - if (unlikely(error)) { - fc_drop_locked(fc); + error = security_sb_kern_mount(sb); + if (unlikely(error)) return error; - } - up_write(&sb->s_umount); - - struct vfsmount *mnt __free(mntput) = vfs_create_mount(fc); - if (IS_ERR(mnt)) - return PTR_ERR(mnt); + if (mount_too_revealing(sb, &mnt_flags)) + return -EPERM; mnt_warn_timestamp_expiry(mountpoint, mnt); @@ -3767,8 +3763,6 @@ static int do_new_mount(const struct path *path, const char *fstype, err = parse_monolithic_mount_data(fc, data); if (!err && !mount_capable(fc)) err = -EPERM; - if (!err) - err = vfs_get_tree(fc); if (!err) err = do_new_mount_fc(fc, path, mnt_flags); ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [RFC][PATCH] switch do_new_mount_fc() to using fc_mount()
  2025-08-26 17:55 ` Al Viro
@ 2025-08-26 18:21 ` Al Viro
  2025-08-27 15:38 ` Paul Moore
  0 siblings, 1 reply; 321+ messages in thread
From: Al Viro @ 2025-08-26 18:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-fsdevel, jack, Christian Brauner, linux-security-module, Paul Moore

[
This is on top of -rc3; if nobody objects, I'll insert that early in series
in viro/vfs.git#work.mount.  It has an impact for LSM folks - ->sb_kern_mount()
would be called without ->s_umount; nothing in-tree cares, but if you have
objections, yell now.
]

Prior to the call of do_new_mount_fc() the caller has just done successful
vfs_get_tree().  Then do_new_mount_fc() does several checks on resulting
superblock, and either does fc_drop_locked() and returns an error or
proceeds to unlock the superblock and call vfs_create_mount().

The thing is, there's no reason to delay that unlock + vfs_create_mount() -
the tests do not rely upon the state of ->s_umount and
	fc_drop_locked()
	put_fs_context()
is equivalent to
	unlock ->s_umount
	put_fs_context()

Doing vfs_create_mount() before the checks allows us to move vfs_get_tree()
from caller to do_new_mount_fc() and collapse it with vfs_create_mount()
into an fc_mount() call.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- diff --git a/fs/namespace.c b/fs/namespace.c index ae6d1312b184..9e1b7319532c 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3721,25 +3721,19 @@ static bool mount_too_revealing(const struct super_block *sb, int *new_mnt_flags static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint, unsigned int mnt_flags) { - struct vfsmount *mnt; struct pinned_mountpoint mp = {}; struct super_block *sb = fc->root->d_sb; + struct vfsmount *mnt = fc_mount(fc); int error; + if (IS_ERR(mnt)) + return PTR_ERR(mnt); + error = security_sb_kern_mount(sb); if (!error && mount_too_revealing(sb, &mnt_flags)) error = -EPERM; - - if (unlikely(error)) { - fc_drop_locked(fc); - return error; - } - - up_write(&sb->s_umount); - - mnt = vfs_create_mount(fc); - if (IS_ERR(mnt)) - return PTR_ERR(mnt); + if (unlikely(error)) + goto out; mnt_warn_timestamp_expiry(mountpoint, mnt); @@ -3747,10 +3741,12 @@ static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint, if (!error) { error = do_add_mount(real_mount(mnt), mp.mp, mountpoint, mnt_flags); + if (!error) + mnt = NULL; // consumed on success unlock_mount(&mp); } - if (error < 0) - mntput(mnt); +out: + mntput(mnt); return error; } @@ -3804,8 +3800,6 @@ static int do_new_mount(struct path *path, const char *fstype, int sb_flags, err = parse_monolithic_mount_data(fc, data); if (!err && !mount_capable(fc)) err = -EPERM; - if (!err) - err = vfs_get_tree(fc); if (!err) err = do_new_mount_fc(fc, path, mnt_flags); ^ permalink raw reply related [flat|nested] 321+ messages in thread
* Re: [RFC][PATCH] switch do_new_mount_fc() to using fc_mount()
  2025-08-26 18:21 ` [RFC][PATCH] switch do_new_mount_fc() to using fc_mount() Al Viro
@ 2025-08-27 15:38 ` Paul Moore
  0 siblings, 0 replies; 321+ messages in thread
From: Paul Moore @ 2025-08-27 15:38 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, linux-fsdevel, jack, Christian Brauner, linux-security-module

On Tue, Aug 26, 2025 at 2:21 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> [
> This is on top of -rc3; if nobody objects, I'll insert that early in series
> in viro/vfs.git#work.mount.  It has an impact for LSM folks - ->sb_kern_mount()
> would be called without ->s_umount; nothing in-tree cares, but if you have
> objections, yell now.
> ]

Thanks for the heads-up, I'm not aware of anyone currently
posting/working-on patches that would be dependent on this.

> Prior to the call of do_new_mount_fc() the caller has just done successful
> vfs_get_tree().  Then do_new_mount_fc() does several checks on resulting
> superblock, and either does fc_drop_locked() and returns an error or
> proceeds to unlock the superblock and call vfs_create_mount().
>
> The thing is, there's no reason to delay that unlock + vfs_create_mount() -
> the tests do not rely upon the state of ->s_umount and
>	fc_drop_locked()
>	put_fs_context()
> is equivalent to
>	unlock ->s_umount
>	put_fs_context()
>
> Doing vfs_create_mount() before the checks allows us to move vfs_get_tree()
> from caller to do_new_mount_fc() and collapse it with vfs_create_mount()
> into an fc_mount() call.
>
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

-- 
paul-moore.com

^ permalink raw reply [flat|nested] 321+ messages in thread
* [PATCH 26/52] finish_automount(): use __free() to deal with dropping mnt on failure 2025-08-25 4:43 ` [PATCH 01/52] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (23 preceding siblings ...) 2025-08-25 4:43 ` [PATCH 25/52] do_new_mount_rc(): use __free() to deal with dropping mnt on failure Al Viro @ 2025-08-25 4:43 ` Al Viro 2025-08-25 13:09 ` Christian Brauner 2025-08-25 4:43 ` [PATCH 27/52] change calling conventions for lock_mount() et.al Al Viro ` (26 subsequent siblings) 51 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-25 4:43 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds same story as with do_new_mount_fc(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 22 ++++++++-------------- 1 file changed, 8 insertions(+), 14 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 79c87937a7dd..5819a50d7d67 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3806,8 +3806,9 @@ static int lock_mount_exact(const struct path *path, return err; } -int finish_automount(struct vfsmount *m, const struct path *path) +int finish_automount(struct vfsmount *__m, const struct path *path) { + struct vfsmount *m __free(mntput) = __m; struct pinned_mountpoint mp = {}; struct mount *mnt; int err; @@ -3819,10 +3820,8 @@ int finish_automount(struct vfsmount *m, const struct path *path) mnt = real_mount(m); - if (m->mnt_root == path->dentry) { - err = -ELOOP; - goto discard; - } + if (m->mnt_root == path->dentry) + return -ELOOP; /* * we don't want to use lock_mount() - in this case finding something @@ -3830,19 +3829,14 @@ int finish_automount(struct vfsmount *m, const struct path *path) * got", not "try to mount it on top". */ err = lock_mount_exact(path, &mp); - if (unlikely(err)) { - mntput(m); + if (unlikely(err)) return err == -EBUSY ? 
0 : err; - } + err = do_add_mount(mnt, mp.mp, path, path->mnt->mnt_flags | MNT_SHRINKABLE); + if (likely(!err)) + retain_and_null_ptr(m); unlock_mount(&mp); - if (unlikely(err)) - goto discard; - return 0; - -discard: - mntput(m); return err; } -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* Re: [PATCH 26/52] finish_automount(): use __free() to deal with dropping mnt on failure
  2025-08-25 4:43 ` [PATCH 26/52] finish_automount(): use __free() to deal with dropping mnt on failure Al Viro
@ 2025-08-25 13:09 ` Christian Brauner
  0 siblings, 0 replies; 321+ messages in thread
From: Christian Brauner @ 2025-08-25 13:09 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, jack, torvalds

On Mon, Aug 25, 2025 at 05:43:29AM +0100, Al Viro wrote:
> same story as with do_new_mount_fc().
>
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---

Ah right, here it is what I suggested earlier,
Reviewed-by: Christian Brauner <brauner@kernel.org>

^ permalink raw reply [flat|nested] 321+ messages in thread
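
For readers less familiar with the __free() pattern used in the patch
above, here is a rough userspace sketch of the same idea.  It is not the
kernel code: struct obj, obj_put(), auto_put and finish_attach() are
made-up names, and the cleanup attribute assumes GCC or Clang.  The
reference is dropped on every early return, and assigning NULL on
success plays the role that retain_and_null_ptr() plays in the patch.

```c
#include <stddef.h>

/* Hypothetical stand-in for a refcounted object such as a vfsmount. */
struct obj {
	int refs;
};

/* Drop one reference; NULL is ignored. */
static void obj_put(struct obj *o)
{
	if (o)
		o->refs--;
}

/* Cleanup handler: it is handed a pointer to the annotated variable. */
static void obj_put_cleanup(struct obj **p)
{
	obj_put(*p);
}

/* Userspace analogue of __free(); hypothetical name, GCC/Clang only. */
#define auto_put __attribute__((cleanup(obj_put_cleanup)))

/*
 * Consume the reference passed in @in.  On success the reference is
 * kept (ownership has moved elsewhere); on any failure it is dropped
 * automatically when @o goes out of scope.
 */
static int finish_attach(struct obj *in, int precheck_err, int attach_err)
{
	struct obj *o auto_put = in;

	if (precheck_err)
		return precheck_err;	/* cleanup drops the reference */

	if (!attach_err)
		o = NULL;		/* success: disarm the cleanup */
	return attach_err;
}
```

With this shape there is no "discard:" label and no explicit put on the
failure paths; the only thing a reader has to check is that the pointer
is cleared exactly where ownership is handed over.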
* [PATCH 27/52] change calling conventions for lock_mount() et.al.
  2025-08-25 4:43 ` [PATCH 01/52] fs/namespace.c: fix the namespace_sem guard mess Al Viro
  ` (24 preceding siblings ...)
  2025-08-25 4:43 ` [PATCH 26/52] finish_automount(): use __free() to deal with dropping mnt on failure Al Viro
@ 2025-08-25 4:43 ` Al Viro
  2025-08-25 4:43 ` [PATCH 28/52] do_move_mount(): use the parent mount returned by do_lock_mount() Al Viro
  ` (25 subsequent siblings)
  51 siblings, 0 replies; 321+ messages in thread
From: Al Viro @ 2025-08-25 4:43 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: brauner, jack, torvalds

1) pinned_mountpoint gets a new member - struct mount *parent.
Set only if we locked the sucker; ERR_PTR() - on failed attempt.

2) do_lock_mount() et.al. return void and set ->parent to
	* on success with !beneath - mount corresponding to path->mnt
	* on success with beneath - the parent of mount corresponding
to path->mnt
	* in case of error - ERR_PTR(-E...).
IOW, we get the mount we will be actually mounting upon or ERR_PTR().

3) we can't use CLASS, since the pinned_mountpoint is placed on
hlist during initialization, so we define local macros:
	LOCK_MOUNT(mp, path)
	LOCK_MOUNT_MAYBE_BENEATH(mp, path, beneath)
	LOCK_MOUNT_EXACT(mp, path)
All of them declare and initialize struct pinned_mountpoint mp,
with unlock_mount done via __cleanup().

Users converted.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 219 ++++++++++++++++++++++++------------------------- 1 file changed, 108 insertions(+), 111 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 5819a50d7d67..8d6e26e2c97a 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -919,6 +919,7 @@ bool __is_local_mountpoint(const struct dentry *dentry) struct pinned_mountpoint { struct hlist_node node; struct mountpoint *mp; + struct mount *parent; }; static bool lookup_mountpoint(struct dentry *dentry, struct pinned_mountpoint *m) @@ -2728,48 +2729,47 @@ static int attach_recursive_mnt(struct mount *source_mnt, } /** - * do_lock_mount - lock mount and mountpoint - * @path: target path - * @beneath: whether the intention is to mount beneath @path + * do_lock_mount - acquire environment for mounting + * @path: target path + * @res: context to set up + * @beneath: whether the intention is to mount beneath @path * - * Follow the mount stack on @path until the top mount @mnt is found. If - * the initial @path->{mnt,dentry} is a mountpoint lookup the first - * mount stacked on top of it. Then simply follow @{mnt,mnt->mnt_root} - * until nothing is stacked on top of it anymore. + * To mount something at given location, we need + * namespace_sem locked exclusive + * inode of dentry we are mounting on locked exclusive + * struct mountpoint for that dentry + * struct mount we are mounting on * - * Acquire the inode_lock() on the top mount's ->mnt_root to protect - * against concurrent removal of the new mountpoint from another mount - * namespace. + * Results are stored in caller-supplied context (pinned_mountpoint); + * on success we have res->parent and res->mp pointing to parent and + * mountpoint respectively and res->node inserted into the ->m_list + * of the mountpoint, making sure the mountpoint won't disappear. + * On failure we have res->parent set to ERR_PTR(-E...), res->mp + * left NULL, res->node - empty. 
+ * In case of success do_lock_mount returns with locks acquired (in + * proper order - inode lock nests outside of namespace_sem). * - * If @beneath is requested, acquire inode_lock() on @mnt's mountpoint - * @mp on @mnt->mnt_parent must be acquired. This protects against a - * concurrent unlink of @mp->mnt_dentry from another mount namespace - * where @mnt doesn't have a child mount mounted @mp. A concurrent - * removal of @mnt->mnt_root doesn't matter as nothing will be mounted - * on top of it for @beneath. + * Request to mount on overmounted location is treated as "mount on + * top of whatever's overmounting it"; request to mount beneath + * a location - "mount immediately beneath the topmost mount at that + * place". * - * In addition, @beneath needs to make sure that @mnt hasn't been - * unmounted or moved from its current mountpoint in between dropping - * @mount_lock and acquiring @namespace_sem. For the !@beneath case @mnt - * being unmounted would be detected later by e.g., calling - * check_mnt(mnt) in the function it's called from. For the @beneath - * case however, it's useful to detect it directly in do_lock_mount(). - * If @mnt hasn't been unmounted then @mnt->mnt_mountpoint still points - * to @mnt->mnt_mp->m_dentry. But if @mnt has been unmounted it will - * point to @mnt->mnt_root and @mnt->mnt_mp will be NULL. - * - * Return: Either the target mountpoint on the top mount or the top - * mount's mountpoint. + * In all cases the location must not have been unmounted and the + * chosen mountpoint must be allowed to be mounted on. For "beneath" + * case we also require the location to be at the root of a mount + * that has a parent (i.e. is not a root of some namespace). 
*/ -static int do_lock_mount(struct path *path, struct pinned_mountpoint *pinned, bool beneath) +static void do_lock_mount(struct path *path, struct pinned_mountpoint *res, bool beneath) { struct vfsmount *mnt = path->mnt; struct dentry *dentry; struct path under = {}; int err = -ENOENT; - if (unlikely(beneath) && !path_mounted(path)) - return -EINVAL; + if (unlikely(beneath) && !path_mounted(path)) { + res->parent = ERR_PTR(-EINVAL); + return; + } for (;;) { struct mount *m = real_mount(mnt); @@ -2779,7 +2779,8 @@ static int do_lock_mount(struct path *path, struct pinned_mountpoint *pinned, bo read_seqlock_excl(&mount_lock); if (unlikely(!mnt_has_parent(m))) { read_sequnlock_excl(&mount_lock); - return -EINVAL; + res->parent = ERR_PTR(-EINVAL); + return; } under.mnt = mntget(&m->mnt_parent->mnt); under.dentry = dget(m->mnt_mountpoint); @@ -2811,7 +2812,7 @@ static int do_lock_mount(struct path *path, struct pinned_mountpoint *pinned, bo path->dentry = dget(mnt->mnt_root); continue; // got overmounted } - err = get_mountpoint(dentry, pinned); + err = get_mountpoint(dentry, res); if (err) break; if (beneath) { @@ -2822,22 +2823,25 @@ static int do_lock_mount(struct path *path, struct pinned_mountpoint *pinned, bo * we are not dropping the final references here). 
*/ path_put(&under); + res->parent = real_mount(path->mnt)->mnt_parent; + return; } - return 0; + res->parent = real_mount(path->mnt); + return; } namespace_unlock(); inode_unlock(dentry->d_inode); if (beneath) path_put(&under); - return err; + res->parent = ERR_PTR(err); } -static inline int lock_mount(struct path *path, struct pinned_mountpoint *m) +static inline void lock_mount(struct path *path, struct pinned_mountpoint *m) { - return do_lock_mount(path, m, false); + do_lock_mount(path, m, false); } -static void unlock_mount(struct pinned_mountpoint *m) +static void __unlock_mount(struct pinned_mountpoint *m) { inode_unlock(m->mp->m_dentry->d_inode); read_seqlock_excl(&mount_lock); @@ -2846,6 +2850,20 @@ static void unlock_mount(struct pinned_mountpoint *m) namespace_unlock(); } +static inline void unlock_mount(struct pinned_mountpoint *m) +{ + if (!IS_ERR(m->parent)) + __unlock_mount(m); +} + +#define LOCK_MOUNT_MAYBE_BENEATH(mp, path, beneath) \ + struct pinned_mountpoint mp __cleanup(unlock_mount) = {}; \ + do_lock_mount((path), &mp, (beneath)) +#define LOCK_MOUNT(mp, path) LOCK_MOUNT_MAYBE_BENEATH(mp, (path), false) +#define LOCK_MOUNT_EXACT(mp, path) \ + struct pinned_mountpoint mp __cleanup(unlock_mount) = {}; \ + lock_mount_exact((path), &mp) + static int graft_tree(struct mount *mnt, struct mount *p, struct mountpoint *mp) { if (mnt->mnt.mnt_sb->s_flags & SB_NOUSER) @@ -3015,8 +3033,7 @@ static int do_loopback(struct path *path, const char *old_name, int recurse) { struct path old_path __free(path_put) = {}; - struct mount *mnt = NULL, *parent; - struct pinned_mountpoint mp = {}; + struct mount *mnt = NULL; int err; if (!old_name || !*old_name) return -EINVAL; @@ -3027,28 +3044,23 @@ static int do_loopback(struct path *path, const char *old_name, if (mnt_ns_loop(old_path.dentry)) return -EINVAL; - err = lock_mount(path, &mp); - if (err) - return err; + LOCK_MOUNT(mp, path); + if (IS_ERR(mp.parent)) + return PTR_ERR(mp.parent); - parent = 
real_mount(path->mnt); - if (!check_mnt(parent)) - goto out2; + if (!check_mnt(mp.parent)) + return -EINVAL; mnt = __do_loopback(&old_path, recurse); - if (IS_ERR(mnt)) { - err = PTR_ERR(mnt); - goto out2; - } + if (IS_ERR(mnt)) + return PTR_ERR(mnt); - err = graft_tree(mnt, parent, mp.mp); + err = graft_tree(mnt, mp.parent, mp.mp); if (err) { lock_mount_hash(); umount_tree(mnt, UMOUNT_SYNC); unlock_mount_hash(); } -out2: - unlock_mount(&mp); return err; } @@ -3561,7 +3573,6 @@ static int do_move_mount(struct path *old_path, { struct mount *p; struct mount *old = real_mount(old_path->mnt); - struct pinned_mountpoint mp; int err; bool beneath = flags & MNT_TREE_BENEATH; @@ -3571,52 +3582,49 @@ static int do_move_mount(struct path *old_path, if (d_is_dir(new_path->dentry) != d_is_dir(old_path->dentry)) return -EINVAL; - err = do_lock_mount(new_path, &mp, beneath); - if (err) - return err; + LOCK_MOUNT_MAYBE_BENEATH(mp, new_path, beneath); + if (IS_ERR(mp.parent)) + return PTR_ERR(mp.parent); p = real_mount(new_path->mnt); - err = -EINVAL; - if (check_mnt(old)) { /* if the source is in our namespace... */ /* ... it should be detachable from parent */ if (!mnt_has_parent(old) || IS_MNT_LOCKED(old)) - goto out; + return -EINVAL; /* ... which should not be shared */ if (IS_MNT_SHARED(old->mnt_parent)) - goto out; + return -EINVAL; /* ... and the target should be in our namespace */ if (!check_mnt(p)) - goto out; + return -EINVAL; } else { /* * otherwise the source must be the root of some anon namespace. */ if (!anon_ns_root(old)) - goto out; + return -EINVAL; /* * Bail out early if the target is within the same namespace - * subsequent checks would've rejected that, but they lose * some corner cases if we check it early. */ if (old->mnt_ns == p->mnt_ns) - goto out; + return -EINVAL; /* * Target should be either in our namespace or in an acceptable * anon namespace, sensu check_anonymous_mnt(). 
*/ if (!may_use_mount(p)) - goto out; + return -EINVAL; } if (beneath) { err = can_move_mount_beneath(old, new_path, mp.mp); if (err) - goto out; + return err; - err = -EINVAL; p = p->mnt_parent; } @@ -3625,17 +3633,13 @@ static int do_move_mount(struct path *old_path, * mount which is shared. */ if (IS_MNT_SHARED(p) && tree_contains_unbindable(old)) - goto out; - err = -ELOOP; + return -EINVAL; if (!check_for_nsfs_mounts(old)) - goto out; + return -ELOOP; if (mount_is_ancestor(old, p)) - goto out; + return -ELOOP; - err = attach_recursive_mnt(old, p, mp.mp); -out: - unlock_mount(&mp); - return err; + return attach_recursive_mnt(old, p, mp.mp); } static int do_move_mount_old(struct path *path, const char *old_name) @@ -3694,7 +3698,6 @@ static bool mount_too_revealing(const struct super_block *sb, int *new_mnt_flags static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint, unsigned int mnt_flags) { - struct pinned_mountpoint mp = {}; struct super_block *sb = fc->root->d_sb; int error; @@ -3715,13 +3718,14 @@ static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint, mnt_warn_timestamp_expiry(mountpoint, mnt); - error = lock_mount(mountpoint, &mp); - if (!error) { + LOCK_MOUNT(mp, mountpoint); + if (IS_ERR(mp.parent)) { + return PTR_ERR(mp.parent); + } else { error = do_add_mount(real_mount(mnt), mp.mp, mountpoint, mnt_flags); if (!error) retain_and_null_ptr(mnt); // consumed on success - unlock_mount(&mp); } return error; } @@ -3785,8 +3789,8 @@ static int do_new_mount(struct path *path, const char *fstype, int sb_flags, return err; } -static int lock_mount_exact(const struct path *path, - struct pinned_mountpoint *mp) +static void lock_mount_exact(const struct path *path, + struct pinned_mountpoint *mp) { struct dentry *dentry = path->dentry; int err; @@ -3802,14 +3806,15 @@ static int lock_mount_exact(const struct path *path, if (unlikely(err)) { namespace_unlock(); inode_unlock(dentry->d_inode); + mp->parent = ERR_PTR(err); + } 
else { + mp->parent = real_mount(path->mnt); } - return err; } int finish_automount(struct vfsmount *__m, const struct path *path) { struct vfsmount *m __free(mntput) = __m; - struct pinned_mountpoint mp = {}; struct mount *mnt; int err; @@ -3828,15 +3833,14 @@ int finish_automount(struct vfsmount *__m, const struct path *path) * that overmounts our mountpoint to be means "quitely drop what we've * got", not "try to mount it on top". */ - err = lock_mount_exact(path, &mp); - if (unlikely(err)) - return err == -EBUSY ? 0 : err; + LOCK_MOUNT_EXACT(mp, path); + if (IS_ERR(mp.parent)) + return mp.parent == ERR_PTR(-EBUSY) ? 0 : PTR_ERR(mp.parent); err = do_add_mount(mnt, mp.mp, path, path->mnt->mnt_flags | MNT_SHRINKABLE); if (likely(!err)) retain_and_null_ptr(m); - unlock_mount(&mp); return err; } @@ -4633,7 +4637,6 @@ SYSCALL_DEFINE2(pivot_root, const char __user *, new_root, struct path old __free(path_put) = {}; struct path root __free(path_put) = {}; struct mount *new_mnt, *root_mnt, *old_mnt, *root_parent, *ex_parent; - struct pinned_mountpoint old_mp = {}; int error; if (!may_mount()) @@ -4654,45 +4657,42 @@ SYSCALL_DEFINE2(pivot_root, const char __user *, new_root, return error; get_fs_root(current->fs, &root); - error = lock_mount(&old, &old_mp); - if (error) - return error; - error = -EINVAL; + LOCK_MOUNT(old_mp, &old); + old_mnt = old_mp.parent; + if (IS_ERR(old_mnt)) + return PTR_ERR(old_mnt); + new_mnt = real_mount(new.mnt); root_mnt = real_mount(root.mnt); - old_mnt = real_mount(old.mnt); ex_parent = new_mnt->mnt_parent; root_parent = root_mnt->mnt_parent; if (IS_MNT_SHARED(old_mnt) || IS_MNT_SHARED(ex_parent) || IS_MNT_SHARED(root_parent)) - goto out4; + return -EINVAL; if (!check_mnt(root_mnt) || !check_mnt(new_mnt)) - goto out4; + return -EINVAL; if (new_mnt->mnt.mnt_flags & MNT_LOCKED) - goto out4; - error = -ENOENT; + return -EINVAL; if (d_unlinked(new.dentry)) - goto out4; - error = -EBUSY; + return -ENOENT; if (new_mnt == root_mnt || old_mnt == 
root_mnt) - goto out4; /* loop, on the same file system */ - error = -EINVAL; + return -EBUSY; /* loop, on the same file system */ if (!path_mounted(&root)) - goto out4; /* not a mountpoint */ + return -EINVAL; /* not a mountpoint */ if (!mnt_has_parent(root_mnt)) - goto out4; /* absolute root */ + return -EINVAL; /* absolute root */ if (!path_mounted(&new)) - goto out4; /* not a mountpoint */ + return -EINVAL; /* not a mountpoint */ if (!mnt_has_parent(new_mnt)) - goto out4; /* absolute root */ + return -EINVAL; /* absolute root */ /* make sure we can reach put_old from new_root */ if (!is_path_reachable(old_mnt, old.dentry, &new)) - goto out4; + return -EINVAL; /* make certain new is below the root */ if (!is_path_reachable(new_mnt, new.dentry, &root)) - goto out4; + return -EINVAL; lock_mount_hash(); umount_mnt(new_mnt); if (root_mnt->mnt.mnt_flags & MNT_LOCKED) { @@ -4711,10 +4711,7 @@ SYSCALL_DEFINE2(pivot_root, const char __user *, new_root, mnt_notify_add(root_mnt); mnt_notify_add(new_mnt); chroot_fs_refs(&root, &new); - error = 0; -out4: - unlock_mount(&old_mp); - return error; + return 0; } static unsigned int recalc_flags(struct mount_kattr *kattr, struct mount *mnt) -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
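
The calling convention introduced by the patch above can be tried out in
userspace.  The sketch below is only an illustration under stated
assumptions: struct node, struct pinned, do_lock(), unlock(), LOCK_NODE()
and use_node() are invented names, ERR_PTR()/PTR_ERR()/IS_ERR() are
re-created locally rather than taken from the kernel headers, and the
cleanup attribute needs GCC or Clang.  It shows a lock function that
returns void and reports its outcome through the ->parent field, an
unlock that does nothing after a failed attempt, and a macro that
declares the context with the unlock attached.

```c
#include <stdint.h>
#include <stddef.h>

#define MAX_ERRNO 4095

/* Userspace re-creations of the kernel's error-pointer helpers. */
static inline void *ERR_PTR(long err) { return (void *)(intptr_t)err; }
static inline long PTR_ERR(const void *p) { return (long)(intptr_t)p; }
static inline int IS_ERR(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical stand-in for a mount, with a toy lock counter. */
struct node {
	int locked;
	int busy;
};

/* Hypothetical stand-in for struct pinned_mountpoint. */
struct pinned {
	struct node *parent;	/* valid pointer or ERR_PTR(-E...) */
};

/* Returns void; the outcome is reported through res->parent. */
static void do_lock(struct node *n, struct pinned *res)
{
	if (n->busy) {
		res->parent = ERR_PTR(-16);	/* -EBUSY, nothing locked */
		return;
	}
	n->locked++;
	res->parent = n;
}

/* Unlock only if the lock attempt had succeeded. */
static void unlock(struct pinned *p)
{
	if (!IS_ERR(p->parent))
		p->parent->locked--;
}

/* Declare the context, lock, and arrange for unlock at scope exit. */
#define LOCK_NODE(name, n) \
	struct pinned name __attribute__((cleanup(unlock))) = { NULL }; \
	do_lock((n), &name)

/* Caller: one error check, then plain returns with no unlock labels. */
static long use_node(struct node *n, int *seen_locked)
{
	LOCK_NODE(p, n);
	if (IS_ERR(p.parent))
		return PTR_ERR(p.parent);
	*seen_locked = p.parent->locked;
	return 0;
}
```

The macro expands to a declaration followed by a statement, so it can
only be used where a declaration is allowed; that is one reason to keep
such macros local to a single file, as the patch does.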
* [PATCH 28/52] do_move_mount(): use the parent mount returned by do_lock_mount() 2025-08-25 4:43 ` [PATCH 01/52] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (25 preceding siblings ...) 2025-08-25 4:43 ` [PATCH 27/52] change calling conventions for lock_mount() et.al Al Viro @ 2025-08-25 4:43 ` Al Viro 2025-08-25 4:43 ` [PATCH 29/52] do_add_mount(): switch to passing pinned_mountpoint instead of mountpoint + path Al Viro ` (24 subsequent siblings) 51 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-08-25 4:43 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds After successful do_lock_mount() call, mp.parent is set to either real_mount(path->mnt) (for !beneath case) or to ->mnt_parent of that (for beneath). p is set to real_mount(path->mnt) and after several uses it's made equal to mp.parent. All uses prior to that care only about p->mnt_ns and since p->mnt_ns == parent->mnt_ns, we might as well use mp.parent all along. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 17 ++++++----------- 1 file changed, 6 insertions(+), 11 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 8d6e26e2c97a..05019dde25a0 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3571,7 +3571,6 @@ static inline bool may_use_mount(struct mount *mnt) static int do_move_mount(struct path *old_path, struct path *new_path, enum mnt_tree_flags_t flags) { - struct mount *p; struct mount *old = real_mount(old_path->mnt); int err; bool beneath = flags & MNT_TREE_BENEATH; @@ -3586,8 +3585,6 @@ static int do_move_mount(struct path *old_path, if (IS_ERR(mp.parent)) return PTR_ERR(mp.parent); - p = real_mount(new_path->mnt); - if (check_mnt(old)) { /* if the source is in our namespace... */ /* ... it should be detachable from parent */ @@ -3597,7 +3594,7 @@ static int do_move_mount(struct path *old_path, if (IS_MNT_SHARED(old->mnt_parent)) return -EINVAL; /* ... 
and the target should be in our namespace */ - if (!check_mnt(p)) + if (!check_mnt(mp.parent)) return -EINVAL; } else { /* @@ -3610,13 +3607,13 @@ static int do_move_mount(struct path *old_path, * subsequent checks would've rejected that, but they lose * some corner cases if we check it early. */ - if (old->mnt_ns == p->mnt_ns) + if (old->mnt_ns == mp.parent->mnt_ns) return -EINVAL; /* * Target should be either in our namespace or in an acceptable * anon namespace, sensu check_anonymous_mnt(). */ - if (!may_use_mount(p)) + if (!may_use_mount(mp.parent)) return -EINVAL; } @@ -3624,22 +3621,20 @@ static int do_move_mount(struct path *old_path, err = can_move_mount_beneath(old, new_path, mp.mp); if (err) return err; - - p = p->mnt_parent; } /* * Don't move a mount tree containing unbindable mounts to a destination * mount which is shared. */ - if (IS_MNT_SHARED(p) && tree_contains_unbindable(old)) + if (IS_MNT_SHARED(mp.parent) && tree_contains_unbindable(old)) return -EINVAL; if (!check_for_nsfs_mounts(old)) return -ELOOP; - if (mount_is_ancestor(old, p)) + if (mount_is_ancestor(old, mp.parent)) return -ELOOP; - return attach_recursive_mnt(old, p, mp.mp); + return attach_recursive_mnt(old, mp.parent, mp.mp); } static int do_move_mount_old(struct path *path, const char *old_name) -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH 29/52] do_add_mount(): switch to passing pinned_mountpoint instead of mountpoint + path 2025-08-25 4:43 ` [PATCH 01/52] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (26 preceding siblings ...) 2025-08-25 4:43 ` [PATCH 28/52] do_move_mount(): use the parent mount returned by do_lock_mount() Al Viro @ 2025-08-25 4:43 ` Al Viro 2025-08-25 4:43 ` [PATCH 30/52] graft_tree(), attach_recursive_mnt() - pass pinned_mountpoint Al Viro ` (23 subsequent siblings) 51 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-08-25 4:43 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Both callers pass it a mountpoint reference picked from pinned_mountpoint and path it corresponds to. First of all, path->dentry is equal to mp.mp->m_dentry. Furthermore, path->mnt is &mp.parent->mnt, making struct path contents redundant. Pass it the address of that pinned_mountpoint instead; what's more, if we teach it to treat ERR_PTR(error) in ->parent as "bail out with that error" we can simplify the callers even more - do_add_mount() will do the right thing even when called after lock_mount() failure. 
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 32 +++++++++++++++----------------- 1 file changed, 15 insertions(+), 17 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 05019dde25a0..06c672127aee 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3657,10 +3657,13 @@ static int do_move_mount_old(struct path *path, const char *old_name) /* * add a mount into a namespace's mount tree */ -static int do_add_mount(struct mount *newmnt, struct mountpoint *mp, - const struct path *path, int mnt_flags) +static int do_add_mount(struct mount *newmnt, const struct pinned_mountpoint *mp, + int mnt_flags) { - struct mount *parent = real_mount(path->mnt); + struct mount *parent = mp->parent; + + if (IS_ERR(parent)) + return PTR_ERR(parent); mnt_flags &= ~MNT_INTERNAL_FLAGS; @@ -3674,14 +3677,15 @@ static int do_add_mount(struct mount *newmnt, struct mountpoint *mp, } /* Refuse the same filesystem on the same mount point */ - if (path->mnt->mnt_sb == newmnt->mnt.mnt_sb && path_mounted(path)) + if (parent->mnt.mnt_sb == newmnt->mnt.mnt_sb && + parent->mnt.mnt_root == mp->mp->m_dentry) return -EBUSY; if (d_is_symlink(newmnt->mnt.mnt_root)) return -EINVAL; newmnt->mnt.mnt_flags = mnt_flags; - return graft_tree(newmnt, parent, mp); + return graft_tree(newmnt, parent, mp->mp); } static bool mount_too_revealing(const struct super_block *sb, int *new_mnt_flags); @@ -3714,14 +3718,9 @@ static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint, mnt_warn_timestamp_expiry(mountpoint, mnt); LOCK_MOUNT(mp, mountpoint); - if (IS_ERR(mp.parent)) { - return PTR_ERR(mp.parent); - } else { - error = do_add_mount(real_mount(mnt), mp.mp, - mountpoint, mnt_flags); - if (!error) - retain_and_null_ptr(mnt); // consumed on success - } + error = do_add_mount(real_mount(mnt), &mp, mnt_flags); + if (!error) + retain_and_null_ptr(mnt); // consumed on success return error; } @@ -3829,11 +3828,10 @@ int finish_automount(struct vfsmount *__m, const struct 
path *path) * got", not "try to mount it on top". */ LOCK_MOUNT_EXACT(mp, path); - if (IS_ERR(mp.parent)) - return mp.parent == ERR_PTR(-EBUSY) ? 0 : PTR_ERR(mp.parent); + if (mp.parent == ERR_PTR(-EBUSY)) + return 0; - err = do_add_mount(mnt, mp.mp, path, - path->mnt->mnt_flags | MNT_SHRINKABLE); + err = do_add_mount(mnt, &mp, path->mnt->mnt_flags | MNT_SHRINKABLE); if (likely(!err)) retain_and_null_ptr(m); return err; -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
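
A small userspace model of the simplification in the patch above, in
case it helps review.  Everything here is hypothetical (struct node,
struct pinned, add_child(), mount_new(), mount_auto()), and the
error-pointer helpers are local re-creations.  The point is only the
shape: once the callee itself treats an ERR_PTR() in ->parent as "bail
out with that error", an ordinary caller can call it unconditionally,
and a caller that wants to map one particular failure to success tests
for just that value first.

```c
#include <stdint.h>
#include <stddef.h>

/* Userspace re-creations of the kernel's error-pointer helpers. */
static inline void *ERR_PTR(long err) { return (void *)(intptr_t)err; }
static inline long PTR_ERR(const void *p) { return (long)(intptr_t)p; }
static inline int IS_ERR(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-4095;
}

/* Hypothetical stand-ins for struct mount and struct pinned_mountpoint. */
struct node {
	int children;
};

struct pinned {
	struct node *parent;	/* valid pointer or ERR_PTR(-E...) */
};

/* Callee checks the context itself, so callers need no IS_ERR() test. */
static long add_child(const struct pinned *mp)
{
	struct node *parent = mp->parent;

	if (IS_ERR(parent))
		return PTR_ERR(parent);	/* pass the lock failure through */
	parent->children++;
	return 0;
}

/* Plain caller: forwards whatever add_child() reports. */
static long mount_new(const struct pinned *mp)
{
	return add_child(mp);
}

/* Caller that maps one specific failure (-EBUSY, i.e. -16) to success. */
static long mount_auto(const struct pinned *mp)
{
	if (mp->parent == ERR_PTR(-16))
		return 0;
	return add_child(mp);
}
```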
* [PATCH 30/52] graft_tree(), attach_recursive_mnt() - pass pinned_mountpoint
  2025-08-25 4:43 ` [PATCH 01/52] fs/namespace.c: fix the namespace_sem guard mess Al Viro
  ` (27 preceding siblings ...)
  2025-08-25 4:43 ` [PATCH 29/52] do_add_mount(): switch to passing pinned_mountpoint instead of mountpoint + path Al Viro
@ 2025-08-25 4:43 ` Al Viro
  2025-08-25 4:43 ` [PATCH 31/52] pivot_root(2): use old_mp.mp->m_dentry instead of old.dentry Al Viro
  ` (22 subsequent siblings)
  51 siblings, 0 replies; 321+ messages in thread
From: Al Viro @ 2025-08-25 4:43 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: brauner, jack, torvalds

parent and mountpoint always come from the same struct pinned_mountpoint
now.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/namespace.c | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 06c672127aee..9ffdbb093f57 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2613,10 +2613,11 @@ enum mnt_tree_flags_t {
  * Otherwise a negative error code is returned.
  */
 static int attach_recursive_mnt(struct mount *source_mnt,
-				struct mount *dest_mnt,
-				struct mountpoint *dest_mp)
+				const struct pinned_mountpoint *dest)
 {
 	struct user_namespace *user_ns = current->nsproxy->mnt_ns->user_ns;
+	struct mount *dest_mnt = dest->parent;
+	struct mountpoint *dest_mp = dest->mp;
 	HLIST_HEAD(tree_list);
 	struct mnt_namespace *ns = dest_mnt->mnt_ns;
 	struct pinned_mountpoint root = {};
@@ -2864,16 +2865,16 @@ static inline void unlock_mount(struct pinned_mountpoint *m)
 	struct pinned_mountpoint mp __cleanup(unlock_mount) = {}; \
 	lock_mount_exact((path), &mp)
 
-static int graft_tree(struct mount *mnt, struct mount *p, struct mountpoint *mp)
+static int graft_tree(struct mount *mnt, const struct pinned_mountpoint *mp)
 {
 	if (mnt->mnt.mnt_sb->s_flags & SB_NOUSER)
 		return -EINVAL;
 
-	if (d_is_dir(mp->m_dentry) !=
+	if (d_is_dir(mp->mp->m_dentry) !=
 	      d_is_dir(mnt->mnt.mnt_root))
 		return -ENOTDIR;
 
-	return attach_recursive_mnt(mnt, p, mp);
+	return attach_recursive_mnt(mnt, mp);
 }
 
 static int may_change_propagation(const struct mount *m)
@@ -3055,7 +3056,7 @@ static int do_loopback(struct path *path, const char *old_name,
 	if (IS_ERR(mnt))
 		return PTR_ERR(mnt);
 
-	err = graft_tree(mnt, mp.parent, mp.mp);
+	err = graft_tree(mnt, &mp);
 	if (err) {
 		lock_mount_hash();
 		umount_tree(mnt, UMOUNT_SYNC);
@@ -3634,7 +3635,7 @@ static int do_move_mount(struct path *old_path,
 	if (mount_is_ancestor(old, mp.parent))
 		return -ELOOP;
 
-	return attach_recursive_mnt(old, mp.parent, mp.mp);
+	return attach_recursive_mnt(old, &mp);
 }
 
 static int do_move_mount_old(struct path *path, const char *old_name)
@@ -3685,7 +3686,7 @@ static int do_add_mount(struct mount *newmnt, const struct pinned_mountpoint *mp
 		return -EINVAL;
 
 	newmnt->mnt.mnt_flags = mnt_flags;
-	return graft_tree(newmnt, parent, mp->mp);
+	return graft_tree(newmnt, mp);
 }
 
 static bool mount_too_revealing(const struct super_block *sb, int *new_mnt_flags);
-- 
2.47.2

^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH 31/52] pivot_root(2): use old_mp.mp->m_dentry instead of old.dentry 2025-08-25 4:43 ` [PATCH 01/52] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (28 preceding siblings ...) 2025-08-25 4:43 ` [PATCH 30/52] graft_tree(), attach_recursive_mnt() - pass pinned_mountpoint Al Viro @ 2025-08-25 4:43 ` Al Viro 2025-08-25 13:43 ` Christian Brauner 2025-08-25 4:43 ` [PATCH 32/52] don't bother passing new_path->dentry to can_move_mount_beneath() Al Viro ` (21 subsequent siblings) 51 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-25 4:43 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds

That kills the last place where callers of lock_mount(path, &mp)
used path->dentry.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/namespace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 9ffdbb093f57..494433d2e04b 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -4682,7 +4682,7 @@ SYSCALL_DEFINE2(pivot_root, const char __user *, new_root,
 	if (!mnt_has_parent(new_mnt))
 		return -EINVAL; /* absolute root */
 	/* make sure we can reach put_old from new_root */
-	if (!is_path_reachable(old_mnt, old.dentry, &new))
+	if (!is_path_reachable(old_mnt, old_mp.mp->m_dentry, &new))
 		return -EINVAL;
 	/* make certain new is below the root */
 	if (!is_path_reachable(new_mnt, new.dentry, &root))
-- 
2.47.2
^ permalink raw reply related [flat|nested] 321+ messages in thread
* Re: [PATCH 31/52] pivot_root(2): use old_mp.mp->m_dentry instead of old.dentry 2025-08-25 4:43 ` [PATCH 31/52] pivot_root(2): use old_mp.mp->m_dentry instead of old.dentry Al Viro @ 2025-08-25 13:43 ` Christian Brauner 0 siblings, 0 replies; 321+ messages in thread From: Christian Brauner @ 2025-08-25 13:43 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, jack, torvalds On Mon, Aug 25, 2025 at 05:43:34AM +0100, Al Viro wrote: > That kills the last place where callers of lock_mount(path, &mp) > used path->dentry. > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > --- Reviewed-by: Christian Brauner <brauner@kernel.org> ^ permalink raw reply [flat|nested] 321+ messages in thread
* [PATCH 32/52] don't bother passing new_path->dentry to can_move_mount_beneath() 2025-08-25 4:43 ` [PATCH 01/52] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (29 preceding siblings ...) 2025-08-25 4:43 ` [PATCH 31/52] pivot_root(2): use old_mp.mp->m_dentry instead of old.dentry Al Viro @ 2025-08-25 4:43 ` Al Viro 2025-08-25 4:43 ` [PATCH 33/52] new helper: topmost_overmount() Al Viro ` (20 subsequent siblings) 51 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-08-25 4:43 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 11 +++++------ 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 494433d2e04b..7d51763fc76c 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3451,8 +3451,8 @@ static bool mount_is_ancestor(const struct mount *p1, const struct mount *p2) /** * can_move_mount_beneath - check that we can mount beneath the top mount * @mnt_from: mount we are trying to move - * @to: mount under which to mount - * @mp: mountpoint of @to + * @mnt_to: mount under which to mount + * @mp: mountpoint of @mnt_to * * - Make sure that nothing can be mounted beneath the caller's current * root or the rootfs of the namespace. @@ -3468,11 +3468,10 @@ static bool mount_is_ancestor(const struct mount *p1, const struct mount *p2) * Return: On success 0, and on error a negative error code is returned. 
*/ static int can_move_mount_beneath(struct mount *mnt_from, - const struct path *to, + struct mount *mnt_to, const struct mountpoint *mp) { - struct mount *mnt_to = real_mount(to->mnt), - *parent_mnt_to = mnt_to->mnt_parent; + struct mount *parent_mnt_to = mnt_to->mnt_parent; if (IS_MNT_LOCKED(mnt_to)) return -EINVAL; @@ -3619,7 +3618,7 @@ static int do_move_mount(struct path *old_path, } if (beneath) { - err = can_move_mount_beneath(old, new_path, mp.mp); + err = can_move_mount_beneath(old, real_mount(new_path->mnt), mp.mp); if (err) return err; } -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH 33/52] new helper: topmost_overmount() 2025-08-25 4:43 ` [PATCH 01/52] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (30 preceding siblings ...) 2025-08-25 4:43 ` [PATCH 32/52] don't bother passing new_path->dentry to can_move_mount_beneath() Al Viro @ 2025-08-25 4:43 ` Al Viro 2025-08-25 13:43 ` Christian Brauner 2025-08-25 4:43 ` [PATCH 34/52] do_lock_mount(): don't modify path Al Viro ` (19 subsequent siblings) 51 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-25 4:43 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Returns the final (topmost) mount in the chain of overmounts starting at given mount. Same locking rules as for any mount tree traversal - either the spinlock side of mount_lock, or rcu + sample the seqcount side of mount_lock before the call and recheck afterwards. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/mount.h | 7 +++++++ fs/namespace.c | 9 +++------ 2 files changed, 10 insertions(+), 6 deletions(-) diff --git a/fs/mount.h b/fs/mount.h index ed8c83ba836a..04d0eadc4c10 100644 --- a/fs/mount.h +++ b/fs/mount.h @@ -235,4 +235,11 @@ static inline void mnt_notify_add(struct mount *m) } #endif +static inline struct mount *topmost_overmount(struct mount *m) +{ + while (m->overmount) + m = m->overmount; + return m; +} + struct mnt_namespace *mnt_ns_from_dentry(struct dentry *dentry); diff --git a/fs/namespace.c b/fs/namespace.c index 7d51763fc76c..93eba16e42b6 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2697,10 +2697,9 @@ static int attach_recursive_mnt(struct mount *source_mnt, child->mnt_mountpoint); commit_tree(child); if (q) { + struct mount *r = topmost_overmount(child); struct mountpoint *mp = root.mp; - struct mount *r = child; - while (unlikely(r->overmount)) - r = r->overmount; + if (unlikely(shorter) && child != source_mnt) mp = shorter; mnt_change_mountpoint(r, mp, q); @@ -6178,9 +6177,7 @@ bool current_chrooted(void) guard(mount_locked_reader)(); - root = 
current->nsproxy->mnt_ns->root; - while (unlikely(root->overmount)) - root = root->overmount; + root = topmost_overmount(current->nsproxy->mnt_ns->root); return fs_root.mnt != &root->mnt || !path_mounted(&fs_root); } -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
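As a quick illustration of what the helper in 33/52 computes, here is a userspace model of the overmount chain. struct toy_mount and toy_topmost_overmount() are made-up names for this sketch; the locking rules from the commit message (the spinlock side of mount_lock, or rcu plus a seqcount recheck) are not modelled at all.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct mount; only the field the helper walks. */
struct toy_mount {
	struct toy_mount *overmount;	/* mount stacked on our root, or NULL */
};

/* Same loop as topmost_overmount(): follow ->overmount until it ends. */
static inline struct toy_mount *toy_topmost_overmount(struct toy_mount *m)
{
	while (m->overmount)
		m = m->overmount;
	return m;
}
```

A mount with nothing stacked on it is its own topmost overmount, which is why both call sites in the patch can use the helper unconditionally.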
* Re: [PATCH 33/52] new helper: topmost_overmount() 2025-08-25 4:43 ` [PATCH 33/52] new helper: topmost_overmount() Al Viro @ 2025-08-25 13:43 ` Christian Brauner 0 siblings, 0 replies; 321+ messages in thread From: Christian Brauner @ 2025-08-25 13:43 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, jack, torvalds On Mon, Aug 25, 2025 at 05:43:36AM +0100, Al Viro wrote: > Returns the final (topmost) mount in the chain of overmounts > starting at given mount. Same locking rules as for any mount > tree traversal - either the spinlock side of mount_lock, or > rcu + sample the seqcount side of mount_lock before the call > and recheck afterwards. > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > --- Reviewed-by: Christian Brauner <brauner@kernel.org> ^ permalink raw reply [flat|nested] 321+ messages in thread
* [PATCH 34/52] do_lock_mount(): don't modify path. 2025-08-25 4:43 ` [PATCH 01/52] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (31 preceding siblings ...) 2025-08-25 4:43 ` [PATCH 33/52] new helper: topmost_overmount() Al Viro @ 2025-08-25 4:43 ` Al Viro 2025-08-26 14:14 ` Askar Safin 2025-08-25 4:43 ` [PATCH 35/52] constify check_mnt() Al Viro ` (18 subsequent siblings) 51 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-25 4:43 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Currently do_lock_mount() has the target path switched to whatever might be overmounting it. We _do_ want to have the parent mount/mountpoint chosen on top of the overmounting pile; however, the way it's done has unpleasant races - if umount propagation removes the overmount while we'd been trying to set the environment up, we might end up failing if our target path strays into that overmount just before the overmount gets kicked out. Users of do_lock_mount() do not need the target path changed - they have all information in res->{parent,mp}; only one place (in do_move_mount()) currently uses the resulting path->mnt, and that value is trivial to reconstruct by the original value of path->mnt + chosen parent mount. 
Let's keep the target path unchanged; it avoids a bunch of subtle races
and it's not hard to do:
	do
		as mount_locked_reader
			find the prospective parent mount/mountpoint dentry
			grab references if it's not the original target
		lock the prospective mountpoint dentry
		take namespace_sem exclusive
		if prospective parent/mountpoint would be different now
			err = -EAGAIN
		else if location has been unmounted
			err = -ENOENT
		else if mountpoint dentry is not allowed to be mounted on
			err = -ENOENT
		else if beneath and the top of the pile was the absolute root
			err = -EINVAL
		else
			try to get struct mountpoint (by dentry), set
			err to 0 on success and -ENO{MEM,ENT} on failure
		if err != 0
			res->parent = ERR_PTR(err)
			drop locks
		else
			res->parent = prospective parent
		drop temporary references
	while err == -EAGAIN

A somewhat subtle part is that dropping temporary references is allowed.
Neither mounts nor dentries should be evicted by a thread that holds
namespace_sem. On success we are dropping those references under
namespace_sem, so we need to be sure that these are not the last
references remaining. However, on success we'd already verified (under
namespace_sem) that original target is still mounted and that mount
and dentry we are about to drop are still reachable from it via the
mount tree. That guarantees that we are not about to drop the last
remaining references.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 126 ++++++++++++++++++++++++++----------------------- 1 file changed, 68 insertions(+), 58 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 93eba16e42b6..f95e12ab6c9a 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2728,6 +2728,27 @@ static int attach_recursive_mnt(struct mount *source_mnt, return err; } +static inline struct mount *where_to_mount(const struct path *path, + struct dentry **dentry, + bool beneath) +{ + struct mount *m; + + if (unlikely(beneath)) { + m = topmost_overmount(real_mount(path->mnt)); + *dentry = m->mnt_mountpoint; + return m->mnt_parent; + } else { + m = __lookup_mnt(path->mnt, *dentry = path->dentry); + if (unlikely(m)) { + m = topmost_overmount(m); + *dentry = m->mnt.mnt_root; + return m; + } + return real_mount(path->mnt); + } +} + /** * do_lock_mount - acquire environment for mounting * @path: target path @@ -2759,84 +2780,69 @@ static int attach_recursive_mnt(struct mount *source_mnt, * case we also require the location to be at the root of a mount * that has a parent (i.e. is not a root of some namespace). 
*/ -static void do_lock_mount(struct path *path, struct pinned_mountpoint *res, bool beneath) +static void do_lock_mount(const struct path *path, + struct pinned_mountpoint *res, + bool beneath) { - struct vfsmount *mnt = path->mnt; - struct dentry *dentry; - struct path under = {}; - int err = -ENOENT; + int err; if (unlikely(beneath) && !path_mounted(path)) { res->parent = ERR_PTR(-EINVAL); return; } - for (;;) { - struct mount *m = real_mount(mnt); - - if (beneath) { - path_put(&under); - read_seqlock_excl(&mount_lock); - if (unlikely(!mnt_has_parent(m))) { - read_sequnlock_excl(&mount_lock); - res->parent = ERR_PTR(-EINVAL); - return; + do { + struct dentry *dentry, *d; + struct mount *m, *n; + + scoped_guard(mount_locked_reader) { + m = where_to_mount(path, &dentry, beneath); + if (&m->mnt != path->mnt) { + mntget(&m->mnt); + dget(dentry); } - under.mnt = mntget(&m->mnt_parent->mnt); - under.dentry = dget(m->mnt_mountpoint); - read_sequnlock_excl(&mount_lock); - dentry = under.dentry; - } else { - dentry = path->dentry; } inode_lock(dentry->d_inode); namespace_lock(); - if (unlikely(cant_mount(dentry) || !is_mounted(mnt))) - break; // not to be mounted on + // check if the chain of mounts (if any) has changed. 
+ scoped_guard(mount_locked_reader) + n = where_to_mount(path, &d, beneath); - if (beneath && unlikely(m->mnt_mountpoint != dentry || - &m->mnt_parent->mnt != under.mnt)) { - namespace_unlock(); - inode_unlock(dentry->d_inode); - continue; // got moved - } + if (unlikely(n != m || dentry != d)) + err = -EAGAIN; // something moved, retry + else if (unlikely(cant_mount(dentry) || !is_mounted(path->mnt))) + err = -ENOENT; // not to be mounted on + else if (beneath && &m->mnt == path->mnt && !m->overmount) + err = -EINVAL; + else + err = get_mountpoint(dentry, res); - mnt = lookup_mnt(path); - if (unlikely(mnt)) { + if (unlikely(err)) { + res->parent = ERR_PTR(err); namespace_unlock(); inode_unlock(dentry->d_inode); - path_put(path); - path->mnt = mnt; - path->dentry = dget(mnt->mnt_root); - continue; // got overmounted + } else { + res->parent = m; } - err = get_mountpoint(dentry, res); - if (err) - break; - if (beneath) { - /* - * @under duplicates the references that will stay - * at least until namespace_unlock(), so the path_put() - * below is safe (and OK to do under namespace_lock - - * we are not dropping the final references here). - */ - path_put(&under); - res->parent = real_mount(path->mnt)->mnt_parent; - return; + /* + * Drop the temporary references. This is subtle - on success + * we are doing that under namespace_sem, which would normally + * be forbidden. However, in that case we are guaranteed that + * refcounts won't reach zero, since we know that path->mnt + * is mounted and thus all mounts reachable from it are pinned + * and stable, along with their mountpoints and roots. 
+ */ + if (&m->mnt != path->mnt) { + dput(dentry); + mntput(&m->mnt); } - res->parent = real_mount(path->mnt); - return; - } - namespace_unlock(); - inode_unlock(dentry->d_inode); - if (beneath) - path_put(&under); - res->parent = ERR_PTR(err); + } while (err == -EAGAIN); } -static inline void lock_mount(struct path *path, struct pinned_mountpoint *m) +static inline void lock_mount(const struct path *path, + struct pinned_mountpoint *m) { do_lock_mount(path, m, false); } @@ -3617,7 +3623,11 @@ static int do_move_mount(struct path *old_path, } if (beneath) { - err = can_move_mount_beneath(old, real_mount(new_path->mnt), mp.mp); + struct mount *over = real_mount(new_path->mnt); + + if (mp.parent != over->mnt_parent) + over = mp.parent->overmount; + err = can_move_mount_beneath(old, over, mp.mp); if (err) return err; } -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
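The control flow of the new do_lock_mount() is a sample/lock/re-sample loop. A minimal userspace model of that loop, assuming a single integer stands in for the (parent, mountpoint) pair and a counter fakes concurrent changes; none of the toy_* names exist in the kernel, and reference counting and the actual locks are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy state (hypothetical): what a target currently resolves to. */
struct toy_ns {
	int current_parent;	/* stands in for the where_to_mount() result */
	int mounted;		/* stands in for is_mounted(path->mnt) */
	int races_left;		/* how many concurrent moves to simulate */
};

/* Stand-in for where_to_mount(): just read the current answer. */
static int toy_where_to_mount(const struct toy_ns *ns)
{
	return ns->current_parent;
}

/* Pretend another thread changes the tree while we wait for the locks. */
static void toy_take_locks(struct toy_ns *ns)
{
	if (ns->races_left > 0) {
		ns->races_left--;
		ns->current_parent++;
	}
}

/*
 * Sample, lock, re-sample, compare.  Returns 0 and stores the parent on
 * success, a negative errno otherwise; *attempts counts loop iterations.
 */
static int toy_lock_mount(struct toy_ns *ns, int *parent, int *attempts)
{
	int err;

	*attempts = 0;
	do {
		int m, n;

		(*attempts)++;
		m = toy_where_to_mount(ns);	/* first look, before the locks */
		toy_take_locks(ns);		/* may block; world may change */
		n = toy_where_to_mount(ns);	/* second look, under the locks */

		if (n != m)
			err = -EAGAIN;		/* something moved, retry */
		else if (!ns->mounted)
			err = -ENOENT;		/* target got unmounted */
		else
			err = 0;

		if (!err)
			*parent = m;
		/* on failure the real code drops the locks at this point */
	} while (err == -EAGAIN);
	return err;
}
```

Only -EAGAIN loops; every other error ends the loop on the first pass, which matches the "while err == -EAGAIN" tail of the outline in the commit message.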
* Re: [PATCH 34/52] do_lock_mount(): don't modify path. 2025-08-25 4:43 ` [PATCH 34/52] do_lock_mount(): don't modify path Al Viro @ 2025-08-26 14:14 ` Askar Safin 0 siblings, 0 replies; 321+ messages in thread From: Askar Safin @ 2025-08-26 14:14 UTC (permalink / raw) To: viro; +Cc: brauner, jack, linux-fsdevel, torvalds

> +		m = __lookup_mnt(path->mnt, *dentry = path->dentry);

I don't like this. Someone may think you meant "*dentry == path->dentry" here.

Please, write this:

	*dentry = path->dentry;
	m = __lookup_mnt(path->mnt, *dentry);

-- 
Askar Safin
^ permalink raw reply [flat|nested] 321+ messages in thread
* [PATCH 35/52] constify check_mnt() 2025-08-25 4:43 ` [PATCH 01/52] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (32 preceding siblings ...) 2025-08-25 4:43 ` [PATCH 34/52] do_lock_mount(): don't modify path Al Viro @ 2025-08-25 4:43 ` Al Viro 2025-08-25 13:43 ` Christian Brauner 2025-08-25 4:43 ` [PATCH 36/52] do_mount_setattr(): constify path argument Al Viro ` (17 subsequent siblings) 51 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-25 4:43 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/namespace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index f95e12ab6c9a..458bef569816 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1010,7 +1010,7 @@ static void unpin_mountpoint(struct pinned_mountpoint *m)
 	}
 }
 
-static inline int check_mnt(struct mount *mnt)
+static inline int check_mnt(const struct mount *mnt)
 {
 	return mnt->mnt_ns == current->nsproxy->mnt_ns;
 }
-- 
2.47.2
^ permalink raw reply related [flat|nested] 321+ messages in thread
* Re: [PATCH 35/52] constify check_mnt() 2025-08-25 4:43 ` [PATCH 35/52] constify check_mnt() Al Viro @ 2025-08-25 13:43 ` Christian Brauner 0 siblings, 0 replies; 321+ messages in thread From: Christian Brauner @ 2025-08-25 13:43 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, jack, torvalds On Mon, Aug 25, 2025 at 05:43:38AM +0100, Al Viro wrote: > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > --- Reviewed-by: Christian Brauner <brauner@kernel.org> ^ permalink raw reply [flat|nested] 321+ messages in thread
* [PATCH 36/52] do_mount_setattr(): constify path argument 2025-08-25 4:43 ` [PATCH 01/52] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (33 preceding siblings ...) 2025-08-25 4:43 ` [PATCH 35/52] constify check_mnt() Al Viro @ 2025-08-25 4:43 ` Al Viro 2025-08-25 13:30 ` Christian Brauner 2025-08-25 4:43 ` [PATCH 37/52] do_set_group(): constify path arguments Al Viro ` (16 subsequent siblings) 51 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-25 4:43 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/namespace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 458bef569816..2db9b006e37e 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -4872,7 +4872,7 @@ static void mount_setattr_commit(struct mount_kattr *kattr, struct mount *mnt)
 	touch_mnt_namespace(mnt->mnt_ns);
 }
 
-static int do_mount_setattr(struct path *path, struct mount_kattr *kattr)
+static int do_mount_setattr(const struct path *path, struct mount_kattr *kattr)
 {
 	struct mount *mnt = real_mount(path->mnt);
 	int err = 0;
-- 
2.47.2
^ permalink raw reply related [flat|nested] 321+ messages in thread
* Re: [PATCH 36/52] do_mount_setattr(): constify path argument 2025-08-25 4:43 ` [PATCH 36/52] do_mount_setattr(): constify path argument Al Viro @ 2025-08-25 13:30 ` Christian Brauner 0 siblings, 0 replies; 321+ messages in thread From: Christian Brauner @ 2025-08-25 13:30 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, jack, torvalds On Mon, Aug 25, 2025 at 05:43:39AM +0100, Al Viro wrote: > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > --- Reviewed-by: Christian Brauner <brauner@kernel.org> ^ permalink raw reply [flat|nested] 321+ messages in thread
* [PATCH 37/52] do_set_group(): constify path arguments 2025-08-25 4:43 ` [PATCH 01/52] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (34 preceding siblings ...) 2025-08-25 4:43 ` [PATCH 36/52] do_mount_setattr(): constify path argument Al Viro @ 2025-08-25 4:43 ` Al Viro 2025-08-25 13:29 ` Christian Brauner 2025-08-25 4:43 ` [PATCH 38/52] drop_collected_paths(): constify arguments Al Viro ` (15 subsequent siblings) 51 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-25 4:43 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/namespace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 2db9b006e37e..d61601fc97ca 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3360,7 +3360,7 @@ static inline int tree_contains_unbindable(struct mount *mnt)
 	return 0;
 }
 
-static int do_set_group(struct path *from_path, struct path *to_path)
+static int do_set_group(const struct path *from_path, const struct path *to_path)
 {
 	struct mount *from = real_mount(from_path->mnt);
 	struct mount *to = real_mount(to_path->mnt);
-- 
2.47.2
^ permalink raw reply related [flat|nested] 321+ messages in thread
* Re: [PATCH 37/52] do_set_group(): constify path arguments 2025-08-25 4:43 ` [PATCH 37/52] do_set_group(): constify path arguments Al Viro @ 2025-08-25 13:29 ` Christian Brauner 0 siblings, 0 replies; 321+ messages in thread From: Christian Brauner @ 2025-08-25 13:29 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, jack, torvalds On Mon, Aug 25, 2025 at 05:43:40AM +0100, Al Viro wrote: > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > --- Reviewed-by: Christian Brauner <brauner@kernel.org> ^ permalink raw reply [flat|nested] 321+ messages in thread
* [PATCH 38/52] drop_collected_paths(): constify arguments 2025-08-25 4:43 ` [PATCH 01/52] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (35 preceding siblings ...) 2025-08-25 4:43 ` [PATCH 37/52] do_set_group(): constify path arguments Al Viro @ 2025-08-25 4:43 ` Al Viro 2025-08-25 13:31 ` Christian Brauner 2025-08-25 4:43 ` [PATCH 39/52] collect_paths(): constify the return value Al Viro ` (14 subsequent siblings) 51 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-25 4:43 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds ... and use that to constify the pointers in callers Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 4 ++-- include/linux/mount.h | 2 +- kernel/audit_tree.c | 12 ++++++------ 3 files changed, 9 insertions(+), 9 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index d61601fc97ca..d29d7c948ec1 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2334,9 +2334,9 @@ struct path *collect_paths(const struct path *path, return res; } -void drop_collected_paths(struct path *paths, struct path *prealloc) +void drop_collected_paths(const struct path *paths, struct path *prealloc) { - for (struct path *p = paths; p->mnt; p++) + for (const struct path *p = paths; p->mnt; p++) path_put(p); if (paths != prealloc) kfree(paths); diff --git a/include/linux/mount.h b/include/linux/mount.h index 5f9c053b0897..c09032463b36 100644 --- a/include/linux/mount.h +++ b/include/linux/mount.h @@ -105,7 +105,7 @@ extern int may_umount(struct vfsmount *); int do_mount(const char *, const char __user *, const char *, unsigned long, void *); extern struct path *collect_paths(const struct path *, struct path *, unsigned); -extern void drop_collected_paths(struct path *, struct path *); +extern void drop_collected_paths(const struct path *, struct path *); extern void kern_unmount_array(struct vfsmount *mnt[], unsigned int num); extern int cifs_root_data(char **dev, char **opts); diff --git 
a/kernel/audit_tree.c b/kernel/audit_tree.c index b0eae2a3c895..32007edf0e55 100644 --- a/kernel/audit_tree.c +++ b/kernel/audit_tree.c @@ -678,7 +678,7 @@ void audit_trim_trees(void) struct audit_tree *tree; struct path path; struct audit_node *node; - struct path *paths; + const struct path *paths; struct path array[16]; int err; @@ -701,7 +701,7 @@ void audit_trim_trees(void) struct audit_chunk *chunk = find_chunk(node); /* this could be NULL if the watch is dying else where... */ node->index |= 1U<<31; - for (struct path *p = paths; p->dentry; p++) { + for (const struct path *p = paths; p->dentry; p++) { struct inode *inode = p->dentry->d_inode; if (inode_to_key(inode) == chunk->key) { node->index &= ~(1U<<31); @@ -740,9 +740,9 @@ void audit_put_tree(struct audit_tree *tree) put_tree(tree); } -static int tag_mounts(struct path *paths, struct audit_tree *tree) +static int tag_mounts(const struct path *paths, struct audit_tree *tree) { - for (struct path *p = paths; p->dentry; p++) { + for (const struct path *p = paths; p->dentry; p++) { int err = tag_chunk(p->dentry->d_inode, tree); if (err) return err; @@ -805,7 +805,7 @@ int audit_add_tree_rule(struct audit_krule *rule) struct audit_tree *seed = rule->tree, *tree; struct path path; struct path array[16]; - struct path *paths; + const struct path *paths; int err; rule->tree = NULL; @@ -877,7 +877,7 @@ int audit_tag_tree(char *old, char *new) int failed = 0; struct path path1, path2; struct path array[16]; - struct path *paths; + const struct path *paths; int err; err = kern_path(new, 0, &path2); -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* Re: [PATCH 38/52] drop_collected_paths(): constify arguments 2025-08-25 4:43 ` [PATCH 38/52] drop_collected_paths(): constify arguments Al Viro @ 2025-08-25 13:31 ` Christian Brauner 0 siblings, 0 replies; 321+ messages in thread From: Christian Brauner @ 2025-08-25 13:31 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, jack, torvalds On Mon, Aug 25, 2025 at 05:43:41AM +0100, Al Viro wrote: > ... and use that to constify the pointers in callers > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > --- Reviewed-by: Christian Brauner <brauner@kernel.org> ^ permalink raw reply [flat|nested] 321+ messages in thread
* [PATCH 39/52] collect_paths(): constify the return value 2025-08-25 4:43 ` [PATCH 01/52] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (36 preceding siblings ...) 2025-08-25 4:43 ` [PATCH 38/52] drop_collected_paths(): constify arguments Al Viro @ 2025-08-25 4:43 ` Al Viro 2025-08-25 13:30 ` Christian Brauner 2025-08-25 4:43 ` [PATCH 40/52] do_move_mount(), vfs_move_mount(), do_move_mount_old(): constify struct path argument(s) Al Viro ` (13 subsequent siblings) 51 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-25 4:43 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds callers have no business modifying the paths they get Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 4 ++-- include/linux/mount.h | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index d29d7c948ec1..cc4e18040506 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2300,7 +2300,7 @@ static inline bool extend_array(struct path **res, struct path **to_free, return p; } -struct path *collect_paths(const struct path *path, +const struct path *collect_paths(const struct path *path, struct path *prealloc, unsigned count) { struct mount *root = real_mount(path->mnt); @@ -2334,7 +2334,7 @@ struct path *collect_paths(const struct path *path, return res; } -void drop_collected_paths(const struct path *paths, struct path *prealloc) +void drop_collected_paths(const struct path *paths, const struct path *prealloc) { for (const struct path *p = paths; p->mnt; p++) path_put(p); diff --git a/include/linux/mount.h b/include/linux/mount.h index c09032463b36..18e4b97f8a98 100644 --- a/include/linux/mount.h +++ b/include/linux/mount.h @@ -104,8 +104,8 @@ extern int may_umount_tree(struct vfsmount *); extern int may_umount(struct vfsmount *); int do_mount(const char *, const char __user *, const char *, unsigned long, void *); -extern struct path *collect_paths(const struct path *, struct path 
*, unsigned); -extern void drop_collected_paths(const struct path *, struct path *); +extern const struct path *collect_paths(const struct path *, struct path *, unsigned); +extern void drop_collected_paths(const struct path *, const struct path *); extern void kern_unmount_array(struct vfsmount *mnt[], unsigned int num); extern int cifs_root_data(char **dev, char **opts); -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
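What the two constification patches (38/52 and 39/52) buy callers can be shown with a toy version of the array walk. This is a sketch under assumptions: struct toy_path is a stand-in with string members instead of real vfsmount/dentry pointers, and toy_count_paths() is a hypothetical helper; only the NULL-terminator convention is taken from the loops in the diffs.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct path; an entry with a NULL mnt ends the array. */
struct toy_path {
	const char *mnt;
	const char *dentry;
};

/*
 * Callers only read the entries, so a pointer to const is all they need;
 * an accidental write through p would fail to compile.
 */
static size_t toy_count_paths(const struct toy_path *paths)
{
	size_t n = 0;

	for (const struct toy_path *p = paths; p->mnt; p++)
		n++;
	return n;
}
```

The loop has the same form as the ones in audit_trim_trees() and tag_mounts() after the conversion, where the iterator itself becomes const struct path *.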
* Re: [PATCH 39/52] collect_paths(): constify the return value 2025-08-25 4:43 ` [PATCH 39/52] collect_paths(): constify the return value Al Viro @ 2025-08-25 13:30 ` Christian Brauner 0 siblings, 0 replies; 321+ messages in thread From: Christian Brauner @ 2025-08-25 13:30 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, jack, torvalds On Mon, Aug 25, 2025 at 05:43:42AM +0100, Al Viro wrote: > callers have no business modifying the paths they get > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > --- Reviewed-by: Christian Brauner <brauner@kernel.org> ^ permalink raw reply [flat|nested] 321+ messages in thread
* [PATCH 40/52] do_move_mount(), vfs_move_mount(), do_move_mount_old(): constify struct path argument(s) 2025-08-25 4:43 ` [PATCH 01/52] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (37 preceding siblings ...) 2025-08-25 4:43 ` [PATCH 39/52] collect_paths(): constify the return value Al Viro @ 2025-08-25 4:43 ` Al Viro 2025-08-25 13:30 ` Christian Brauner 2025-08-25 4:43 ` [PATCH 41/52] mnt_warn_timestamp_expiry(): constify struct path argument Al Viro ` (12 subsequent siblings) 51 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-25 4:43 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index cc4e18040506..4704630847af 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3573,8 +3573,9 @@ static inline bool may_use_mount(struct mount *mnt) return check_anonymous_mnt(mnt); } -static int do_move_mount(struct path *old_path, - struct path *new_path, enum mnt_tree_flags_t flags) +static int do_move_mount(const struct path *old_path, + const struct path *new_path, + enum mnt_tree_flags_t flags) { struct mount *old = real_mount(old_path->mnt); int err; @@ -3646,7 +3647,7 @@ static int do_move_mount(struct path *old_path, return attach_recursive_mnt(old, &mp); } -static int do_move_mount_old(struct path *path, const char *old_name) +static int do_move_mount_old(const struct path *path, const char *old_name) { struct path old_path; int err; @@ -4481,7 +4482,8 @@ SYSCALL_DEFINE3(fsmount, int, fs_fd, unsigned int, flags, return ret; } -static inline int vfs_move_mount(struct path *from_path, struct path *to_path, +static inline int vfs_move_mount(const struct path *from_path, + const struct path *to_path, enum mnt_tree_flags_t mflags) { int ret; -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* Re: [PATCH 40/52] do_move_mount(), vfs_move_mount(), do_move_mount_old(): constify struct path argument(s)
From: Christian Brauner @ 2025-08-25 13:30 UTC
To: Al Viro; +Cc: linux-fsdevel, jack, torvalds

On Mon, Aug 25, 2025 at 05:43:43AM +0100, Al Viro wrote:
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---

Reviewed-by: Christian Brauner <brauner@kernel.org>
* [PATCH 41/52] mnt_warn_timestamp_expiry(): constify struct path argument
From: Al Viro @ 2025-08-25 4:43 UTC
To: linux-fsdevel; +Cc: brauner, jack, torvalds

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/namespace.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 4704630847af..70636922310c 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3231,7 +3231,8 @@ static void set_mount_attributes(struct mount *mnt, unsigned int mnt_flags)
 	touch_mnt_namespace(mnt->mnt_ns);
 }
 
-static void mnt_warn_timestamp_expiry(struct path *mountpoint, struct vfsmount *mnt)
+static void mnt_warn_timestamp_expiry(const struct path *mountpoint,
+				      struct vfsmount *mnt)
 {
 	struct super_block *sb = mnt->mnt_sb;
 
-- 
2.47.2
* Re: [PATCH 41/52] mnt_warn_timestamp_expiry(): constify struct path argument
From: Christian Brauner @ 2025-08-25 13:32 UTC
To: Al Viro; +Cc: linux-fsdevel, jack, torvalds

On Mon, Aug 25, 2025 at 05:43:44AM +0100, Al Viro wrote:
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---

Reviewed-by: Christian Brauner <brauner@kernel.org>
* [PATCH 42/52] do_new_mount{,_fc}(): constify struct path argument
From: Al Viro @ 2025-08-25 4:43 UTC
To: linux-fsdevel; +Cc: brauner, jack, torvalds

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/namespace.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 70636922310c..bf1a6efd335e 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3705,7 +3705,7 @@ static bool mount_too_revealing(const struct super_block *sb, int *new_mnt_flags
  * Create a new mount using a superblock configuration and request it
  * be added to the namespace tree.
  */
-static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint,
+static int do_new_mount_fc(struct fs_context *fc, const struct path *mountpoint,
 			   unsigned int mnt_flags)
 {
 	struct super_block *sb = fc->root->d_sb;
@@ -3739,8 +3739,9 @@ static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint,
  * create a new mount for userspace and request it to be added into the
  * namespace's tree
  */
-static int do_new_mount(struct path *path, const char *fstype, int sb_flags,
-			int mnt_flags, const char *name, void *data)
+static int do_new_mount(const struct path *path, const char *fstype,
+			int sb_flags, int mnt_flags,
+			const char *name, void *data)
 {
 	struct file_system_type *type;
 	struct fs_context *fc;
-- 
2.47.2
* Re: [PATCH 42/52] do_new_mount{,_fc}(): constify struct path argument
From: Christian Brauner @ 2025-08-25 13:30 UTC
To: Al Viro; +Cc: linux-fsdevel, jack, torvalds

On Mon, Aug 25, 2025 at 05:43:45AM +0100, Al Viro wrote:
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---

Reviewed-by: Christian Brauner <brauner@kernel.org>
* [PATCH 43/52] do_{loopback,change_type,remount,reconfigure_mnt}(): constify struct path argument
From: Al Viro @ 2025-08-25 4:43 UTC
To: linux-fsdevel; +Cc: brauner, jack, torvalds

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/namespace.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index bf1a6efd335e..68c12866205c 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2915,7 +2915,7 @@ static int flags_to_propagation_type(int ms_flags)
 /*
  * recursively change the type of the mountpoint.
  */
-static int do_change_type(struct path *path, int ms_flags)
+static int do_change_type(const struct path *path, int ms_flags)
 {
 	struct mount *m;
 	struct mount *mnt = real_mount(path->mnt);
@@ -3035,8 +3035,8 @@ static struct mount *__do_loopback(struct path *old_path, int recurse)
 /*
  * do loopback mount.
  */
-static int do_loopback(struct path *path, const char *old_name,
-				int recurse)
+static int do_loopback(const struct path *path, const char *old_name,
+		       int recurse)
 {
 	struct path old_path __free(path_put) = {};
 	struct mount *mnt = NULL;
@@ -3266,7 +3266,7 @@ static void mnt_warn_timestamp_expiry(const struct path *mountpoint,
  * superblock it refers to. This is triggered by specifying MS_REMOUNT|MS_BIND
  * to mount(2).
  */
-static int do_reconfigure_mnt(struct path *path, unsigned int mnt_flags)
+static int do_reconfigure_mnt(const struct path *path, unsigned int mnt_flags)
 {
 	struct super_block *sb = path->mnt->mnt_sb;
 	struct mount *mnt = real_mount(path->mnt);
@@ -3303,7 +3303,7 @@ static int do_reconfigure_mnt(struct path *path, unsigned int mnt_flags)
  * If you've mounted a non-root directory somewhere and want to do remount
  * on it - tough luck.
  */
-static int do_remount(struct path *path, int ms_flags, int sb_flags,
+static int do_remount(const struct path *path, int ms_flags, int sb_flags,
 		      int mnt_flags, void *data)
 {
 	int err;
-- 
2.47.2
* Re: [PATCH 43/52] do_{loopback,change_type,remount,reconfigure_mnt}(): constify struct path argument
From: Christian Brauner @ 2025-08-25 13:31 UTC
To: Al Viro; +Cc: linux-fsdevel, jack, torvalds

On Mon, Aug 25, 2025 at 05:43:46AM +0100, Al Viro wrote:
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---

Reviewed-by: Christian Brauner <brauner@kernel.org>
* [PATCH 44/52] path_mount(): constify struct path argument
From: Al Viro @ 2025-08-25 4:43 UTC
To: linux-fsdevel; +Cc: brauner, jack, torvalds

now it finally can be done.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/internal.h  | 2 +-
 fs/namespace.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/internal.h b/fs/internal.h
index 38e8aab27bbd..fe88563b4822 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -84,7 +84,7 @@ void mnt_put_write_access_file(struct file *file);
 extern void dissolve_on_fput(struct vfsmount *);
 extern bool may_mount(void);
 
-int path_mount(const char *dev_name, struct path *path,
+int path_mount(const char *dev_name, const struct path *path,
 		const char *type_page, unsigned long flags, void *data_page);
 int path_umount(struct path *path, int flags);
 
diff --git a/fs/namespace.c b/fs/namespace.c
index 68c12866205c..94eec417cc61 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -4024,7 +4024,7 @@ static char *copy_mount_string(const void __user *data)
  * Therefore, if this magic number is present, it carries no information
  * and must be discarded.
  */
-int path_mount(const char *dev_name, struct path *path,
+int path_mount(const char *dev_name, const struct path *path,
 		const char *type_page, unsigned long flags, void *data_page)
 {
 	unsigned int mnt_flags = 0, sb_flags;
-- 
2.47.2
* Re: [PATCH 44/52] path_mount(): constify struct path argument
From: Christian Brauner @ 2025-08-25 13:32 UTC
To: Al Viro; +Cc: linux-fsdevel, jack, torvalds

On Mon, Aug 25, 2025 at 05:43:47AM +0100, Al Viro wrote:
> now it finally can be done.
>
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---

Reviewed-by: Christian Brauner <brauner@kernel.org>
* [PATCH 45/52] may_copy_tree(), __do_loopback(): constify struct path argument
From: Al Viro @ 2025-08-25 4:43 UTC
To: linux-fsdevel; +Cc: brauner, jack, torvalds

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/namespace.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 94eec417cc61..a94aa249cedb 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2991,7 +2991,7 @@ static int do_change_type(const struct path *path, int ms_flags)
  *
  * Returns true if the mount tree can be copied, false otherwise.
  */
-static inline bool may_copy_tree(struct path *path)
+static inline bool may_copy_tree(const struct path *path)
 {
 	struct mount *mnt = real_mount(path->mnt);
 	const struct dentry_operations *d_op;
@@ -3013,7 +3013,7 @@ static inline bool may_copy_tree(struct path *path)
 }
 
 
-static struct mount *__do_loopback(struct path *old_path, int recurse)
+static struct mount *__do_loopback(const struct path *old_path, int recurse)
 {
 	struct mount *old = real_mount(old_path->mnt);
 
-- 
2.47.2
* Re: [PATCH 45/52] may_copy_tree(), __do_loopback(): constify struct path argument
From: Christian Brauner @ 2025-08-25 13:40 UTC
To: Al Viro; +Cc: linux-fsdevel, jack, torvalds

On Mon, Aug 25, 2025 at 05:43:48AM +0100, Al Viro wrote:
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---

Reviewed-by: Christian Brauner <brauner@kernel.org>
* [PATCH 46/52] path_umount(): constify struct path argument
From: Al Viro @ 2025-08-25 4:43 UTC
To: linux-fsdevel; +Cc: brauner, jack, torvalds

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/internal.h  | 2 +-
 fs/namespace.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/internal.h b/fs/internal.h
index fe88563b4822..549e6bd453b0 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -86,7 +86,7 @@ extern bool may_mount(void);
 
 int path_mount(const char *dev_name, const struct path *path,
 		const char *type_page, unsigned long flags, void *data_page);
-int path_umount(struct path *path, int flags);
+int path_umount(const struct path *path, int flags);
 
 int show_path(struct seq_file *m, struct dentry *root);
 
diff --git a/fs/namespace.c b/fs/namespace.c
index a94aa249cedb..76f0dde2ff62 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2084,7 +2084,7 @@ static int can_umount(const struct path *path, int flags)
 }
 
 // caller is responsible for flags being sane
-int path_umount(struct path *path, int flags)
+int path_umount(const struct path *path, int flags)
 {
 	struct mount *mnt = real_mount(path->mnt);
 	int ret;
-- 
2.47.2
* Re: [PATCH 46/52] path_umount(): constify struct path argument
From: Christian Brauner @ 2025-08-25 13:40 UTC
To: Al Viro; +Cc: linux-fsdevel, jack, torvalds

On Mon, Aug 25, 2025 at 05:43:49AM +0100, Al Viro wrote:
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---

Reviewed-by: Christian Brauner <brauner@kernel.org>
* [PATCH 47/52] constify can_move_mount_beneath() arguments
From: Al Viro @ 2025-08-25 4:43 UTC
To: linux-fsdevel; +Cc: brauner, jack, torvalds

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/namespace.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 76f0dde2ff62..c6fd5d4d7947 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3473,8 +3473,8 @@ static bool mount_is_ancestor(const struct mount *p1, const struct mount *p2)
  * Context: This function expects namespace_lock() to be held.
  * Return: On success 0, and on error a negative error code is returned.
  */
-static int can_move_mount_beneath(struct mount *mnt_from,
-				  struct mount *mnt_to,
+static int can_move_mount_beneath(const struct mount *mnt_from,
+				  const struct mount *mnt_to,
 				  const struct mountpoint *mp)
 {
 	struct mount *parent_mnt_to = mnt_to->mnt_parent;
-- 
2.47.2
* Re: [PATCH 47/52] constify can_move_mount_beneath() arguments
From: Christian Brauner @ 2025-08-25 13:39 UTC
To: Al Viro; +Cc: linux-fsdevel, jack, torvalds

On Mon, Aug 25, 2025 at 05:43:50AM +0100, Al Viro wrote:
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---

Reviewed-by: Christian Brauner <brauner@kernel.org>
* [PATCH 48/52] do_move_mount_old(): use __free(path_put)
From: Al Viro @ 2025-08-25 4:43 UTC
To: linux-fsdevel; +Cc: brauner, jack, torvalds

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/namespace.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index c6fd5d4d7947..da30c7b757d3 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3650,7 +3650,7 @@ static int do_move_mount(const struct path *old_path,
 
 static int do_move_mount_old(const struct path *path, const char *old_name)
 {
-	struct path old_path;
+	struct path old_path __free(path_put) = {};
 	int err;
 
 	if (!old_name || !*old_name)
@@ -3660,9 +3660,7 @@ static int do_move_mount_old(const struct path *path, const char *old_name)
 	if (err)
 		return err;
 
-	err = do_move_mount(&old_path, path, 0);
-	path_put(&old_path);
-	return err;
+	return do_move_mount(&old_path, path, 0);
 }
 
 /*
-- 
2.47.2
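Editorial note: the conversion in patch 48 (and the identical one in patch 49) relies on scope-based cleanup. What follows is a minimal userspace sketch of that pattern, assuming GCC or Clang and their `cleanup` variable attribute. Every name in it (`fake_path`, `fake_path_put`, `fake_lookup`, `auto_path_put`, `move_old`) is a hypothetical stand-in and not the kernel's definition. It only illustrates why the variable is zero-initialized and why the explicit put before `return` can go away.

```c
#include <assert.h>

/* Hypothetical stand-in for struct path; not the kernel definition. */
struct fake_path {
	int pinned;		/* 1 while a reference is held */
};

static int puts_seen;		/* counts real releases, for demonstration */

/* Stand-in for path_put(): a no-op on a zeroed (never looked up) path. */
static void fake_path_put(struct fake_path *p)
{
	if (p->pinned) {
		p->pinned = 0;
		puts_seen++;
	}
}

/* Userspace analogue of __free(path_put), built on the compiler attribute. */
#define auto_path_put __attribute__((cleanup(fake_path_put)))

/* Stand-in for a path lookup: fails without touching *p on empty names. */
static int fake_lookup(const char *name, struct fake_path *p)
{
	if (!name || !*name)
		return -22;	/* plays the role of -EINVAL */
	p->pinned = 1;
	return 0;
}

static int use_path(const struct fake_path *p)
{
	return p->pinned ? 0 : -1;
}

/* Same shape as do_move_mount_old() after the patch. */
static int move_old(const char *old_name)
{
	struct fake_path old_path auto_path_put = { 0 };
	int err;

	err = fake_lookup(old_name, &old_path);
	if (err)
		return err;	/* cleanup sees a zeroed path and does nothing */
	return use_path(&old_path);	/* value is computed, then cleanup runs */
}

/* Exercises both exits; returns how many real releases happened. */
int demo_cleanup(void)
{
	puts_seen = 0;
	assert(move_old("") == -22);
	assert(puts_seen == 0);
	assert(move_old("/mnt") == 0);
	return puts_seen;
}
```

The zero initializer is what makes the early-return path safe in this sketch: the cleanup handler runs on every exit from the scope, so it has to tolerate an object that was never filled in.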
* Re: [PATCH 48/52] do_move_mount_old(): use __free(path_put)
From: Christian Brauner @ 2025-08-25 13:40 UTC
To: Al Viro; +Cc: linux-fsdevel, jack, torvalds

On Mon, Aug 25, 2025 at 05:43:51AM +0100, Al Viro wrote:
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---

Reviewed-by: Christian Brauner <brauner@kernel.org>
* [PATCH 49/52] do_mount(): use __free(path_put)
From: Al Viro @ 2025-08-25 4:43 UTC
To: linux-fsdevel; +Cc: brauner, jack, torvalds

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/namespace.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index da30c7b757d3..d8554742b1c0 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -4104,15 +4104,13 @@ int path_mount(const char *dev_name, const struct path *path,
 int do_mount(const char *dev_name, const char __user *dir_name,
 		const char *type_page, unsigned long flags, void *data_page)
 {
-	struct path path;
+	struct path path __free(path_put) = {};
 	int ret;
 
 	ret = user_path_at(AT_FDCWD, dir_name, LOOKUP_FOLLOW, &path);
 	if (ret)
 		return ret;
-	ret = path_mount(dev_name, &path, type_page, flags, data_page);
-	path_put(&path);
-	return ret;
+	return path_mount(dev_name, &path, type_page, flags, data_page);
 }
 
 static struct ucounts *inc_mnt_namespaces(struct user_namespace *ns)
-- 
2.47.2
* Re: [PATCH 49/52] do_mount(): use __free(path_put)
From: Christian Brauner @ 2025-08-25 13:32 UTC
To: Al Viro; +Cc: linux-fsdevel, jack, torvalds

On Mon, Aug 25, 2025 at 05:43:52AM +0100, Al Viro wrote:
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---

Reviewed-by: Christian Brauner <brauner@kernel.org>
* [PATCH 50/52] umount_tree(): take all victims out of propagation graph at once
From: Al Viro @ 2025-08-25 4:43 UTC
To: linux-fsdevel; +Cc: brauner, jack, torvalds

For each removed mount we need to calculate where the slaves will end up.
To avoid duplicating that work, do it for all mounts to be removed
at once, taking the mounts themselves out of propagation graph as
we go, then do all transfers; the duplicate work on finding destinations
is avoided since if we run into a mount that already had destination found,
we don't need to trace the rest of the way. That's guaranteed
O(removed mounts) for finding destinations and removing from propagation
graph and O(surviving mounts that have master removed) for transfers.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/namespace.c |  3 ++-
 fs/pnode.c     | 67 +++++++++++++++++++++++++++++++++++++++-----------
 fs/pnode.h     |  1 +
 3 files changed, 55 insertions(+), 16 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index d8554742b1c0..82cab5459ec7 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1846,6 +1846,8 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 	if (how & UMOUNT_PROPAGATE)
 		propagate_umount(&tmp_list);
 
+	bulk_make_private(&tmp_list);
+
 	while (!list_empty(&tmp_list)) {
 		struct mnt_namespace *ns;
 		bool disconnect;
@@ -1870,7 +1872,6 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 				umount_mnt(p);
 			}
 		}
-		change_mnt_propagation(p, MS_PRIVATE);
 		if (disconnect)
 			hlist_add_head(&p->mnt_umount, &unmounted);
 
diff --git a/fs/pnode.c b/fs/pnode.c
index edaf9d9d0eaf..5d91c3e58d2a 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -71,19 +71,6 @@ static inline bool will_be_unmounted(struct mount *m)
 	return m->mnt.mnt_flags & MNT_UMOUNT;
 }
 
-static struct mount *propagation_source(struct mount *mnt)
-{
-	do {
-		struct mount *m;
-		for (m = next_peer(mnt); m != mnt; m = next_peer(m)) {
-			if (!will_be_unmounted(m))
-				return m;
-		}
-		mnt = mnt->mnt_master;
-	} while (mnt && will_be_unmounted(mnt));
-	return mnt;
-}
-
 static void transfer_propagation(struct mount *mnt, struct mount *to)
 {
 	struct hlist_node *p = NULL, *n;
@@ -112,11 +99,10 @@ void change_mnt_propagation(struct mount *mnt, int type)
 		return;
 	}
 	if (IS_MNT_SHARED(mnt)) {
-		if (type == MS_SLAVE || !hlist_empty(&mnt->mnt_slave_list))
-			m = propagation_source(mnt);
 		if (list_empty(&mnt->mnt_share)) {
 			mnt_release_group_id(mnt);
 		} else {
+			m = next_peer(mnt);
 			list_del_init(&mnt->mnt_share);
 			mnt->mnt_group_id = 0;
 		}
@@ -137,6 +123,57 @@ void change_mnt_propagation(struct mount *mnt, int type)
 	}
 }
 
+static struct mount *trace_transfers(struct mount *m)
+{
+	while (1) {
+		struct mount *next = next_peer(m);
+
+		if (next != m) {
+			list_del_init(&m->mnt_share);
+			m->mnt_group_id = 0;
+			m->mnt_master = next;
+		} else {
+			if (IS_MNT_SHARED(m))
+				mnt_release_group_id(m);
+			next = m->mnt_master;
+		}
+		hlist_del_init(&m->mnt_slave);
+		CLEAR_MNT_SHARED(m);
+		SET_MNT_MARK(m);
+
+		if (!next || !will_be_unmounted(next))
+			return next;
+		if (IS_MNT_MARKED(next))
+			return next->mnt_master;
+		m = next;
+	}
+}
+
+static void set_destinations(struct mount *m, struct mount *master)
+{
+	struct mount *next;
+
+	while ((next = m->mnt_master) != master) {
+		m->mnt_master = master;
+		m = next;
+	}
+}
+
+void bulk_make_private(struct list_head *set)
+{
+	struct mount *m;
+
+	list_for_each_entry(m, set, mnt_list)
+		if (!IS_MNT_MARKED(m))
+			set_destinations(m, trace_transfers(m));
+
+	list_for_each_entry(m, set, mnt_list) {
+		transfer_propagation(m, m->mnt_master);
+		m->mnt_master = NULL;
+		CLEAR_MNT_MARK(m);
+	}
+}
+
 static struct mount *__propagation_next(struct mount *m,
 					 struct mount *origin)
 {
diff --git a/fs/pnode.h b/fs/pnode.h
index 00ab153e3e9d..b029db225f33 100644
--- a/fs/pnode.h
+++ b/fs/pnode.h
@@ -42,6 +42,7 @@ static inline bool peers(const struct mount *m1, const struct mount *m2)
 }
 
 void change_mnt_propagation(struct mount *, int);
+void bulk_make_private(struct list_head *);
 int propagate_mnt(struct mount *, struct mountpoint *, struct mount *,
 		struct hlist_head *);
 void propagate_umount(struct list_head *);
-- 
2.47.2
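Editorial note: the "already had destination found" shortcut described in the commit message for patch 50 can be sketched in isolation. The userspace model below is a deliberate simplification and an assumption on my part. It keeps only a master pointer, a "doomed" flag and a mark per node, and it ignores peer groups, group IDs, slave lists and the final transfer step that the real `trace_transfers()`, `set_destinations()` and `bulk_make_private()` deal with. The names `node`, `trace`, `resolve_all`, `hops` and `demo_resolve` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical simplified node: master chain only, no peer groups. */
struct node {
	struct node *master;	/* propagation master, or NULL */
	bool doomed;		/* about to be removed */
	bool marked;		/* already visited by trace() */
};

static int hops;		/* nodes visited by trace(), for demonstration */

/* Walk up from m, marking as we go; return the destination for the chain. */
static struct node *trace(struct node *m)
{
	for (;;) {
		struct node *next = m->master;

		m->marked = true;
		hops++;
		if (!next || !next->doomed)
			return next;		/* reached a survivor or the top */
		if (next->marked)
			return next->master;	/* reuse an earlier answer */
		m = next;
	}
}

/* Point every node on the chain starting at m directly at dest. */
static void set_destinations(struct node *m, struct node *dest)
{
	struct node *next;

	while ((next = m->master) != dest) {
		m->master = dest;
		m = next;
	}
}

/* Resolve destinations for a whole set of doomed nodes in one pass. */
void resolve_all(struct node **set, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (!set[i]->marked)
			set_destinations(set[i], trace(set[i]));
}

/* s survives; a, b, c form a chain below it and d hangs off b. */
int demo_resolve(void)
{
	struct node s = { NULL, false, false };
	struct node a = { &s, true, false };
	struct node b = { &a, true, false };
	struct node c = { &b, true, false };
	struct node d = { &b, true, false };
	struct node *set[] = { &c, &d, &a, &b };

	hops = 0;
	resolve_all(set, 4);
	assert(a.master == &s && b.master == &s);
	assert(c.master == &s && d.master == &s);
	return hops;
}
```

In this example each of the four doomed nodes is visited by `trace()` exactly once: the walk starting at `d` stops as soon as it meets the already-marked `b` and reuses its answer. That is the behaviour the O(removed mounts) bound in the commit message depends on.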
* [PATCH 51/52] ecryptfs: get rid of pointless mount references in ecryptfs dentries
From: Al Viro @ 2025-08-25 4:43 UTC
To: linux-fsdevel; +Cc: brauner, jack, torvalds

->lower_path.mnt has the same value for all dentries on given ecryptfs
instance and if somebody goes for mountpoint-crossing variant where that
would not be true, we can deal with that when it happens (and _not_
with duplicating these references into each dentry).

As it is, we are better off just sticking a reference into
ecryptfs-private part of superblock and keeping it pinned until
->kill_sb().

That way we can stick a reference to underlying dentry right into
->d_fsdata of ecryptfs one, getting rid of indirection through struct
ecryptfs_dentry_info, along with the entire struct ecryptfs_dentry_info
machinery.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/ecryptfs/dentry.c | 14 +------------- fs/ecryptfs/ecryptfs_kernel.h | 27 +++++++++++---------------- fs/ecryptfs/file.c | 15 +++++++-------- fs/ecryptfs/inode.c | 19 +++++-------------- fs/ecryptfs/main.c | 24 ++++++------------------ 5 files changed, 30 insertions(+), 69 deletions(-) diff --git a/fs/ecryptfs/dentry.c b/fs/ecryptfs/dentry.c index 1dfd5b81d831..6648a924e31a 100644 --- a/fs/ecryptfs/dentry.c +++ b/fs/ecryptfs/dentry.c @@ -59,14 +59,6 @@ static int ecryptfs_d_revalidate(struct inode *dir, const struct qstr *name, return rc; } -struct kmem_cache *ecryptfs_dentry_info_cache; - -static void ecryptfs_dentry_free_rcu(struct rcu_head *head) -{ - kmem_cache_free(ecryptfs_dentry_info_cache, - container_of(head, struct ecryptfs_dentry_info, rcu)); -} - /** * ecryptfs_d_release * @dentry: The ecryptfs dentry @@ -75,11 +67,7 @@ static void ecryptfs_dentry_free_rcu(struct rcu_head *head) */ static void ecryptfs_d_release(struct dentry *dentry) { - struct ecryptfs_dentry_info *p = dentry->d_fsdata; - if (p) { - path_put(&p->lower_path); - call_rcu(&p->rcu, ecryptfs_dentry_free_rcu); - } + dput(dentry->d_fsdata); } const struct dentry_operations ecryptfs_dops = { diff --git a/fs/ecryptfs/ecryptfs_kernel.h b/fs/ecryptfs/ecryptfs_kernel.h index 1f562e75d0e4..9e6ab0b41337 100644 --- a/fs/ecryptfs/ecryptfs_kernel.h +++ b/fs/ecryptfs/ecryptfs_kernel.h @@ -258,13 +258,6 @@ struct ecryptfs_inode_info { struct ecryptfs_crypt_stat crypt_stat; }; -/* dentry private data. Each dentry must keep track of a lower - * vfsmount too. */ -struct ecryptfs_dentry_info { - struct path lower_path; - struct rcu_head rcu; -}; - /** * ecryptfs_global_auth_tok - A key used to encrypt all new files under the mountpoint * @flags: Status flags @@ -348,6 +341,7 @@ struct ecryptfs_mount_crypt_stat { /* superblock private data. 
*/ struct ecryptfs_sb_info { struct super_block *wsi_sb; + struct vfsmount *lower_mnt; struct ecryptfs_mount_crypt_stat mount_crypt_stat; }; @@ -494,22 +488,25 @@ ecryptfs_set_superblock_lower(struct super_block *sb, } static inline void -ecryptfs_set_dentry_private(struct dentry *dentry, - struct ecryptfs_dentry_info *dentry_info) +ecryptfs_set_dentry_lower(struct dentry *dentry, + struct dentry *lower_dentry) { - dentry->d_fsdata = dentry_info; + dentry->d_fsdata = lower_dentry; } static inline struct dentry * ecryptfs_dentry_to_lower(struct dentry *dentry) { - return ((struct ecryptfs_dentry_info *)dentry->d_fsdata)->lower_path.dentry; + return dentry->d_fsdata; } -static inline const struct path * -ecryptfs_dentry_to_lower_path(struct dentry *dentry) +static inline struct path +ecryptfs_lower_path(struct dentry *dentry) { - return &((struct ecryptfs_dentry_info *)dentry->d_fsdata)->lower_path; + return (struct path){ + .mnt = ecryptfs_superblock_to_private(dentry->d_sb)->lower_mnt, + .dentry = ecryptfs_dentry_to_lower(dentry) + }; } #define ecryptfs_printk(type, fmt, arg...) 
\
@@ -532,7 +529,6 @@ extern unsigned int ecryptfs_number_of_users;
 
 extern struct kmem_cache *ecryptfs_auth_tok_list_item_cache;
 extern struct kmem_cache *ecryptfs_file_info_cache;
-extern struct kmem_cache *ecryptfs_dentry_info_cache;
 extern struct kmem_cache *ecryptfs_inode_info_cache;
 extern struct kmem_cache *ecryptfs_sb_info_cache;
 extern struct kmem_cache *ecryptfs_header_cache;
@@ -557,7 +553,6 @@ int ecryptfs_encrypt_and_encode_filename(
 	size_t *encoded_name_size,
 	struct ecryptfs_mount_crypt_stat *mount_crypt_stat,
 	const char *name, size_t name_size);
-struct dentry *ecryptfs_lower_dentry(struct dentry *this_dentry);
 void ecryptfs_dump_hex(char *data, int bytes);
 int virt_to_scatterlist(const void *addr, int size, struct scatterlist *sg,
 			int sg_size);
diff --git a/fs/ecryptfs/file.c b/fs/ecryptfs/file.c
index 5f8f96da09fe..7929411837cf 100644
--- a/fs/ecryptfs/file.c
+++ b/fs/ecryptfs/file.c
@@ -33,13 +33,12 @@ static ssize_t ecryptfs_read_update_atime(struct kiocb *iocb,
 				struct iov_iter *to)
 {
 	ssize_t rc;
-	const struct path *path;
 	struct file *file = iocb->ki_filp;
 
 	rc = generic_file_read_iter(iocb, to);
 	if (rc >= 0) {
-		path = ecryptfs_dentry_to_lower_path(file->f_path.dentry);
-		touch_atime(path);
+		struct path path = ecryptfs_lower_path(file->f_path.dentry);
+		touch_atime(&path);
 	}
 	return rc;
 }
@@ -59,12 +58,11 @@ static ssize_t ecryptfs_splice_read_update_atime(struct file *in, loff_t *ppos,
 						 size_t len, unsigned int flags)
 {
 	ssize_t rc;
-	const struct path *path;
 
 	rc = filemap_splice_read(in, ppos, pipe, len, flags);
 	if (rc >= 0) {
-		path = ecryptfs_dentry_to_lower_path(in->f_path.dentry);
-		touch_atime(path);
+		struct path path = ecryptfs_lower_path(in->f_path.dentry);
+		touch_atime(&path);
 	}
 	return rc;
 }
@@ -283,6 +281,7 @@ static int ecryptfs_dir_open(struct inode *inode, struct file *file)
 	 * ecryptfs_lookup() */
 	struct ecryptfs_file_info *file_info;
 	struct file *lower_file;
+	struct path path;
 
 	/* Released in ecryptfs_release or end of function if failure */
 	file_info = kmem_cache_zalloc(ecryptfs_file_info_cache, GFP_KERNEL);
@@ -292,8 +291,8 @@ static int ecryptfs_dir_open(struct inode *inode, struct file *file)
 				"Error attempting to allocate memory\n");
 		return -ENOMEM;
 	}
-	lower_file = dentry_open(ecryptfs_dentry_to_lower_path(ecryptfs_dentry),
-				 file->f_flags, current_cred());
+	path = ecryptfs_lower_path(ecryptfs_dentry);
+	lower_file = dentry_open(&path, file->f_flags, current_cred());
 	if (IS_ERR(lower_file)) {
 		printk(KERN_ERR "%s: Error attempting to initialize "
 			"the lower file for the dentry with name "
diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
index 72fbe1316ab8..d2b262dc485d 100644
--- a/fs/ecryptfs/inode.c
+++ b/fs/ecryptfs/inode.c
@@ -327,24 +327,15 @@ static int ecryptfs_i_size_read(struct dentry *dentry, struct inode *inode)
 static struct dentry *ecryptfs_lookup_interpose(struct dentry *dentry,
 				     struct dentry *lower_dentry)
 {
-	const struct path *path = ecryptfs_dentry_to_lower_path(dentry->d_parent);
+	struct dentry *lower_parent = ecryptfs_dentry_to_lower(dentry->d_parent);
 	struct inode *inode, *lower_inode;
-	struct ecryptfs_dentry_info *dentry_info;
 	int rc = 0;
 
-	dentry_info = kmem_cache_alloc(ecryptfs_dentry_info_cache, GFP_KERNEL);
-	if (!dentry_info) {
-		dput(lower_dentry);
-		return ERR_PTR(-ENOMEM);
-	}
-
 	fsstack_copy_attr_atime(d_inode(dentry->d_parent),
-				d_inode(path->dentry));
+				d_inode(lower_parent));
 	BUG_ON(!d_count(lower_dentry));
 
-	ecryptfs_set_dentry_private(dentry, dentry_info);
-	dentry_info->lower_path.mnt = mntget(path->mnt);
-	dentry_info->lower_path.dentry = lower_dentry;
+	ecryptfs_set_dentry_lower(dentry, lower_dentry);
 
 	/*
 	 * negative dentry can go positive under us here - its parent is not
@@ -1022,10 +1013,10 @@ static int ecryptfs_getattr(struct mnt_idmap *idmap,
 {
 	struct dentry *dentry = path->dentry;
 	struct kstat lower_stat;
+	struct path lower_path = ecryptfs_lower_path(dentry);
 	int rc;
 
-	rc = vfs_getattr_nosec(ecryptfs_dentry_to_lower_path(dentry),
-			       &lower_stat, request_mask, flags);
+	rc = vfs_getattr_nosec(&lower_path, &lower_stat, request_mask, flags);
 	if (!rc) {
 		fsstack_copy_attr_all(d_inode(dentry),
 				      ecryptfs_inode_to_lower(d_inode(dentry)));
diff --git a/fs/ecryptfs/main.c b/fs/ecryptfs/main.c
index eab1beb846d3..2afbcbbd9546 100644
--- a/fs/ecryptfs/main.c
+++ b/fs/ecryptfs/main.c
@@ -106,15 +106,14 @@ static int ecryptfs_init_lower_file(struct dentry *dentry,
 				    struct file **lower_file)
 {
 	const struct cred *cred = current_cred();
-	const struct path *path = ecryptfs_dentry_to_lower_path(dentry);
+	struct path path = ecryptfs_lower_path(dentry);
 	int rc;
 
-	rc = ecryptfs_privileged_open(lower_file, path->dentry, path->mnt,
-				      cred);
+	rc = ecryptfs_privileged_open(lower_file, path.dentry, path.mnt, cred);
 	if (rc) {
 		printk(KERN_ERR "Error opening lower file "
 		       "for lower_dentry [0x%p] and lower_mnt [0x%p]; "
-		       "rc = [%d]\n", path->dentry, path->mnt, rc);
+		       "rc = [%d]\n", path.dentry, path.mnt, rc);
 		(*lower_file) = NULL;
 	}
 	return rc;
@@ -437,7 +436,6 @@ static int ecryptfs_get_tree(struct fs_context *fc)
 	struct ecryptfs_fs_context *ctx = fc->fs_private;
 	struct ecryptfs_sb_info *sbi = fc->s_fs_info;
 	struct ecryptfs_mount_crypt_stat *mount_crypt_stat;
-	struct ecryptfs_dentry_info *root_info;
 	const char *err = "Getting sb failed";
 	struct inode *inode;
 	struct path path;
@@ -543,14 +541,8 @@ static int ecryptfs_get_tree(struct fs_context *fc)
 		goto out_free;
 	}
 
-	rc = -ENOMEM;
-	root_info = kmem_cache_zalloc(ecryptfs_dentry_info_cache, GFP_KERNEL);
-	if (!root_info)
-		goto out_free;
-
-	/* ->kill_sb() will take care of root_info */
-	ecryptfs_set_dentry_private(s->s_root, root_info);
-	root_info->lower_path = path;
+	ecryptfs_set_dentry_lower(s->s_root, path.dentry);
+	sbi->lower_mnt = path.mnt;
 
 	s->s_flags |= SB_ACTIVE;
 	fc->root = dget(s->s_root);
@@ -580,6 +572,7 @@ static void ecryptfs_kill_block_super(struct super_block *sb)
 	kill_anon_super(sb);
 	if (!sb_info)
 		return;
+	mntput(sb_info->lower_mnt);
 	ecryptfs_destroy_mount_crypt_stat(&sb_info->mount_crypt_stat);
 	kmem_cache_free(ecryptfs_sb_info_cache, sb_info);
 }
@@ -667,11 +660,6 @@ static struct ecryptfs_cache_info {
 		.name = "ecryptfs_file_cache",
 		.size = sizeof(struct ecryptfs_file_info),
 	},
-	{
-		.cache = &ecryptfs_dentry_info_cache,
-		.name = "ecryptfs_dentry_info_cache",
-		.size = sizeof(struct ecryptfs_dentry_info),
-	},
 	{
 		.cache = &ecryptfs_inode_info_cache,
 		.name = "ecryptfs_inode_cache",
-- 
2.47.2

^ permalink raw reply related [flat|nested] 321+ messages in thread
* Re: [PATCH 51/52] ecryptfs: get rid of pointless mount references in ecryptfs dentries
  2025-08-25 4:43 ` [PATCH 51/52] ecryptfs: get rid of pointless mount references in ecryptfs dentries Al Viro
@ 2025-08-25 13:41 ` Christian Brauner
  0 siblings, 0 replies; 321+ messages in thread
From: Christian Brauner @ 2025-08-25 13:41 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, jack, torvalds

On Mon, Aug 25, 2025 at 05:43:54AM +0100, Al Viro wrote:
> ->lower_path.mnt has the same value for all dentries on given ecryptfs
> instance and if somebody goes for mountpoint-crossing variant where that
> would not be true, we can deal with that when it happens (and _not_
> with duplicating these reference into each dentry).
>
> As it is, we are better off just sticking a reference into ecryptfs-private
> part of superblock and keeping it pinned until ->kill_sb(). The overlayfs model.
>
> That way we can stick a reference to underlying dentry right into ->d_fsdata
> of ecryptfs one, getting rid of indirection through struct ecryptfs_dentry_info,
> along with the entire struct ecryptfs_dentry_info machinery.
>
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---

Reviewed-by: Christian Brauner <brauner@kernel.org>

^ permalink raw reply [flat|nested] 321+ messages in thread
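For readers following along, the layout change in 51/52 can be modelled outside the kernel. The sketch below is only a userspace toy: every toy_* name is hypothetical, and the by-value helper is an assumption about what ecryptfs_lower_path() amounts to, since its definition is not part of this excerpt. It shows the mount pointer stored once per superblock, each dentry pointing straight at its lower dentry, and the (mnt, dentry) pair assembled on demand.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for vfsmount, the ecryptfs sb-private data, dentry and path. */
struct toy_mnt { int refs; };
struct toy_sb { struct toy_mnt *lower_mnt; };	/* pinned once, at mount time */
struct toy_dentry {
	struct toy_sb *sb;
	struct toy_dentry *lower;	/* plays the role of ->d_fsdata */
};
struct toy_path { struct toy_mnt *mnt; struct toy_dentry *dentry; };

/*
 * Build the lower path by value; no per-dentry allocation and no
 * per-dentry mount reference are involved.
 */
static struct toy_path toy_lower_path(const struct toy_dentry *d)
{
	return (struct toy_path){ .mnt = d->sb->lower_mnt, .dentry = d->lower };
}
```

Callers that used to pass around a pointer into per-dentry storage would take the address of a local instead, which is the shape the touch_atime() and dentry_open() hunks above have.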
* [PATCH 52/52] fs/namespace.c: sanitize descriptions for {__,}lookup_mnt()
  2025-08-25 4:43 ` [PATCH 01/52] fs/namespace.c: fix the namespace_sem guard mess Al Viro
  ` (49 preceding siblings ...)
  2025-08-25 4:43 ` [PATCH 51/52] ecryptfs: get rid of pointless mount references in ecryptfs dentries Al Viro
@ 2025-08-25 4:43 ` Al Viro
  2025-08-25 13:42 ` Christian Brauner
  2025-08-25 12:30 ` [PATCH 01/52] fs/namespace.c: fix the namespace_sem guard mess Christian Brauner
  51 siblings, 1 reply; 321+ messages in thread
From: Al Viro @ 2025-08-25 4:43 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: brauner, jack, torvalds

Comments regarding "shadow mounts" were stale - no such thing anymore.
Document the locking requirements for __lookup_mnt().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/namespace.c | 41 ++++++++++++-----------------------------
 1 file changed, 12 insertions(+), 29 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 82cab5459ec7..538313b3b7d9 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -825,24 +825,16 @@ static bool legitimize_mnt(struct vfsmount *bastard, unsigned seq)
 }
 
 /**
- * __lookup_mnt - find first child mount
+ * __lookup_mnt - mount hash lookup
  * @mnt:	parent mount
- * @dentry:	mountpoint
+ * @dentry:	dentry of mountpoint
  *
- * If @mnt has a child mount @c mounted @dentry find and return it.
+ * If @mnt has a child mount @c mounted on @dentry find and return it.
+ * Caller must either hold the spinlock component of @mount_lock or
+ * hold rcu_read_lock(), sample the seqcount component before the call
+ * and recheck it afterwards.
  *
- * Note that the child mount @c need not be unique. There are cases
- * where shadow mounts are created. For example, during mount
- * propagation when a source mount @mnt whose root got overmounted by a
- * mount @o after path lookup but before @namespace_sem could be
- * acquired gets copied and propagated. So @mnt gets copied including
- * @o. When @mnt is propagated to a destination mount @d that already
- * has another mount @n mounted at the same mountpoint then the source
- * mount @mnt will be tucked beneath @n, i.e., @n will be mounted on
- * @mnt and @mnt mounted on @d. Now both @n and @o are mounted at @mnt
- * on @dentry.
- *
- * Return: The first child of @mnt mounted @dentry or NULL.
+ * Return: The child of @mnt mounted on @dentry or %NULL.
  */
 struct mount *__lookup_mnt(struct vfsmount *mnt, struct dentry *dentry)
 {
@@ -855,21 +847,12 @@ struct mount *__lookup_mnt(struct vfsmount *mnt, struct dentry *dentry)
 	return NULL;
 }
 
-/*
- * lookup_mnt - Return the first child mount mounted at path
- *
- * "First" means first mounted chronologically. If you create the
- * following mounts:
- *
- * mount /dev/sda1 /mnt
- * mount /dev/sda2 /mnt
- * mount /dev/sda3 /mnt
- *
- * Then lookup_mnt() on the base /mnt dentry in the root mount will
- * return successively the root dentry and vfsmount of /dev/sda1, then
- * /dev/sda2, then /dev/sda3, then NULL.
+/**
+ * lookup_mnt - Return the child mount mounted at given location
+ * @path: location in the namespace
  *
- * lookup_mnt takes a reference to the found vfsmount.
+ * Acquires and returns a new reference to mount at given location
+ * or %NULL if nothing is mounted there.
  */
 struct vfsmount *lookup_mnt(const struct path *path)
 {
-- 
2.47.2

^ permalink raw reply related [flat|nested] 321+ messages in thread
* Re: [PATCH 52/52] fs/namespace.c: sanitize descriptions for {__,}lookup_mnt()
  2025-08-25 4:43 ` [PATCH 52/52] fs/namespace.c: sanitize descriptions for {__,}lookup_mnt() Al Viro
@ 2025-08-25 13:42 ` Christian Brauner
  0 siblings, 0 replies; 321+ messages in thread
From: Christian Brauner @ 2025-08-25 13:42 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, jack, torvalds

On Mon, Aug 25, 2025 at 05:43:55AM +0100, Al Viro wrote:
> Comments regarding "shadow mounts" were stale - no such thing anymore.
> Document the locking requirements for __lookup_mnt().
>
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---

Reviewed-by: Christian Brauner <brauner@kernel.org>

^ permalink raw reply [flat|nested] 321+ messages in thread
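The second locking option documented for __lookup_mnt() (sample the seqcount component, do the lookup, recheck) is the usual seqcount read protocol. A minimal userspace sketch of that protocol follows, assuming C11 atomics in place of the kernel primitives; all toy_* names are hypothetical, rcu_read_lock() is not modelled, and the retry loop is only one possible reaction to a failed recheck.

```c
#include <assert.h>
#include <stdatomic.h>

/* Toy seqcount: an odd value means a writer is in the middle of an update. */
struct toy_seqcount { atomic_uint seq; };

static struct toy_seqcount toy_mount_seq;
static _Atomic int toy_child_id;	/* stand-in for the mount hash contents */

/* Sample the counter, waiting out any update that is in progress. */
static unsigned toy_read_begin(struct toy_seqcount *s)
{
	unsigned seq;

	while ((seq = atomic_load(&s->seq)) & 1)
		;
	return seq;
}

/* Nonzero if any writer ran since the sample was taken. */
static int toy_read_retry(struct toy_seqcount *s, unsigned seq)
{
	return atomic_load(&s->seq) != seq;
}

/* Writer side; assumed to be serialized by a lock that is not modelled here. */
static void toy_write(struct toy_seqcount *s, int new_child)
{
	atomic_fetch_add(&s->seq, 1);	/* odd: update in progress */
	atomic_store(&toy_child_id, new_child);
	atomic_fetch_add(&s->seq, 1);	/* even: update complete */
}

/* Lockless lookup: sample, read, recheck; repeat if a writer intervened. */
static int toy_lookup(void)
{
	unsigned seq;
	int id;

	do {
		seq = toy_read_begin(&toy_mount_seq);
		id = atomic_load(&toy_child_id);
	} while (toy_read_retry(&toy_mount_seq, seq));
	return id;
}
```

A result obtained between the sample and a failed recheck must not be trusted; that is the point of the "recheck it afterwards" wording in the new kernel-doc.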
* Re: [PATCH 01/52] fs/namespace.c: fix the namespace_sem guard mess
  2025-08-25 4:43 ` [PATCH 01/52] fs/namespace.c: fix the namespace_sem guard mess Al Viro
  ` (50 preceding siblings ...)
  2025-08-25 4:43 ` [PATCH 52/52] fs/namespace.c: sanitize descriptions for {__,}lookup_mnt() Al Viro
@ 2025-08-25 12:30 ` Christian Brauner
  51 siblings, 0 replies; 321+ messages in thread
From: Christian Brauner @ 2025-08-25 12:30 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, jack, torvalds

On Mon, Aug 25, 2025 at 05:43:04AM +0100, Al Viro wrote:
> If anything, namespace_lock should be DEFINE_LOCK_GUARD_0, not DEFINE_GUARD.
> That way we
> 	* do not need to feed it a bogus argument
> 	* do not get gcc trying to compare an address of static in
> file variable with -4097 - and, if we are unlucky, trying to keep
> it in a register, with spills and all such.
>
> The same problems apply to grabbing namespace_sem shared.
>
> Rename it to namespace_excl, add namespace_shared, convert the existing users:
>
>     guard(namespace_lock, &namespace_sem) => guard(namespace_excl)()
>     guard(rwsem_read, &namespace_sem) => guard(namespace_shared)()
>     scoped_guard(namespace_lock, &namespace_sem) => scoped_guard(namespace_excl)
>     scoped_guard(rwsem_read, &namespace_sem) => scoped_guard(namespace_shared)
>
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---

Reviewed-by: Christian Brauner <brauner@kernel.org>

^ permalink raw reply [flat|nested] 321+ messages in thread
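The difference between a guard that takes the lock as an argument and a zero-argument one can be seen in a small userspace model. The following is a sketch only, assuming the GCC/Clang cleanup attribute that cleanup.h builds on; the toy lock, the TOY_GUARD_NS_EXCL() macro and the other toy_* names are hypothetical and are not the kernel's DEFINE_LOCK_GUARD_0 machinery.

```c
#include <assert.h>

/* Toy stand-in for namespace_sem: 0 = free, 1 = held exclusive. */
static int toy_ns_lock;

static inline int toy_ns_excl_enter(void)
{
	assert(toy_ns_lock == 0);	/* no recursion in this toy model */
	toy_ns_lock = 1;
	return 0;
}

/* Cleanup handlers receive a pointer to the guarded variable. */
static inline void toy_ns_excl_exit(int *token)
{
	(void)token;
	toy_ns_lock = 0;
}

/*
 * Zero-argument guard: the lock is implied by the guard's name, so the
 * caller passes nothing and there is no pointer to test or keep around.
 */
#define TOY_GUARD_NS_EXCL() \
	int toy_guard_ __attribute__((cleanup(toy_ns_excl_exit), unused)) = \
		toy_ns_excl_enter()

/* Returns the lock state seen inside the scope, or -1 on the early exit. */
static int toy_change_something(int fail)
{
	TOY_GUARD_NS_EXCL();

	if (fail)
		return -1;	/* the handler still runs on this path */
	return toy_ns_lock;	/* evaluated before the handler runs */
}
```

Every return path drops the lock without an explicit unlock call. The flip side is the one raised in the cover letter: the lock is held to the end of the enclosing scope, so anything that must not run under it has to be kept out of that scope.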
* Re: [PATCHED][RFC][CFT] mount-related stuff 2025-08-25 4:40 [PATCHED][RFC][CFT] mount-related stuff Al Viro 2025-08-25 4:43 ` [PATCH 01/52] fs/namespace.c: fix the namespace_sem guard mess Al Viro @ 2025-08-25 12:26 ` Christian Brauner 2025-08-25 12:43 ` Christian Brauner 2025-08-28 23:07 ` [PATCHES v2][RFC][CFT] " Al Viro 3 siblings, 0 replies; 321+ messages in thread From: Christian Brauner @ 2025-08-25 12:26 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, Linus Torvalds, Jan Kara So for fun I asked one of these A.I. tools wtf "CFT" actually means. And I have to say it did not disappoint: Looking at this Linux kernel mailing list context about mount-related patches, "CFT" likely stands for "Call For Testing" in Al Viro's typical terse style. But since you asked for alternative interpretations: - Can't Find Testers - Completely Funtested Trash - Christian's Frustration Trigger - Cryptic Filesystem Torture - Carefully Fabricated Terrorcode - Code For Torvalds - Chaotic Fs Tweaking - Crash Friendly Technology - Coffee Fueled Tinkering - Confusing Fsdevel Tradition I vote for "Carefully Fabricated Terrorcode". On Mon, Aug 25, 2025 at 05:40:46AM +0100, Al Viro wrote: > Most of this pile is basically an attempt to see how well do > cleanup.h-style mechanisms apply in mount handling. That stuff lives in > git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git #work.mount > Rebased to -rc3 (used to be a bit past -rc2, branched at mount fixes merge) > Individual patches in followups. > > Please, help with review and testing. It seems to survive the > local beating and code generation seems to be OK, but more testing > would be a good thing and I would really like to see comments on that > stuff. > > This is not all I've got around mount handling, but I'd rather > get that thing out for review before starting to sort out other local > mount-related branches. > > Series overview: > > Part 1: guards. 
> > This part starts with infrastructure, followed by one-by-one > conversions to the guard/scoped_guard in some of the places that fit > that well enough. Note that one of those places turned out to be taking > mount_lock for no reason whatsoever; I already see places where we do > write_seqlock when read_seqlock_excl would suffice, etc. > > Folks, _please_ don't do any bulk conversions in that area. > IMO one area where RAII becomes dangerous is locking; usually it's not > a big deal to delay freeing some object a bit, but delay dropping a > lock and you risk introducing deadlocks that will be bloody hard to spot. > It _has_ to be done carefully; we had trouble in that area several times > over the last year or so in fs/namespace.c alone. Another fun problem > is that quite a few comments regarding the locking in there are stale. > We still have the comments that talk about mount lock as if it had been > an rwlock-like thing. It hadn't been that for more than a decade now. > It needs to be documented sanely; so do the access rules to the data > structures involved. I hope to get some of that into the tree this cycle, > but it's still in progress. > > 1/52) fs/namespace.c: fix the namespace_sem guard mess > New guards: namespace_excl and namespace_shared. The former implies > the latter, as for anything rwsem-like. No inode locks, no dropping the final > references, no opening files, etc. in scope of those. > 2/52) introduced guards for mount_lock > New guards: mount_writer, mount_locked_reader. That's write_seqlock > and read_seqlock_excl on mount_lock; obviously, nothing blocking should be > done in scope of those. > 3/52) fs/namespace.c: allow to drop vfsmount references via __free(mntput) > Missing DEFINE_FREE (for mntput()); local in fs/namespace.c, to be > used only for keeping shit out of namespace_... and mount_... scopes. 
> 4/52) __detach_mounts(): use guards > 5/52) __is_local_mountpoint(): use guards > 6/52) do_change_type(): use guards > 7/52) do_set_group(): use guards > 8/52) mark_mounts_for_expiry(): use guards > 9/52) put_mnt_ns(): use guards > 10/52) mnt_already_visible(): use guards > a bunch of clear-cut conversions, with explanations of the reasons > why this or that guard is needed. > 11/52) check_for_nsfs_mounts(): no need to take locks > ... and here we have one where it turns out that locking had been > excessive. Iterating through a subtree in mount_locked_reader scope is > safe, all right, but (1) mount_writer is not needed here at all and (2) > namespace_shared + a reference held to the root of subtree is also enough. > All callers had (2) already. Documented the locking requirements for > function, removed {,un}lock_mount_hash() in it... > 12/52) propagate_mnt(): use scoped_guard(mount_locked_reader) for mnt_set_mountpoint() > This one is interesting - existing code had been equivalent to > scoped_guard(mount_locked_reader), and it's right for that call. However, > mnt_set_mountpoint() generally requires mount_writer - the only reason we > get away with that here is that the mount in question never had been > reachable from the mounts visible to other threads. > 13/52) has_locked_children(): use guards > 14/52) mnt_set_expiry(): use guards > 15/52) path_is_under(): use guards > more clear-cut conversions with explanations. > 16/52) current_chrooted(): don't bother with follow_down_one() > 17/52) current_chrooted(): use guards > this pair might be better off with #16 taken to the beginning > of the series (or to a separate branch merge into this one); no better > reason to do as I had than wanting to keep the guard infrastructure > in the very beginning. > > Part 2: turning unlock_mount() into __cleanup. > > Environment for mounting something on given location consists of: > 1) namespace_excl scope > 2) parent mount - the one we'll be attaching things to. 
> 3) mountpoint to be, protected from disappearing under us. > 4) inode of that mountpoint's dentry held exclusive. > Unfortunately, we can't take inode locks in namespace_excl scopes. > And we want to cope with the possibility that somebody has managed to > mount something on that place while we'd been taking locks. "Cope" part > is simple for finish_automount() ("drop our mount and go away quietly; > somebody triggered it before we did"), but for everything else it's > trickier - "use whatever's overmounting that place now (with the right > locks, please)". > lock_mount() does all of that (do_lock_mount(), actually), with > unlock_mount() closing the scope. And it's definitely a good candidate > for __cleanup()-based approach, except that > * the damn thing can return an error and conditional variants of that > infrastructure are too revolting. > * parent mount is returned in a fucking awful way - we modify the struct > path passed to us as location to mount on and then its ->mnt is the parent > to be... except for the "beneath" variant where we play convoluted games > with "no, here we want the parent of that". Implementation is also > vulnerable to umount propagation races. > * the structure we set up (everything except the parent) is inserted > into a linked list by lock_mount(). That excludes DEFINE_CLASS() - > it wants the value formed and then copied to the variable we are > defining. > * it contains an implicit namespace_excl scope, so path_put() and its > ilk *must* be done after the unlock_mount(). And most of the users have > gotos past that. > The first two problems are solved by adding an explicit pointer > to parent mount into struct pinned_mountpoint. Having lock_mount() > failure reported by setting it to ERR_PTR(-E...) allows to avoid the > problem with expressing the constructor failure. 
The third one is dealt > with by defining local macros to be used instead of CLASS - I went with > LOCK_MOUNT(mp, path) which defines struct pinned_mountpoint mp with > __cleanup(unlock_mount) and sets it up. If anybody has better suggestions, > I'll be glad to hear those. > The last one is dealt with by massaging the users to form that > would have all post-unlock_mount() stuff done by __free(). > > First, several trivial cleanups: > 18/52) do_move_mount(): trim local variables > 19/52) do_move_mount(): deal with the checks on old_path early > 20/52) move_mount(2): take sanity checks in 'beneath' case into do_lock_mount() > 21/52) finish_automount(): simplify the ELOOP check > > Getting rid of post-unlock_mount() stuff: > 22/52) do_loopback(): use __free(path_put) to deal with old_path > 23/52) pivot_root(2): use __free() to deal with struct path in it > 24/52) finish_automount(): take the lock_mount() analogue into a helper > this one turns the open-coded logics into lock_mount_exact() with > the same kind of calling conventions as lock_mount() and do_lock_mount() > 25/52) do_new_mount_rc(): use __free() to deal with dropping mnt on failure > 26/52) finish_automount(): use __free() to deal with dropping mnt on failure > > This is the main part: > 27/52) change calling conventions for lock_mount() et.al. > > Followups, cleaning up the games with parent mount in the user: > 28/52) do_move_mount(): use the parent mount returned by do_lock_mount() > 29/52) do_add_mount(): switch to passing pinned_mountpoint instead of mountpoint + path > 30/52) graft_tree(), attach_recursive_mnt() - pass pinned_mountpoint > > Part 3: getting rid of mutating struct path there. > > do_lock_mount() is still playing silly buggers with struct path it > had been given - the logics in that thing hadn't changed. 
It's not a pretty > function and it's racy as well; the thing is, by this point its users have > almost no use for the changed contents of struct path - dentry can be derived > from struct mountpoint, parent mount to use is provided directly and we > want that a lot more than modified path->mnt. There's only one place > (in can_move_mount_beneath()) where we still want that and it's not hard > to reconstruct the value by *original* path->mnt value + parent mount to > be used. > > Getting rid of ->dentry uses. > 31/52) pivot_root(2): use old_mp.mp->m_dentry instead of old.dentry > 32/52) don't bother passing new_path->dentry to can_move_mount_beneath() > > A helper, already open-coded in a couple of places; carved out of > the next patch to keep it reasonably small > 33/52) new helper: topmost_overmount() > > Rewrite of do_lock_mount() to keep path constant + trivial change > in do_move_mount() to adjust the argument it passes to can_move_mount_beneath(): > 34/52) do_lock_mount(): don't modify path. 
> > > Part 5: a bunch of trivial cleanups (mostly constifications) > > 35/52) constify check_mnt() > 36/52) do_mount_setattr(): constify path argument > 37/52) do_set_group(): constify path arguments > 38/52) drop_collected_paths(): constify arguments > 39/52) collect_paths(): constify the return value > 40/52) do_move_mount(), vfs_move_mount(), do_move_mount_old(): constify struct path argument(s) > 41/52) mnt_warn_timestamp_expiry(): constify struct path argument > 42/52) do_new_mount{,_fc}(): constify struct path argument > 43/52) do_{loopback,change_type,remount,reconfigure_mnt}(): constify struct path argument > 44/52) path_mount(): constify struct path argument > 45/52) may_copy_tree(), __do_loopback(): constify struct path argument > 46/52) path_umount(): constify struct path argument > 47/52) constify can_move_mount_beneath() arguments > 48/52) do_move_mount_old(): use __free(path_put) > 49/52) do_mount(): use __free(path_put) > > Part 6: assorted stuff, will grow. > > 50/52) umount_tree(): take all victims out of propagation graph at once > [had been earlier] > For each removed mount we need to calculate where the slaves > will end up. To avoid duplicating that work, do it for all mounts to be > removed at once, taking the mounts themselves out of propagation graph as > we go, then do all transfers; the duplicate work on finding destinations > is avoided since if we run into a mount that already had destination > found, we don't need to trace the rest of the way. That's guaranteed > O(removed mounts) for finding destinations and removing from propagation > graph and O(surviving mounts that have master removed) for transfers. 
> > 51/52) ecryptfs: get rid of pointless mount references in ecryptfs dentries > ->lower_path.mnt has the same value for all dentries on given > ecryptfs instance and if somebody goes for mountpoint-crossing variant > where that would not be true, we can deal with that when it happens > (and _not_ with duplicating these reference into each dentry). > As it is, we are better off just sticking a reference into > ecryptfs-private part of superblock and keeping it pinned until > ->kill_sb(). > That way we can stick a reference to underlying dentry right into > ->d_fsdata of ecryptfs one, getting rid of indirection through struct > ecryptfs_dentry_info, along with the entire struct ecryptfs_dentry_info > machinery. > > 52/52) fs/namespace.c: sanitize descriptions for {__,}lookup_mnt() > Comments regarding "shadow mounts" were stale - no such thing > anymore. Document the locking requirements for __lookup_mnt()... > > > FWIW, the current diffstat: > > fs/ecryptfs/dentry.c | 14 +- > fs/ecryptfs/ecryptfs_kernel.h | 27 +- > fs/ecryptfs/file.c | 15 +- > fs/ecryptfs/inode.c | 19 +- > fs/ecryptfs/main.c | 24 +- > fs/internal.h | 4 +- > fs/mount.h | 12 + > fs/namespace.c | 775 +++++++++++++++++++----------------------- > fs/pnode.c | 75 ++-- > fs/pnode.h | 1 + > include/linux/mount.h | 4 +- > kernel/audit_tree.c | 12 +- > 12 files changed, 464 insertions(+), 518 deletions(-) ^ permalink raw reply [flat|nested] 321+ messages in thread
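Part 2 of the quoted overview describes three ingredients: a structure set up in place rather than copied, failure reported through an ERR_PTR() value stored in its parent pointer, and a __cleanup() handler closing the scope. A userspace sketch of how those fit together is given below. It is an illustration under assumptions, not the code from the series: every toy_* name is hypothetical, the error-pointer helpers merely imitate the kernel's ERR_PTR()/IS_ERR()/PTR_ERR(), and the check inside the unlock handler is this sketch's own choice.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace imitation of the kernel's error-pointer encoding. */
#define TOY_MAX_ERRNO 4095
static inline void *toy_err_ptr(intptr_t err) { return (void *)err; }
static inline intptr_t toy_ptr_err(const void *p) { return (intptr_t)p; }
static inline int toy_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-TOY_MAX_ERRNO;
}

struct toy_mount { int id; };
struct toy_pinned { struct toy_mount *parent; };	/* or an encoded error */

static int toy_locks_held;	/* models "inside the locked scope" */

/* Fill in *mp in place; failure is encoded in ->parent, nothing is returned. */
static void toy_lock_mount(struct toy_pinned *mp, struct toy_mount *where)
{
	if (!where) {
		mp->parent = toy_err_ptr(-ENOENT);
		return;
	}
	toy_locks_held++;
	mp->parent = where;
}

/* Scope exit; undoes the lock only if taking it had succeeded. */
static void toy_unlock_mount(struct toy_pinned *mp)
{
	if (!toy_is_err(mp->parent))
		toy_locks_held--;
}

/* Declare the variable with its cleanup handler, then set it up in place. */
#define TOY_LOCK_MOUNT(mp, where) \
	struct toy_pinned mp __attribute__((cleanup(toy_unlock_mount))) = { NULL }; \
	toy_lock_mount(&mp, (where))

static intptr_t toy_do_mount(struct toy_mount *where)
{
	TOY_LOCK_MOUNT(mp, where);

	if (toy_is_err(mp.parent))
		return toy_ptr_err(mp.parent);
	return toy_locks_held;	/* 1 while the scope is open */
}
```

Because the variable is declared first and initialized in place, its address stays stable, which matters for a structure that gets linked into a list; a constructor that builds a value and copies it out would not give that.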
* Re: [PATCHED][RFC][CFT] mount-related stuff 2025-08-25 4:40 [PATCHED][RFC][CFT] mount-related stuff Al Viro 2025-08-25 4:43 ` [PATCH 01/52] fs/namespace.c: fix the namespace_sem guard mess Al Viro 2025-08-25 12:26 ` [PATCHED][RFC][CFT] mount-related stuff Christian Brauner @ 2025-08-25 12:43 ` Christian Brauner 2025-08-25 16:11 ` Al Viro 2025-08-28 23:07 ` [PATCHES v2][RFC][CFT] " Al Viro 3 siblings, 1 reply; 321+ messages in thread From: Christian Brauner @ 2025-08-25 12:43 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, Linus Torvalds, Jan Kara On Mon, Aug 25, 2025 at 05:40:46AM +0100, Al Viro wrote: > Most of this pile is basically an attempt to see how well do > cleanup.h-style mechanisms apply in mount handling. That stuff lives in > git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git #work.mount > Rebased to -rc3 (used to be a bit past -rc2, branched at mount fixes merge) > Individual patches in followups. > > Please, help with review and testing. It seems to survive the > local beating and code generation seems to be OK, but more testing > would be a good thing and I would really like to see comments on that > stuff. Btw, I just realized that basically none of your commits have any lore links in them. That kinda sucks because I very very often just look at a commit and then use the link to jump to the mailing list discussion for more context about a change and how it came about. So pretty please can you start adding lore links to your commits when applying if it's not fucking up your workflow too much? > > This is not all I've got around mount handling, but I'd rather > get that thing out for review before starting to sort out other local > mount-related branches. > > Series overview: > > Part 1: guards. > > This part starts with infrastructure, followed by one-by-one > conversions to the guard/scoped_guard in some of the places that fit > that well enough. 
Note that one of those places turned out to be taking > mount_lock for no reason whatsoever; I already see places where we do > write_seqlock when read_seqlock_excl would suffice, etc. > > Folks, _please_ don't do any bulk conversions in that area. > IMO one area where RAII becomes dangerous is locking; usually it's not > a big deal to delay freeing some object a bit, but delay dropping a > lock and you risk introducing deadlocks that will be bloody hard to spot. > It _has_ to be done carefully; we had trouble in that area several times > over the last year or so in fs/namespace.c alone. Another fun problem > is that quite a few comments regarding the locking in there are stale. > We still have the comments that talk about mount lock as if it had been > an rwlock-like thing. It hadn't been that for more than a decade now. > It needs to be documented sanely; so do the access rules to the data > structures involved. I hope to get some of that into the tree this cycle, > but it's still in progress. > > 1/52) fs/namespace.c: fix the namespace_sem guard mess > New guards: namespace_excl and namespace_shared. The former implies > the latter, as for anything rwsem-like. No inode locks, no dropping the final > references, no opening files, etc. in scope of those. > 2/52) introduced guards for mount_lock > New guards: mount_writer, mount_locked_reader. That's write_seqlock > and read_seqlock_excl on mount_lock; obviously, nothing blocking should be > done in scope of those. > 3/52) fs/namespace.c: allow to drop vfsmount references via __free(mntput) > Missing DEFINE_FREE (for mntput()); local in fs/namespace.c, to be > used only for keeping shit out of namespace_... and mount_... scopes. 
> 4/52) __detach_mounts(): use guards > 5/52) __is_local_mountpoint(): use guards > 6/52) do_change_type(): use guards > 7/52) do_set_group(): use guards > 8/52) mark_mounts_for_expiry(): use guards > 9/52) put_mnt_ns(): use guards > 10/52) mnt_already_visible(): use guards > a bunch of clear-cut conversions, with explanations of the reasons > why this or that guard is needed. > 11/52) check_for_nsfs_mounts(): no need to take locks > ... and here we have one where it turns out that locking had been > excessive. Iterating through a subtree in mount_locked_reader scope is > safe, all right, but (1) mount_writer is not needed here at all and (2) > namespace_shared + a reference held to the root of subtree is also enough. > All callers had (2) already. Documented the locking requirements for > function, removed {,un}lock_mount_hash() in it... > 12/52) propagate_mnt(): use scoped_guard(mount_locked_reader) for mnt_set_mountpoint() > This one is interesting - existing code had been equivalent to > scoped_guard(mount_locked_reader), and it's right for that call. However, > mnt_set_mountpoint() generally requires mount_writer - the only reason we > get away with that here is that the mount in question never had been > reachable from the mounts visible to other threads. > 13/52) has_locked_children(): use guards > 14/52) mnt_set_expiry(): use guards > 15/52) path_is_under(): use guards > more clear-cut conversions with explanations. > 16/52) current_chrooted(): don't bother with follow_down_one() > 17/52) current_chrooted(): use guards > this pair might be better off with #16 taken to the beginning > of the series (or to a separate branch merge into this one); no better > reason to do as I had than wanting to keep the guard infrastructure > in the very beginning. > > Part 2: turning unlock_mount() into __cleanup. > > Environment for mounting something on given location consists of: > 1) namespace_excl scope > 2) parent mount - the one we'll be attaching things to. 
> 3) mountpoint to be, protected from disappearing under us. > 4) inode of that mountpoint's dentry held exclusive. > Unfortunately, we can't take inode locks in namespace_excl scopes. > And we want to cope with the possibility that somebody has managed to > mount something on that place while we'd been taking locks. "Cope" part > is simple for finish_automount() ("drop our mount and go away quietly; > somebody triggered it before we did"), but for everything else it's > trickier - "use whatever's overmounting that place now (with the right > locks, please)". > lock_mount() does all of that (do_lock_mount(), actually), with > unlock_mount() closing the scope. And it's definitely a good candidate > for __cleanup()-based approach, except that > * the damn thing can return an error and conditional variants of that > infrastructure are too revolting. > * parent mount is returned in a fucking awful way - we modify the struct > path passed to us as location to mount on and then its ->mnt is the parent > to be... except for the "beneath" variant where we play convoluted games > with "no, here we want the parent of that". Implementation is also > vulnerable to umount propagation races. > * the structure we set up (everything except the parent) is inserted > into a linked list by lock_mount(). That excludes DEFINE_CLASS() - > it wants the value formed and then copied to the variable we are > defining. > * it contains an implicit namespace_excl scope, so path_put() and its > ilk *must* be done after the unlock_mount(). And most of the users have > gotos past that. > The first two problems are solved by adding an explicit pointer > to parent mount into struct pinned_mountpoint. Having lock_mount() > failure reported by setting it to ERR_PTR(-E...) allows to avoid the > problem with expressing the constructor failure. 
The third one is dealt > with by defining local macros to be used instead of CLASS - I went with > LOCK_MOUNT(mp, path) which defines struct pinned_mountpoint mp with > __cleanup(unlock_mount) and sets it up. If anybody has better suggestions, > I'll be glad to hear those. > The last one is dealt with by massaging the users to form that > would have all post-unlock_mount() stuff done by __free(). > > First, several trivial cleanups: > 18/52) do_move_mount(): trim local variables > 19/52) do_move_mount(): deal with the checks on old_path early > 20/52) move_mount(2): take sanity checks in 'beneath' case into do_lock_mount() > 21/52) finish_automount(): simplify the ELOOP check > > Getting rid of post-unlock_mount() stuff: > 22/52) do_loopback(): use __free(path_put) to deal with old_path > 23/52) pivot_root(2): use __free() to deal with struct path in it > 24/52) finish_automount(): take the lock_mount() analogue into a helper > this one turns the open-coded logics into lock_mount_exact() with > the same kind of calling conventions as lock_mount() and do_lock_mount() > 25/52) do_new_mount_rc(): use __free() to deal with dropping mnt on failure > 26/52) finish_automount(): use __free() to deal with dropping mnt on failure > > This is the main part: > 27/52) change calling conventions for lock_mount() et.al. > > Followups, cleaning up the games with parent mount in the user: > 28/52) do_move_mount(): use the parent mount returned by do_lock_mount() > 29/52) do_add_mount(): switch to passing pinned_mountpoint instead of mountpoint + path > 30/52) graft_tree(), attach_recursive_mnt() - pass pinned_mountpoint > > Part 3: getting rid of mutating struct path there. > > do_lock_mount() is still playing silly buggers with struct path it > had been given - the logics in that thing hadn't changed. 
It's not a pretty > function and it's racy as well; the thing is, by this point its users have > almost no use for the changed contents of struct path - dentry can be derived > from struct mountpoint, parent mount to use is provided directly and we > want that a lot more than modified path->mnt. There's only one place > (in can_move_mount_beneath()) where we still want that and it's not hard > to reconstruct the value by *original* path->mnt value + parent mount to > be used. > > Getting rid of ->dentry uses. > 31/52) pivot_root(2): use old_mp.mp->m_dentry instead of old.dentry > 32/52) don't bother passing new_path->dentry to can_move_mount_beneath() > > A helper, already open-coded in a couple of places; carved out of > the next patch to keep it reasonably small > 33/52) new helper: topmost_overmount() > > Rewrite of do_lock_mount() to keep path constant + trivial change > in do_move_mount() to adjust the argument it passes to can_move_mount_beneath(): > 34/52) do_lock_mount(): don't modify path. 
> > > Part 4: a bunch of trivial cleanups (mostly constifications) > > 35/52) constify check_mnt() > 36/52) do_mount_setattr(): constify path argument > 37/52) do_set_group(): constify path arguments > 38/52) drop_collected_paths(): constify arguments > 39/52) collect_paths(): constify the return value > 40/52) do_move_mount(), vfs_move_mount(), do_move_mount_old(): constify struct path argument(s) > 41/52) mnt_warn_timestamp_expiry(): constify struct path argument > 42/52) do_new_mount{,_fc}(): constify struct path argument > 43/52) do_{loopback,change_type,remount,reconfigure_mnt}(): constify struct path argument > 44/52) path_mount(): constify struct path argument > 45/52) may_copy_tree(), __do_loopback(): constify struct path argument > 46/52) path_umount(): constify struct path argument > 47/52) constify can_move_mount_beneath() arguments > 48/52) do_move_mount_old(): use __free(path_put) > 49/52) do_mount(): use __free(path_put) > > Part 5: assorted stuff, will grow. > > 50/52) umount_tree(): take all victims out of propagation graph at once > [had been earlier] > For each removed mount we need to calculate where the slaves > will end up. To avoid duplicating that work, do it for all mounts to be > removed at once, taking the mounts themselves out of propagation graph as > we go, then do all transfers; the duplicate work on finding destinations > is avoided since if we run into a mount that already had destination > found, we don't need to trace the rest of the way. That's guaranteed > O(removed mounts) for finding destinations and removing from propagation > graph and O(surviving mounts that have master removed) for transfers.
> > 51/52) ecryptfs: get rid of pointless mount references in ecryptfs dentries > ->lower_path.mnt has the same value for all dentries on given > ecryptfs instance and if somebody goes for mountpoint-crossing variant > where that would not be true, we can deal with that when it happens > (and _not_ with duplicating these reference into each dentry). > As it is, we are better off just sticking a reference into > ecryptfs-private part of superblock and keeping it pinned until > ->kill_sb(). > That way we can stick a reference to underlying dentry right into > ->d_fsdata of ecryptfs one, getting rid of indirection through struct > ecryptfs_dentry_info, along with the entire struct ecryptfs_dentry_info > machinery. > > 52/52) fs/namespace.c: sanitize descriptions for {__,}lookup_mnt() > Comments regarding "shadow mounts" were stale - no such thing > anymore. Document the locking requirements for __lookup_mnt()... > > > FWIW, the current diffstat: > > fs/ecryptfs/dentry.c | 14 +- > fs/ecryptfs/ecryptfs_kernel.h | 27 +- > fs/ecryptfs/file.c | 15 +- > fs/ecryptfs/inode.c | 19 +- > fs/ecryptfs/main.c | 24 +- > fs/internal.h | 4 +- > fs/mount.h | 12 + > fs/namespace.c | 775 +++++++++++++++++++----------------------- > fs/pnode.c | 75 ++-- > fs/pnode.h | 1 + > include/linux/mount.h | 4 +- > kernel/audit_tree.c | 12 +- > 12 files changed, 464 insertions(+), 518 deletions(-) ^ permalink raw reply [flat|nested] 321+ messages in thread
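The LOCK_MOUNT() idea described in the cover letter quoted above (declare the structure with __cleanup(), set it up in place, report constructor failure through an ERR_PTR() stored in the parent pointer) can be sketched in userspace C. This is only an illustration under stated assumptions, not the code from the series: struct mount, the lock_depth counter and do_something() are made up for the example, the ERR_PTR()/IS_ERR()/PTR_ERR() helpers are simplified stand-ins for the kernel ones, and __attribute__((cleanup)) is the GCC/Clang extension that the kernel's __cleanup() wraps.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stddef.h>

/* Simplified userspace stand-ins for the kernel helpers of the same name. */
#define MAX_ERRNO 4095
static inline void *ERR_PTR(long err) { return (void *)err; }
static inline long PTR_ERR(const void *p) { return (long)p; }
static inline int IS_ERR(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

struct mount { int id; };	/* hypothetical, for illustration only */

struct pinned_mountpoint {
	struct mount *parent;	/* parent mount, or ERR_PTR(-E...) on failure */
};

static int lock_depth;		/* toy replacement for the real locks */

/* "Constructor": fills *mp in place; failure is reported via ->parent. */
static void lock_mount(struct mount *where, struct pinned_mountpoint *mp)
{
	if (!where) {
		mp->parent = ERR_PTR(-ENOENT);
		return;
	}
	lock_depth++;
	mp->parent = where;
}

/* "Destructor": runs at scope exit, must cope with a failed constructor. */
static void unlock_mount(struct pinned_mountpoint *mp)
{
	if (!IS_ERR(mp->parent))
		lock_depth--;
}

/*
 * Declare the variable with the cleanup attached, then set it up in place.
 * Nothing is copied, which is what a list-linked structure needs.
 */
#define LOCK_MOUNT(mp, where) \
	struct pinned_mountpoint mp __attribute__((cleanup(unlock_mount))); \
	lock_mount((where), &mp)

/* Hypothetical user: no gotos, no explicit unlock on any return path. */
static int do_something(struct mount *where)
{
	LOCK_MOUNT(mp, where);
	if (IS_ERR(mp.parent))
		return (int)PTR_ERR(mp.parent);
	return mp.parent->id;	/* unlock_mount() runs after this is evaluated */
}
```

Calling do_something() on a valid mount returns its id and leaves lock_depth at 0; calling it with NULL returns -ENOENT without the destructor trying to drop a lock that was never taken. Anything that must happen after the unlock (the path_put() case in the text) would have to be declared before LOCK_MOUNT() so that its own cleanup runs later.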
* Re: [PATCHED][RFC][CFT] mount-related stuff 2025-08-25 12:43 ` Christian Brauner @ 2025-08-25 16:11 ` Al Viro 2025-08-25 17:43 ` Al Viro 0 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-25 16:11 UTC (permalink / raw) To: Christian Brauner; +Cc: linux-fsdevel, Linus Torvalds, Jan Kara On Mon, Aug 25, 2025 at 02:43:43PM +0200, Christian Brauner wrote: > On Mon, Aug 25, 2025 at 05:40:46AM +0100, Al Viro wrote: > > Most of this pile is basically an attempt to see how well do > > cleanup.h-style mechanisms apply in mount handling. That stuff lives in > > git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git #work.mount > > Rebased to -rc3 (used to be a bit past -rc2, branched at mount fixes merge) > > Individual patches in followups. > > > > Please, help with review and testing. It seems to survive the > > local beating and code generation seems to be OK, but more testing > > would be a good thing and I would really like to see comments on that > > stuff. > > Btw, I just realized that basically none of your commits have any lore > links in them. That kinda sucks because I very very often just look at a > commit and then use the link to jump to the mailing list discussion for > more context about a change and how it came about. > > So pretty please can you start adding lore links to your commits when > applying if it's not fucking up your workflow too much? Links to what, at the first posting? Confused... ^ permalink raw reply [flat|nested] 321+ messages in thread
* Re: [PATCHED][RFC][CFT] mount-related stuff 2025-08-25 16:11 ` Al Viro @ 2025-08-25 17:43 ` Al Viro 2025-08-25 20:18 ` Theodore Ts'o 2025-08-26 8:56 ` Christian Brauner 0 siblings, 2 replies; 321+ messages in thread From: Al Viro @ 2025-08-25 17:43 UTC (permalink / raw) To: Christian Brauner; +Cc: linux-fsdevel, Linus Torvalds, Jan Kara On Mon, Aug 25, 2025 at 05:11:14PM +0100, Al Viro wrote: > On Mon, Aug 25, 2025 at 02:43:43PM +0200, Christian Brauner wrote: > > On Mon, Aug 25, 2025 at 05:40:46AM +0100, Al Viro wrote: > > > Most of this pile is basically an attempt to see how well do > > > cleanup.h-style mechanisms apply in mount handling. That stuff lives in > > > git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git #work.mount > > > Rebased to -rc3 (used to be a bit past -rc2, branched at mount fixes merge) > > > Individual patches in followups. > > > > > > Please, help with review and testing. It seems to survive the > > > local beating and code generation seems to be OK, but more testing > > > would be a good thing and I would really like to see comments on that > > > stuff. > > > > Btw, I just realized that basically none of your commits have any lore > > links in them. That kinda sucks because I very very often just look at a > > commit and then use the link to jump to the mailing list discussion for > > more context about a change and how it came about. > > > > So pretty please can you start adding lore links to your commits when > > applying if it's not fucking up your workflow too much? > > Links to what, at the first posting? Confused... I mean, this _is_ what I hope would be a discussion of that stuff - that's what request for comments stands for, after all. How is that supposed to work? Going back through the queue and slapping lore links at the same time as the reviewed-by etc. are applied? 
I honestly have no idea what practice do you have in mind - ~95% of the time I'm sitting in nvi - it serves as IDE for me; mutt takes a large part of the rest. Browser is something that gets used occasionally when I have to... ^ permalink raw reply [flat|nested] 321+ messages in thread
* Re: [PATCHED][RFC][CFT] mount-related stuff 2025-08-25 17:43 ` Al Viro @ 2025-08-25 20:18 ` Theodore Ts'o 2025-08-26 8:56 ` Christian Brauner 1 sibling, 0 replies; 321+ messages in thread From: Theodore Ts'o @ 2025-08-25 20:18 UTC (permalink / raw) To: Al Viro; +Cc: Christian Brauner, linux-fsdevel, Linus Torvalds, Jan Kara On Mon, Aug 25, 2025 at 06:43:12PM +0100, Al Viro wrote: > I mean, this _is_ what I hope would be a discussion of that stuff - > that's what request for comments stands for, after all. How is that > supposed to work? Going back through the queue and slapping lore links > at the same time as the reviewed-by etc. are applied? Lore links are useful when a maintainer is applying someone else's patches into their git tree. I think that's what Christian was thinking about. In this case, however, where the maintainer is the one authoring/sending the patches, there is the chicken-and-egg problem that you've described, and so I don't understand why Christian has made that request. Usually I just construct the lore URL from the Message ID from the patch series, but what I've seen other folks do for very large patch sets is that they'll also publish the patches on git, for example from Darrick's recent fuse/iomap patches, he included a link in the patchset cover letter to: https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-iomap-cache - Ted ^ permalink raw reply [flat|nested] 321+ messages in thread
* Re: [PATCHED][RFC][CFT] mount-related stuff 2025-08-25 17:43 ` Al Viro 2025-08-25 20:18 ` Theodore Ts'o @ 2025-08-26 8:56 ` Christian Brauner 2025-08-27 17:19 ` Linus Torvalds 1 sibling, 1 reply; 321+ messages in thread From: Christian Brauner @ 2025-08-26 8:56 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, Linus Torvalds, Jan Kara On Mon, Aug 25, 2025 at 06:43:12PM +0100, Al Viro wrote: > On Mon, Aug 25, 2025 at 05:11:14PM +0100, Al Viro wrote: > > On Mon, Aug 25, 2025 at 02:43:43PM +0200, Christian Brauner wrote: > > > On Mon, Aug 25, 2025 at 05:40:46AM +0100, Al Viro wrote: > > > > Most of this pile is basically an attempt to see how well do > > > > cleanup.h-style mechanisms apply in mount handling. That stuff lives in > > > > git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git #work.mount > > > > Rebased to -rc3 (used to be a bit past -rc2, branched at mount fixes merge) > > > > Individual patches in followups. > > > > > > > > Please, help with review and testing. It seems to survive the > > > > local beating and code generation seems to be OK, but more testing > > > > would be a good thing and I would really like to see comments on that > > > > stuff. > > > > > > Btw, I just realized that basically none of your commits have any lore > > > links in them. That kinda sucks because I very very often just look at a > > > commit and then use the link to jump to the mailing list discussion for > > > more context about a change and how it came about. > > > > > > So pretty please can you start adding lore links to your commits when > > > applying if it's not fucking up your workflow too much? > > > > Links to what, at the first posting? Confused... > > I mean, this _is_ what I hope would be a discussion of that stuff - > that's what request for comments stands for, after all. How is that > supposed to work? Going back through the queue and slapping lore links > at the same time as the reviewed-by etc. are applied? 
I honestly have > no idea what practice do you have in mind - ~95% of the time I'm sitting > in nvi - it serves as IDE for me; mutt takes a large part of the rest. > Browser is something that gets used occasionally when I have to... You misunderstand. Once you apply your series to the tree that you intend to merge simply add the lore links to the patches of the last version. I don't give a single damn whether someone _sends_ patches with lore links. That is not what this is about. I care that I can git log at mainline and figure out where that patch was discussed, pull down the discussion via b4 or other tooling, without having to search lore. IOW, what I asked you about is once the patches end up in mainline they please have links to the discussion where they came from. I do it for all patches no matter if I pick them up from someone else or if I'm applying my own: commit c237aa9884f238e1480897463ca034877ca7530b Author: Christian Brauner <brauner@kernel.org> kernfs: don't fail listing extended attributes <snip> Link: https://lore.kernel.org/20250819-ahndung-abgaben-524a535f8101@brauner ^^^^^^^^^^^^^^^^^ Signed-off-by: Christian Brauner <brauner@kernel.org> I'm not doing that for my own personal wellness cure but for every other poor bastard (granted, including me because one year later it's all swapped out) who looks at commits in the git tree and wants to either jump to a link in the browser or wants to use tooling to just pull the whole discussion from the list. ^ permalink raw reply [flat|nested] 321+ messages in thread
* Re: [PATCHED][RFC][CFT] mount-related stuff 2025-08-26 8:56 ` Christian Brauner @ 2025-08-27 17:19 ` Linus Torvalds 2025-08-27 17:49 ` Linus Torvalds 0 siblings, 1 reply; 321+ messages in thread From: Linus Torvalds @ 2025-08-27 17:19 UTC (permalink / raw) To: Christian Brauner; +Cc: Al Viro, linux-fsdevel, Jan Kara On Tue, 26 Aug 2025 at 01:56, Christian Brauner <brauner@kernel.org> wrote: > > I'm not doing that for my own personal wellness cure Please only do this for things that were actually discussed. Because for *my* wellness cure, I get really damn annoyed when I wonder about some context of a commit, and follow a link to look at the background, and all I see is that SAME DAMN PATCH that I already looked at, and wondered about, then that link damn well wasted my time. It's annoying as hell. And no, some "maybe people add acks or context later" is not a valid reason to add a link. If there was no discussion about it at the time it was committed, a link to some mailing list posting by definition doesn't explain why the commit exists. Linus ^ permalink raw reply [flat|nested] 321+ messages in thread
* Re: [PATCHED][RFC][CFT] mount-related stuff 2025-08-27 17:19 ` Linus Torvalds @ 2025-08-27 17:49 ` Linus Torvalds 2025-08-27 22:49 ` Konstantin Ryabitsev 0 siblings, 1 reply; 321+ messages in thread From: Linus Torvalds @ 2025-08-27 17:49 UTC (permalink / raw) To: Christian Brauner; +Cc: Al Viro, linux-fsdevel, Jan Kara On Wed, 27 Aug 2025 at 10:19, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > And no, some "maybe people add acks or context later" is not a valid > reason to add a link. If there was no discussion about it at the time > it was committed, a link to some mailing list posting by definition > doesn't explain why the commit exists. Side note: relevant later discussion of patches obviously does happen, but it's actually more likely to be independent of the mailing list posting, and instead refer to the commit ID - and the shortlog of the commit - than to the original posting. Yes, some bots do obviously traverse the mailing list for patch series to look at and test, but those bots are the ones that the developer / maintainer should have reacted to *before* the commit goes upstream, so finding them after-the-fact is simply not a high priority. A much more common thing is that the "context added later" is a result of people and bots reporting problems with a commit that has hit the git trees, and they do *not* generally reply to the original posting. So instead those much more relevant reports will typically make an entirely new thread, mentioning the commit ID and the subject line. Which is why I think it is so bass-ackwards to add a link to the posting in the commit. That literally is useless garbage unless the posting generated discussion. The link to the posting is not likely to be the most relevant thing: it tends to be *much* more productive to instead search lore for the commit ID and the subject line of the commit. 
That will obviously find the original posting of the patch too, but it will *also* find those much more relevant and likely reports about people/bots reporting issues with a commit in the git tree. This is why I hate those pointless links so much. They are worthless garbage. And the "but maybe somebody adds context later" is intellectually dishonest, since that later context is likely *not* found behind that link, but through other means entirely. Linus ^ permalink raw reply [flat|nested] 321+ messages in thread
* Re: [PATCHED][RFC][CFT] mount-related stuff 2025-08-27 17:49 ` Linus Torvalds @ 2025-08-27 22:49 ` Konstantin Ryabitsev 2025-08-27 23:40 ` Linus Torvalds 0 siblings, 1 reply; 321+ messages in thread From: Konstantin Ryabitsev @ 2025-08-27 22:49 UTC (permalink / raw) To: Linus Torvalds; +Cc: Christian Brauner, Al Viro, linux-fsdevel, Jan Kara On Wed, Aug 27, 2025 at 10:49:21AM -0700, Linus Torvalds wrote: > Which is why I think it is so bass-ackwards to add a link to the > posting in the commit. That literally is useless garbage unless the > posting generated discussion. The link to the posting is not likely to > be the most relevant thing: it tends to be *much* more productive to > instead search lore for the commit ID and the subject line of the > commit. Main trouble is that we can't always reliably arrive at the source of the patch in lore based on the commit. The subject line can be tricky to search for if it uses quotes, brackets, or other characters that aren't reliably tokenized. Furthermore, there can be situations where the results can be ambiguous. For example, a [PATCH v7] could have been posted after the maintainer had already accepted [PATCH v6], in which case the maintainer will ask for a new bugfix series to be sent instead. Similarly, we can't reliably go from the commit to the patch-id that we can use to search the archives: - the maintainer may have rebased the patch series, resulting in a different patch-id - the original submission may have been generated with a different patch algorithm (histogram vs. myers is the usual culprit) - the maintainer may have tweaked the patch for cosmetic reasons All of the above may result in a different git-patch-id that no longer matches the original submission. I have recommended that Link: trailers indicating the provenance of the series should use a dedicated domain name: patch.msgid.link. 
This should clearly indicate to you that following this link will take you to the original submission, not to any other discussion. I haven't yet made this the default in b4, but I should probably do that. Anyone can already make this their default by setting the following in their .gitconfig: [b4] linkmask = https://patch.msgid.link/%s -K ^ permalink raw reply [flat|nested] 321+ messages in thread
* Re: [PATCHED][RFC][CFT] mount-related stuff 2025-08-27 22:49 ` Konstantin Ryabitsev @ 2025-08-27 23:40 ` Linus Torvalds 2025-08-28 0:41 ` Konstantin Ryabitsev 0 siblings, 1 reply; 321+ messages in thread From: Linus Torvalds @ 2025-08-27 23:40 UTC (permalink / raw) To: Konstantin Ryabitsev; +Cc: Christian Brauner, Al Viro, linux-fsdevel, Jan Kara On Wed, 27 Aug 2025 at 15:49, Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote: > > I have recommended that Link: trailers indicating the provenance of the series > should use a dedicated domain name: patch.msgid.link. This should clearly > indicate to you that following this link will take you to the original > submission, not to any other discussion. That doesn't fix anything. It only reinforces the basic stupidity of marking the WRONG DIRECTION. The fact is, YOU CANNOT SANELY MARK THE COMMIT. Dammit, why do people ignore this *fundamental* issue? You literally cannot add information to the commit that doesn't exist yet, and the threads that refer to bugs etc quite fundamentally WILL NOT EXIST YET when the commit is posted. The actual *useful* information about a commit is the discussions it resulted in, not the posting of the patch. And those will almost invariably be unrelated to the patch submission, since they either talked about the problems that the patch *fixed*, or talk about the problems that the patch *caused* (ie the thread starts with some random "My machine no longer boots", and then goes on from there as people try to figure out what caused it). So the *relevant* links are pretty much by definition not the link to the posting of the patch. Is it really so hard to understand and accept this fundamental issue? It's the *message* that should be indexed and marked, not the commit. What you want to find is messages on the mailing list that mention the commit, not the other way around. The other way around is completely pointless and CANNOT BE AUTOMATED.
Any automation by definition will only add noise, not "information". Really. The only valid link is a link to *pre-existing* discussion, not to some stupid "this is where I posted this patch", which is entirely and utterly immaterial. And dammit, lore could do this. Here's one suggested model that at least gets the direction of indexing right (I'm not claiming it's the only model, or the best model, but it sure as hell beats getting the fundamentals completely wrong): (a) messages with patches can be indexed by the patch-id of said patch This might well be useful in its own right ("search for this patch"), and would be good for the series where the same patch ends up being re-posted because the whole series was re-posted. IOW, just that trivial thing would already allow the lore web interface to link to "this patch has been posted before", which is useful information on its own, totally aside from any future archeology. But it's not the end goal, it's only a small step to *get* to the end goal: (b) messages that mention a commit ID (or a subject line) could then have referrals to the patch-id of said commit. No, you don't want to do a whole-text search every time you look for a commit. That's fine for manual stuff, but it's much too expensive for any sane automation. But you *can* (and lore already does) scan messages at message posting time, and find when people refer to a commit, and then index that message *once* by the patch ID of the commit. Now, this *is* fundamentally useful in a very different way: if you have somebody who bisected something and mentions a commit as a result, you'd now *find* that kind of message, and the history leading up to it. So when people read threads on lore about bugs being bisected, think how useful it would be if that thread would basically auto-populate with "this message refers to this patch". 
And the final step is (c) have some 'b4' infrastructure to look up emails pertaining to a commit - by doing the patch ID and then looking up the indexing above Look, now you have a "open web browser with the history of not just where the patch was originally posted, but where that commit was *mentioned*". Notice how fundamentally more useful this is from some link to where the patch was posted? And absolutely nothing in the above implies tagging the commit with useless information. I look at the "Link:" tags quite regularly, and I can tell you that when it's a posting tag, it almost invariably is completely and totally useless. We *have* people who add those, and they only add noise and very little value. Do not add more of those useless garbage links in the name of "automation". It's not automating anything useful, it's only automating garbage. Because the *commit* already has all the information that is relevant - it's not the commit that is missing a link. It's the other side. Which is why those links to lore patch submission events are so STUPID. They add nothing. Doing them in the name of "automation" is crazy. It's entirely pointless. It's garbage and it's mis-designed, because it's not understanding the problem. Linus ^ permalink raw reply [flat|nested] 321+ messages in thread
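The (a)/(b)/(c) model in the message above amounts to a reverse index keyed by patch-id that is filled once per message at posting time and queried later from the commit side. A minimal sketch in C, purely illustrative: it is not how lore or b4 are implemented, every name in it (struct index_entry, index_message(), lookup_patch(), the sample IDs) is hypothetical, and resolving a mentioned commit ID to a patch-id is assumed to have happened before index_message() is called.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_ENTRIES 64

/* Why a message is filed under a patch-id. */
enum ref_kind {
	REF_POSTING,	/* (a) the message carries the patch itself */
	REF_MENTION,	/* (b) the message mentions the resulting commit */
};

struct index_entry {
	const char *patch_id;	/* key; caller keeps the string alive */
	const char *msgid;	/* Message-ID of the indexed mail */
	enum ref_kind kind;
};

static struct index_entry patch_index[MAX_ENTRIES];
static size_t patch_index_len;

/* Called once per message at posting time. Returns 0, or -1 if full. */
static int index_message(const char *patch_id, const char *msgid,
			 enum ref_kind kind)
{
	if (patch_index_len == MAX_ENTRIES)
		return -1;
	patch_index[patch_index_len++] =
		(struct index_entry){ patch_id, msgid, kind };
	return 0;
}

/*
 * (c) look up everything filed under one patch-id: the original posting,
 * reposts, and later threads that mention the commit. Stores up to max
 * hits in out[] and returns how many were stored.
 */
static size_t lookup_patch(const char *patch_id,
			   const struct index_entry **out, size_t max)
{
	size_t n = 0;

	for (size_t i = 0; i < patch_index_len && n < max; i++)
		if (strcmp(patch_index[i].patch_id, patch_id) == 0)
			out[n++] = &patch_index[i];
	return n;
}
```

The point of the design is the direction of the arrow: nothing is written into the commit, and a bisection report posted months later still shows up in the lookup because it was indexed when it arrived. A real implementation would use a proper keyed store instead of a linear scan over a fixed array.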
* Re: [PATCHED][RFC][CFT] mount-related stuff 2025-08-27 23:40 ` Linus Torvalds @ 2025-08-28 0:41 ` Konstantin Ryabitsev 2025-08-28 1:00 ` Al Viro 2025-08-28 1:29 ` Linus Torvalds 0 siblings, 2 replies; 321+ messages in thread From: Konstantin Ryabitsev @ 2025-08-28 0:41 UTC (permalink / raw) To: Linus Torvalds; +Cc: Christian Brauner, Al Viro, linux-fsdevel, Jan Kara On Wed, Aug 27, 2025 at 04:40:58PM -0700, Linus Torvalds wrote: > On Wed, 27 Aug 2025 at 15:49, Konstantin Ryabitsev > <konstantin@linuxfoundation.org> wrote: > > > > I have recommended that Link: trailers indicating the provenance of the series > > should use a dedicated domain name: patch.msgid.link. This should clearly > > indicate to you that following this link will take you to the original > > submission, not to any other discussion. > > That doesn't fix anything. It only reinforces the basic stupidity of > marking the WRONG DIRECTION. > > The fact is, YOU CANNOT SANELY MARK THE COMMIT. Dammit, why do people > ignore this *fundamental* issue? You literally cannot add information > to the commit that doesn't exist yet, and the threads that refer to > bugs etc quite fundamentally WILL NOT EXIST YET when the commit is > posted. I'm not sure what you mean. The Link: trailer is added when the maintainer pulls in the series into their tree. It's not put there by the submitter. The maintainer marks a reliable mapping of "this commit came from this thread" and we then use this info for multiple purposes: 1. letting the submitter know when their series is accepted into the maintainer's tree 2. marking the series as "mainlined" when we find that commit in your tree 3. it reliably marks provenance for tools like cregit, which largely have to guess this info It serves a real purpose. > It's the *message* that should be indexed and marked, not the commit. We cannot *reliably* map commits to patches.
A commit can be represented as any number of patches, all resulting in different patch-id's -- it can be generated with a different number of context lines, with a different patch algorithm, it could have been rebased, etc. Maintainers do edit patches they receive, including the subject lines. I know, because attempting to automate things without a provenance Link: results in false-positives for projects like netdev. > Really. The only valid link is a link to *pre-existing* discussion, > not to some stupid "this is where I posted this patch", which is > entirely and utterly immaterial. > > And dammit, lore could do this. Here's one suggested model that at > least gets the direction of indexing right (I'm not claiming it's the > only model, or the best model, but it sure as hell beats getting the > fundamentals completely wrong): > > (a) messages with patches can be indexed by the patch-id of said patch They already do, it's been there for a long time now. Here's a random one: https://lore.kernel.org/lkml/?q=patchid%3A09b124c33929efcffe0ce8df0a805f54d5962f60 > This might well be useful in its own right ("search for this patch"), > and would be good for the series where the same patch ends up being > re-posted because the whole series was re-posted. This is how we are able to pull in trailers sent to previous series, if the patch-id hasn't changed. > IOW, just that trivial thing would already allow the lore web > interface to link to "this patch has been posted before", which is > useful information on its own, totally aside from any future > archeology. > > But it's not the end goal, it's only a small step to *get* to the end goal: > > (b) messages that mention a commit ID (or a subject line) could then > have referrals to the patch-id of said commit. To reiterate, a commit is not a patch, so *we cannot reliably arrive from a commit to always the same patch-id*. 
We've discovered it the hard way when you recommended that people send you patches with --histogram and we suddenly could no longer reliably map commits to patches, because on our end we generated patches with the default (myers) and they did not match the patches generated with --histogram, so our automation broke. This is what I am trying to convey -- commits don't reliably map to patches, because the same commit can generate any number of perfectly valid patches, all with different patch-id's. -K ^ permalink raw reply [flat|nested] 321+ messages in thread
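The point about one commit yielding several valid patches can be shown with a toy example: the two diffs below both turn the same old file into the same new file, yet a content hash over the diff text gives them different IDs. This is a sketch under stated assumptions: toy_patch_id() is a whitespace-insensitive FNV-1a hash used as a stand-in and is NOT git's patch-id algorithm, the two diffs are hand-written (no claim is made about which diff algorithm would emit which), and hunk headers are left out for brevity.

```c
#include <assert.h>
#include <stdint.h>

/*
 * FNV-1a over the diff text, skipping whitespace. A stand-in for "hash of
 * the patch content"; the real git patch-id is computed differently.
 */
static uint64_t toy_patch_id(const char *diff)
{
	uint64_t h = 14695981039346656037ULL;	/* FNV-1a 64-bit offset basis */

	for (; *diff; diff++) {
		if (*diff == ' ' || *diff == '\t' || *diff == '\n')
			continue;
		h ^= (unsigned char)*diff;
		h *= 1099511628211ULL;		/* FNV-1a 64-bit prime */
	}
	return h;
}

/* Variant A: the new function is added after f()'s closing brace. */
static const char diff_a[] =
	" void f(void)\n"
	" {\n"
	" }\n"
	"+\n"
	"+void g(void)\n"
	"+{\n"
	"+}\n";

/*
 * Variant B: same old file, same new file, but the added lines are placed
 * before f()'s original closing brace, which now closes g().
 */
static const char diff_b[] =
	" void f(void)\n"
	" {\n"
	"+}\n"
	"+\n"
	"+void g(void)\n"
	"+{\n"
	" }\n";
```

Comparing toy_patch_id(diff_a) with toy_patch_id(diff_b) gives two different values even though applying either diff produces an identical tree, which is the failure mode described above: an index keyed on the ID of the posted patch will not be hit by an ID regenerated from the commit with different settings.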
* Re: [PATCHED][RFC][CFT] mount-related stuff 2025-08-28 0:41 ` Konstantin Ryabitsev @ 2025-08-28 1:00 ` Al Viro 2025-08-28 1:15 ` Konstantin Ryabitsev 2025-08-28 1:29 ` Linus Torvalds 1 sibling, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-28 1:00 UTC (permalink / raw) To: Konstantin Ryabitsev Cc: Linus Torvalds, Christian Brauner, linux-fsdevel, Jan Kara On Wed, Aug 27, 2025 at 08:41:02PM -0400, Konstantin Ryabitsev wrote: > I'm not sure what you mean. The Link: trailer is added when the maintainer > pulls in the series into their tree. It's not put there by the submitter. The > maintainer marks a reliable mapping of "this commit came from this thread" and > we then use this info for multiple purposes: You are overloading the terms here - "pull" as in (basically) git am and "pull" as in git pull and its ilk... And I still don't understand how is that supposed to apply when patches are _developed_ in git branches. In situation when submitter == maintainer. ^ permalink raw reply [flat|nested] 321+ messages in thread
* Re: [PATCHED][RFC][CFT] mount-related stuff 2025-08-28 1:00 ` Al Viro @ 2025-08-28 1:15 ` Konstantin Ryabitsev 0 siblings, 0 replies; 321+ messages in thread From: Konstantin Ryabitsev @ 2025-08-28 1:15 UTC (permalink / raw) To: Al Viro; +Cc: Linus Torvalds, Christian Brauner, linux-fsdevel, Jan Kara On Thu, Aug 28, 2025 at 02:00:17AM +0100, Al Viro wrote: > > I'm not sure what you mean. The Link: trailer is added when the maintainer > > pulls in the series into their tree. It's not put there by the submitter. The > > maintainer marks a reliable mapping of "this commit came from this thread" and > > we then use this info for multiple purposes: > > You are overloading the terms here - "pull" as in (basically) git am and "pull" > as in git pull and its ilk... > > And I still don't understand how is that supposed to apply when patches are > _developed_ in git branches. In situation when submitter == maintainer. Then there's no external provenance, so there is no need for this kind of mapping. You will submit your changes as a pull request and you'll get notified when it's merged (via the PR tracker bot). There is a hybrid workflow as well: - maintainer develops a patch series - maintainer sends it to the list for review - maintainer pulls in the trailers In that case, we don't automatically put provenance trailers into patches, but you can still achieve the same result if instead of merging your local branch you merge the series from the list, but this is more of a corner case scenario. -K ^ permalink raw reply [flat|nested] 321+ messages in thread
* Re: [PATCHED][RFC][CFT] mount-related stuff 2025-08-28 0:41 ` Konstantin Ryabitsev 2025-08-28 1:00 ` Al Viro @ 2025-08-28 1:29 ` Linus Torvalds 2025-08-29 12:30 ` Theodore Ts'o 1 sibling, 1 reply; 321+ messages in thread From: Linus Torvalds @ 2025-08-28 1:29 UTC (permalink / raw) To: Konstantin Ryabitsev; +Cc: Christian Brauner, Al Viro, linux-fsdevel, Jan Kara On Wed, 27 Aug 2025 at 17:41, Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote: > > I'm not sure what you mean. The Link: trailer is added when the maintainer > pulls in the series into their tree. That's my point. Adding it to the commit at that point is entirely useless, because (a) that email doesn't have the *reason* for the patch (or rather, if it does, then the link to the email is pointless, since the *real* reason was mentioned already) (b) at that point clearly it doesn't have any *problems* associated with it either, since if it did, it shouldn't have been included in the first place. So there is absolutely zero information in the link. It's pure pointless noise. > maintainer marks a reliable mapping of "this commit came from this thread" and > > It serves a real purpose. It damn well does not serve any purpose at all, because there is nothing useful there. Your logic isn't logic - it's just empty words. I can come up with tons of "reliable mappings". How about we make the automation add the weather.com report for the weather in Kuala Lumpur when b4 downloads the series? We could do that reliably too. Notice how the reliability of something is entirely irrelevant. Just because you can reliably automate it doesn't make it relevant information. And dammit, it's WORSE than worthless information. I _constantly_ end up being disappointed by those useless links, and I've wasted time following them in the hope of finding something useful. So it's actually reliably NEGATIVE information that wastes peoples time. > We cannot *reliably* map commits to patches. 
What we care about is about things being *USEFUL*. "Reliable" is entirely irrelevant if it's not useful. Because reliable but useless is still useless. And always will be. So I'll take "Useful information that you might not always have", every single time over "Useless, but always there". Get it? Linus ^ permalink raw reply [flat|nested] 321+ messages in thread
* Re: [PATCHED][RFC][CFT] mount-related stuff 2025-08-28 1:29 ` Linus Torvalds @ 2025-08-29 12:30 ` Theodore Ts'o 2025-08-29 18:25 ` Konstantin Ryabitsev 0 siblings, 1 reply; 321+ messages in thread From: Theodore Ts'o @ 2025-08-29 12:30 UTC (permalink / raw) To: Linus Torvalds Cc: Konstantin Ryabitsev, Christian Brauner, Al Viro, linux-fsdevel, Jan Kara On Wed, Aug 27, 2025 at 06:29:50PM -0700, Linus Torvalds wrote: > On Wed, 27 Aug 2025 at 17:41, Konstantin Ryabitsev > <konstantin@linuxfoundation.org> wrote: > > > > I'm not sure what you mean. The Link: trailer is added when the maintainer > > pulls in the series into their tree. > > That's my point. Adding it to the commit at that point is entirely > useless, because > > (a) that email doesn't have the *reason* for the patch (or rather, if > it does, then the link to the email is pointless, since the *real* > reason was mentioned already) From a maintainer's perspective, the reason why I keep the link in is because I'm dumb-ass lazy. My workflow involves looking at patchwork, cutting-and-pasting the Message-Id, and then passing it to b4. Looking through a 20 patch series to figure out which one rates a Link: trailer, and which one doesn't is a pain in the *ss, and in the off-chance that there *is* a meaningful and deep discussion, it would be nice to be able to capture it. But it might be in patch #4; or patch #12; or patches #14 and #15. Also, there might be an extended conversation thread in the patch series description (patch #0) and it would be useful to be able to get a link to it. So here's a set of feature requests for b4. (a) It would be cool(tm), if there was a way for b4 to automatically detect whether or not there was a reply to a patch at the time that "b4 am" is run; if there is, to include the patch series. If there isn't an e-mail reply, skip the Link: trailer. 
(b) In the case of a patch series, it would be useful to include some kind of trailer indicating that a group of patches are logically grouped together (maybe a patch-series: that has the message id to the series header, or the first patch if there is no patch #0) --- because one of the other ways that I figure out that a series of commits are part of a patch series is by looking at the Link: field since if the messages are generated using "git send-email" it's usually obvious from the message id. This has also come up from some of the folks who want this for their web-based review systems. I don't care about that, but if it solves multiple use cases at once, that's great. (c) Include a link tag to the patch series description e-mail message (if present) in the first commit of the patch series so it's possible to read the patch #0 description of the patch series, since otherwise this can be hard to find in the git history. (d) For bonus points, if there is a way to determine a link to the previous versions of the patch series, it would be useful to include link: tags to previous versions of the patch if and only if there were e-mail comments to say, the v5, v12, and v27 versions of the patch. (e) If there is some way we can easily capture the lore.kernel.org URL for the vN-1 version of the patch series in the vN commit description header, in "b4 prep" that would be *excellent*. I don't think it can do this today, but if it can, can we make sure it's defaulted to on, and then we should **really** market the heck out of b4 prep? The bottom line is I'd love to make Linus less cranky; but I'd also love it if I didn't have to do the extra work by hand. :-) Because if I do have to do it by hand, I will probably screw up, and my preference has been to err on the side of having the link, so it's there when I'm having to do code archeology --- even if most of the time it's not strictly speaking necessary. 
Cheers, - Ted ^ permalink raw reply [flat|nested] 321+ messages in thread
* Re: [PATCHED][RFC][CFT] mount-related stuff 2025-08-29 12:30 ` Theodore Ts'o @ 2025-08-29 18:25 ` Konstantin Ryabitsev 0 siblings, 0 replies; 321+ messages in thread From: Konstantin Ryabitsev @ 2025-08-29 18:25 UTC (permalink / raw) To: Theodore Ts'o Cc: Linus Torvalds, Christian Brauner, Al Viro, linux-fsdevel, Jan Kara On Fri, Aug 29, 2025 at 08:30:33AM -0400, Theodore Ts'o wrote: > So here's a set of feature requests for b4. > > (a) It would be cool(tm), if there was a way for b4 to automatically > detect whether or not there was a reply to a patch at the time that > "b4 am" is run; if there is, to include the patch series. If there > isn't an e-mail reply, skip the Link: trailer. I'm afraid this would mostly breed confusion. > (b) In the case of a patch series, it would be useful to include some > kind of trailer indicating that a group of patches are logically > grouped together (maybe a patch-series: that has the message id to > the series header, or the first patch if there is no patch #0) > --- because one of the other ways that I figure out that a series > of commits are part of a patch series is by looking at the Link: > field since if the messages are generated using "git send-email" > it's usually obvious from the message id. This has also come up > from some of the folks who want this for their web-based review > systems. I don't care about that, but if it solves multiple use > cases at once, that's great. This is already in place with the change-id trailer (and the corresponding X-Change-ID email header). However, only b4 puts those in. Series prepared and sent with git-send-email don't have any identifier like that. > (c) Include a link tag to the patch series description e-mail message > (if present) in the first commit of the patch series so it's > possible to read the patch #0 description of the patch series, > since otherwise this can be hard to find in the git history. We're talking about the lore.kernel.org web interface? 
> (d) For bonus points, if there is a way to determine a link to the > previous versions of the patch series, it would be useful to > include link: tags to previous versions of the patch if and only if > there were e-mail comments to say, the v5, v12, and v27 versions > of the patch. Again, are we talking in the context of the lore.kernel.org web interface? The initial discussion about Link: tags was about them being present in git commits. > (e) If there is some way we can easily capture lore.kernel.org URL for > the vN-1 version of the patch series in the vN commit description > header, in "b4 prep" that would be *excellent*. I don't think it > can do this today, but if it can, can we make sure it's defaulted > to on, and then we should **really** market the heck out of b4 > prep? You can do this for any b4-prep sent series by just searching for the change-id string. E.g.: https://lore.kernel.org/lkml/?q=20241018-pmu_event_info-986e21ce6bd3 `b4 prep` is used quite extensively these days, but it's far from being predominant. > The bottom line is I'd love to make Linus less cranky; but I'd also love > it if I didn't have to do the extra work by hand. :-) Because if I do > have to do it by hand, I will probably screw up, and my preference has > been to err on the side of having the link, so it's there when I'm > having to do code archeology --- even if most of the time it's not > strictly speaking necessary. This doesn't ultimately solve the problem that we're butting heads about -- that it's impossible to reliably match a commit to its provenance. Using Link: trailers indicating where the patch came from is the only reliable mechanism we have thus far, because it establishes this relationship unequivocally. However, these links annoy Linus, who would like this to be automated in some other way behind the scenes. 
I'd love to be able to do so, but short of running some kind of "provenance transparency log" of curated commit -> message-id mappings, I don't see how it's possible. -K ^ permalink raw reply [flat|nested] 321+ messages in thread
* [PATCHES v2][RFC][CFT] mount-related stuff 2025-08-25 4:40 [PATCHED][RFC][CFT] mount-related stuff Al Viro ` (2 preceding siblings ...) 2025-08-25 12:43 ` Christian Brauner @ 2025-08-28 23:07 ` Al Viro 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro 2025-09-03 4:54 ` [PATCHES v3][RFC][CFT] mount-related stuff Al Viro 3 siblings, 2 replies; 321+ messages in thread From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw) To: linux-fsdevel; +Cc: Linus Torvalds, Christian Brauner, Jan Kara Branch force-pushed into git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git #work.mount (also visible as #v2.mount, #v1.mount being the previous version) Individual patches in followups. Still -rc3-based, seems to survive local beating. Please, help with review and testing. Note: no links in commits, I still don't understand what kind of use is expected in this situation. Changes since v1 (aside of reviewed-by applied): In #13, #14 and #15 scoped_guard replaced with guard. I don't like it, but I can live with it. Between old #18 and #19: do_new_mount_fc() switched to use of fc_mount(). vfs_get_tree() call moved from the caller into the function itself, unlock + vfs_create_mount() reordered to before the checks in there and collapsed with vfs_get_tree() into a call of fc_mount(). Cleanup aside, that avoids the difference between the lexical scope of mnt and the actual lifetime of that reference. Differs from the variant posted in https://lore.kernel.org/all/20250826182124.GV39973@ZenIV/ only by fixing an obvious braino - fetching fc->root->d_sb should be done after successful fc_mount(), not before it. That change modifies old #25 (now #26) "do_new_mount_rc(): use __free() to deal with dropping mnt on failure". Added to the end of queue: cleanup of populating a new namespace with a tree (open_detached_copy() and copy_mnt_ns()); both end up using guards, BTW. 
5 commits, #54..#58 * open_detached_copy(): don't bother with mount_lock_hash() It's useless there right now - namespace_excl is quite enough. * open_detached_copy(): separate creation of namespace into helper Creation of namespace and opening that FMODE_NEED_UNMOUNT file are better off separated - cleaner that way. * mnt_ns_tree_remove(): DTRT if mnt_ns had never been added to mnt_ns_list Currently it (and free_mnt_ns()) can't be used with non-anon namespace before the insertion into mnt_ns_tree; very easy to make it work in such situation as well - in fact, the old "is it non-anonymous" check is not needed anymore. * copy_mnt_ns(): use the regular mechanism for freeing empty mnt_ns on failure Use the previous patch to avoid weird open-coding of free_mnt_ns(). * copy_mnt_ns(): use guards ... and __free(mntput) for rootmnt/pwdmnt. Added to the end of queue: handling of ->s_mounts/->mnt_instance and mnt_hold_writers(). Each mount is associated with the same dentry (sub)tree of the same filesystem through its entire lifetime. They are allocated empty, then (in the same function that had called allocator) attached to dentry tree and stay like that all the way to destructor (cleanup_mnt()). Unfortunately, as soon as they are attached to a tree, they become reachable from shared data structures - we maintain the set of all mounts associated with given superblock. Having to worry about that while we are still setting them up is inconvenient. Thankfully, the accesses via that set are *very* limited - only sb_prepare_remount_readonly() goes there and the only thing it does to a mount is setting/clearing MNT_WRITE_HOLD and checking the write count (guaranteed to be zero during setup, since there's nobody who could've asked for write access by that point). Turns out it's easy to take MNT_WRITE_HOLD out of ->mnt_flags and basically move it into the same thing that establishes linkage in per-superblock set of mounts. 
That makes accesses via that set isolated from the rest of struct mount; as far as we are concerned, this set is no longer a way to reach the mount from shared data structures and mount remains private to caller until it is explicitly made reachable (by mounting, attaching to overlayfs as a layer, etc.). FWIW, I think we should get rid of the "empty" state of struct mount and have allocator take the root dentry as additional argument. Hadn't done that yet; this series removes the need to delay attaching a partially set up mount to filesystem - we can do that from the very beginning now. 5 commits, #59..#63 * setup_mnt(): primitive for connecting a mount to filesystem Identical logics in clone_mnt() and vfs_create_mount() => common helper * preparations to taking MNT_WRITE_HOLD out of ->mnt_flags Change the representation of set from list_head list to something equivalent to hlist one, with forward linkage going to the entire struct mount rather than embedded hlist_node. * struct mount: relocate MNT_WRITE_HOLD bit Steal the LSB of back links in the set representation to store it. We only traverse the list forwards and all changes are under mount_lock, same as for all mnt_hold_writers()/mnt_unhold_writers() pairs, so it's pretty uncomplicated. * simplify the callers of mnt_unhold_writers() * WRITE_HOLD machinery: no need for to bump mount_lock seqcount The last part is another group of "we only need mount_locked_reader" cases Diffstat: fs/ecryptfs/dentry.c | 14 +- fs/ecryptfs/ecryptfs_kernel.h | 27 +- fs/ecryptfs/file.c | 15 +- fs/ecryptfs/inode.c | 19 +- fs/ecryptfs/main.c | 24 +- fs/internal.h | 4 +- fs/mount.h | 16 +- fs/namespace.c | 989 +++++++++++++++++++----------------------- fs/pnode.c | 75 +++- fs/pnode.h | 1 + fs/super.c | 3 +- include/linux/fs.h | 2 +- include/linux/mount.h | 7 +- kernel/audit_tree.c | 12 +- 14 files changed, 573 insertions(+), 635 deletions(-) ^ permalink raw reply [flat|nested] 321+ messages in thread
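The "steal the LSB of back links" idea from the cover letter above can be illustrated in plain userspace C. This is only a minimal sketch under stated assumptions: all `demo_*` names are hypothetical, and the layout is a generic hlist-like list with a tagged back pointer, not the actual struct mount representation from the series. It relies on pointer-to-pointer values being at least 2-byte aligned, so bit 0 of the stored back link is free to carry a flag.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag kept in bit 0 of the back link. */
#define DEMO_HOLD ((uintptr_t)1)

struct demo_node {
	struct demo_node *next;    /* forward link to the whole node */
	uintptr_t pprev_and_flag;  /* address of the previous "next" slot, bit 0 = flag */
};

/* Back link with the flag bit masked off. */
static struct demo_node **demo_pprev(const struct demo_node *n)
{
	return (struct demo_node **)(n->pprev_and_flag & ~DEMO_HOLD);
}

static int demo_held(const struct demo_node *n)
{
	return !!(n->pprev_and_flag & DEMO_HOLD);
}

static void demo_hold(struct demo_node *n)   { n->pprev_and_flag |= DEMO_HOLD; }
static void demo_unhold(struct demo_node *n) { n->pprev_and_flag &= ~DEMO_HOLD; }

/* Repoint a node's back link while preserving its flag bit. */
static void demo_set_pprev(struct demo_node *n, struct demo_node **pp)
{
	n->pprev_and_flag = (uintptr_t)pp | (n->pprev_and_flag & DEMO_HOLD);
}

/* Insert at the head; the list is only ever walked forwards via ->next. */
static void demo_add_head(struct demo_node **head, struct demo_node *n)
{
	n->next = *head;
	if (n->next)
		demo_set_pprev(n->next, &n->next);
	n->pprev_and_flag = (uintptr_t)head;  /* new node starts with the flag clear */
	*head = n;
}

/* Unlink a node; the neighbour keeps its own flag. */
static void demo_del(struct demo_node *n)
{
	struct demo_node **pp = demo_pprev(n);

	*pp = n->next;
	if (n->next)
		demo_set_pprev(n->next, pp);
	n->next = NULL;
	n->pprev_and_flag = 0;
}
```

In this sketch the flag travels with the set linkage rather than with any other field of the node, which is the property the cover letter is after; in the kernel all such updates would additionally be serialized by mount_lock.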
* [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess 2025-08-28 23:07 ` [PATCHES v2][RFC][CFT] " Al Viro @ 2025-08-28 23:07 ` Al Viro 2025-08-28 23:07 ` [PATCH v2 02/63] introduced guards for mount_lock Al Viro ` (61 more replies) 2025-09-03 4:54 ` [PATCHES v3][RFC][CFT] mount-related stuff Al Viro 1 sibling, 62 replies; 321+ messages in thread From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds If anything, namespace_lock should be DEFINE_LOCK_GUARD_0, not DEFINE_GUARD. That way we * do not need to feed it a bogus argument * do not get gcc trying to compare an address of static in file variable with -4097 - and, if we are unlucky, trying to keep it in a register, with spills and all such. The same problems apply to grabbing namespace_sem shared. Rename it to namespace_excl, add namespace_shared, convert the existing users: guard(namespace_lock, &namespace_sem) => guard(namespace_excl)() guard(rwsem_read, &namespace_sem) => guard(namespace_shared)() scoped_guard(namespace_lock, &namespace_sem) => scoped_guard(namespace_excl) scoped_guard(rwsem_read, &namespace_sem) => scoped_guard(namespace_shared) Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 18 +++++++++++------- 1 file changed, 11 insertions(+), 7 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index ae6d1312b184..fcea65587ff9 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -82,6 +82,12 @@ static LIST_HEAD(ex_mountpoints); /* protected by namespace_sem */ static struct mnt_namespace *emptied_ns; /* protected by namespace_sem */ static DEFINE_SEQLOCK(mnt_ns_tree_lock); +static inline void namespace_lock(void); +static void namespace_unlock(void); +DEFINE_LOCK_GUARD_0(namespace_excl, namespace_lock(), namespace_unlock()) +DEFINE_LOCK_GUARD_0(namespace_shared, down_read(&namespace_sem), + up_read(&namespace_sem)) + #ifdef CONFIG_FSNOTIFY LIST_HEAD(notify_list); 
/* protected by namespace_sem */ #endif @@ -1776,8 +1782,6 @@ static inline void namespace_lock(void) down_write(&namespace_sem); } -DEFINE_GUARD(namespace_lock, struct rw_semaphore *, namespace_lock(), namespace_unlock()) - enum umount_tree_flags { UMOUNT_SYNC = 1, UMOUNT_PROPAGATE = 2, @@ -2306,7 +2310,7 @@ struct path *collect_paths(const struct path *path, struct path *res = prealloc, *to_free = NULL; unsigned n = 0; - guard(rwsem_read)(&namespace_sem); + guard(namespace_shared)(); if (!check_mnt(root)) return ERR_PTR(-EINVAL); @@ -2361,7 +2365,7 @@ void dissolve_on_fput(struct vfsmount *mnt) return; } - scoped_guard(namespace_lock, &namespace_sem) { + scoped_guard(namespace_excl) { if (!anon_ns_root(m)) return; @@ -2435,7 +2439,7 @@ struct vfsmount *clone_private_mount(const struct path *path) struct mount *old_mnt = real_mount(path->mnt); struct mount *new_mnt; - guard(rwsem_read)(&namespace_sem); + guard(namespace_shared)(); if (IS_MNT_UNBINDABLE(old_mnt)) return ERR_PTR(-EINVAL); @@ -5957,7 +5961,7 @@ SYSCALL_DEFINE4(statmount, const struct mnt_id_req __user *, req, if (ret) return ret; - scoped_guard(rwsem_read, &namespace_sem) + scoped_guard(namespace_shared) ret = do_statmount(ks, kreq.mnt_id, kreq.mnt_ns_id, ns); if (!ret) @@ -6079,7 +6083,7 @@ SYSCALL_DEFINE4(listmount, const struct mnt_id_req __user *, req, * We only need to guard against mount topology changes as * listmount() doesn't care about any mount properties. */ - scoped_guard(rwsem_read, &namespace_sem) + scoped_guard(namespace_shared) ret = do_listmount(ns, kreq.mnt_id, last_mnt_id, kmnt_ids, nr_mnt_ids, (flags & LISTMOUNT_REVERSE)); if (ret <= 0) -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
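For readers unfamiliar with cleanup.h-style guards, the behaviour that guard(namespace_excl)() relies on can be sketched in userspace with the GCC/Clang `cleanup` variable attribute. This is a rough illustration only: the `demo_*` names and the `DEMO_GUARD()` macro are hypothetical, and a counter stands in for namespace_sem. The point is that an argument-less guard needs no dummy pointer, and every return path drops the lock.

```c
#include <assert.h>

/* Hypothetical stand-in for a lock: counts how many times it is held. */
static int demo_lock_depth;

static void demo_lock(void)   { demo_lock_depth++; }
static void demo_unlock(void) { demo_lock_depth--; }

/* Called automatically when the guard variable goes out of scope. */
static void demo_guard_exit(int *unused)
{
	(void)unused;
	demo_unlock();
}

/* Argument-less guard, loosely in the spirit of guard(namespace_excl)(). */
#define DEMO_GUARD() \
	int demo_guard_var __attribute__((cleanup(demo_guard_exit), unused)) = \
		(demo_lock(), 0)

/* Returns 1 if needle is in arr; the lock is dropped on every return path. */
static int demo_contains(const int *arr, int n, int needle)
{
	DEMO_GUARD();

	assert(demo_lock_depth == 1);	/* lock is held inside the scope */
	for (int i = 0; i < n; i++)
		if (arr[i] == needle)
			return 1;	/* early return, unlock still runs */
	return 0;
}
```

The early `return 1` is the shape several later conversions in this series take: no `goto out_unlock`, no result variable carried to a common exit.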
* [PATCH v2 02/63] introduced guards for mount_lock 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro @ 2025-08-28 23:07 ` Al Viro 2025-08-29 9:49 ` Christian Brauner 2025-08-28 23:07 ` [PATCH v2 03/63] fs/namespace.c: allow to drop vfsmount references via __free(mntput) Al Viro ` (60 subsequent siblings) 61 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds mount_writer: write_seqlock; that's an equivalent of {un,}lock_mount_hash() mount_locked_reader: read_seqlock_excl; these tend to be open-coded. No bulk conversions, please - if nothing else, quite a few places use the mount_writer form when mount_locked_reader is sufficient. It needs to be dealt with carefully. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/mount.h | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/fs/mount.h b/fs/mount.h index 97737051a8b9..ed8c83ba836a 100644 --- a/fs/mount.h +++ b/fs/mount.h @@ -154,6 +154,11 @@ static inline void get_mnt_ns(struct mnt_namespace *ns) extern seqlock_t mount_lock; +DEFINE_LOCK_GUARD_0(mount_writer, write_seqlock(&mount_lock), + write_sequnlock(&mount_lock)) +DEFINE_LOCK_GUARD_0(mount_locked_reader, read_seqlock_excl(&mount_lock), + read_sequnlock_excl(&mount_lock)) + struct proc_mounts { struct mnt_namespace *ns; struct path root; -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* Re: [PATCH v2 02/63] introduced guards for mount_lock 2025-08-28 23:07 ` [PATCH v2 02/63] introduced guards for mount_lock Al Viro @ 2025-08-29 9:49 ` Christian Brauner 0 siblings, 0 replies; 321+ messages in thread From: Christian Brauner @ 2025-08-29 9:49 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, jack, torvalds On Fri, Aug 29, 2025 at 12:07:05AM +0100, Al Viro wrote: > mount_writer: write_seqlock; that's an equivalent of {un,}lock_mount_hash() > mount_locked_reader: read_seqlock_excl; these tend to be open-coded. > > No bulk conversions, please - if nothing else, quite a few places > use the mount_writer form when mount_locked_reader is sufficient. It needs > to be dealt with carefully. > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > --- Reviewed-by: Christian Brauner <brauner@kernel.org> ^ permalink raw reply [flat|nested] 321+ messages in thread
* [PATCH v2 03/63] fs/namespace.c: allow to drop vfsmount references via __free(mntput) 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro 2025-08-28 23:07 ` [PATCH v2 02/63] introduced guards for mount_lock Al Viro @ 2025-08-28 23:07 ` Al Viro 2025-08-28 23:07 ` [PATCH v2 04/63] __detach_mounts(): use guards Al Viro ` (59 subsequent siblings) 61 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Note that just as path_put, it should never be done in scope of namespace_sem, be it shared or exclusive. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/fs/namespace.c b/fs/namespace.c index fcea65587ff9..767ab751ee2a 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -88,6 +88,8 @@ DEFINE_LOCK_GUARD_0(namespace_excl, namespace_lock(), namespace_unlock()) DEFINE_LOCK_GUARD_0(namespace_shared, down_read(&namespace_sem), up_read(&namespace_sem)) +DEFINE_FREE(mntput, struct vfsmount *, if (!IS_ERR(_T)) mntput(_T)) + #ifdef CONFIG_FSNOTIFY LIST_HEAD(notify_list); /* protected by namespace_sem */ #endif -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
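The `if (!IS_ERR(_T)) mntput(_T)` body of the DEFINE_FREE above matters because the guarded variable may hold an error pointer rather than a real reference. A userspace approximation, assuming GCC/Clang's `cleanup` attribute, might look like the following; every `demo_*` name is hypothetical, and the error-pointer helpers only mimic the kernel's ERR_PTR()/IS_ERR() convention.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical refcounted object standing in for a vfsmount. */
struct demo_obj { int refs; };

/* Error-pointer helpers mimicking the ERR_PTR()/IS_ERR() convention. */
#define DEMO_MAX_ERRNO 4095L
static void *demo_err_ptr(long err) { return (void *)(intptr_t)err; }
static int demo_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)(-DEMO_MAX_ERRNO);
}

/* Drop a reference; NULL is tolerated. */
static void demo_put(struct demo_obj *o)
{
	if (o)
		o->refs--;
}

/* Cleanup handler: skip error pointers, drop real references. */
static void demo_free_put(struct demo_obj **p)
{
	if (!demo_is_err(*p))
		demo_put(*p);
}
#define DEMO_FREE_PUT __attribute__((cleanup(demo_free_put)))

/* Either takes a reference or returns an error pointer (-22). */
static struct demo_obj *demo_get(struct demo_obj *o, int fail)
{
	if (fail)
		return demo_err_ptr(-22);
	o->refs++;
	return o;
}

/* Returns 0 or a negative error; the reference never leaks. */
static long demo_use(struct demo_obj *o, int fail)
{
	struct demo_obj *ref DEMO_FREE_PUT = demo_get(o, fail);

	if (demo_is_err(ref))
		return (long)(intptr_t)ref;	/* cleanup sees an error pointer, does nothing */
	return 0;				/* cleanup drops the reference */
}
```

As the commit message notes, in the kernel the scope of such a variable has to end outside any namespace_sem scope, since the final mntput() must not run under that lock; the sketch does not model that constraint.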
* [PATCH v2 04/63] __detach_mounts(): use guards 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro 2025-08-28 23:07 ` [PATCH v2 02/63] introduced guards for mount_lock Al Viro 2025-08-28 23:07 ` [PATCH v2 03/63] fs/namespace.c: allow to drop vfsmount references via __free(mntput) Al Viro @ 2025-08-28 23:07 ` Al Viro 2025-08-29 9:48 ` Christian Brauner 2025-08-28 23:07 ` [PATCH v2 05/63] __is_local_mountpoint(): " Al Viro ` (58 subsequent siblings) 61 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Clean fit for guards use; guards can't be weaker due to umount_tree() calls. --- fs/namespace.c | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 767ab751ee2a..1ae1ab8815c9 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2032,10 +2032,11 @@ void __detach_mounts(struct dentry *dentry) struct pinned_mountpoint mp = {}; struct mount *mnt; - namespace_lock(); - lock_mount_hash(); + guard(namespace_excl)(); + guard(mount_writer)(); + if (!lookup_mountpoint(dentry, &mp)) - goto out_unlock; + return; event++; while (mp.node.next) { @@ -2047,9 +2048,6 @@ void __detach_mounts(struct dentry *dentry) else umount_tree(mnt, UMOUNT_CONNECTED); } unpin_mountpoint(&mp); -out_unlock: - unlock_mount_hash(); - namespace_unlock(); } /* -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* Re: [PATCH v2 04/63] __detach_mounts(): use guards 2025-08-28 23:07 ` [PATCH v2 04/63] __detach_mounts(): use guards Al Viro @ 2025-08-29 9:48 ` Christian Brauner 0 siblings, 0 replies; 321+ messages in thread From: Christian Brauner @ 2025-08-29 9:48 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, jack, torvalds On Fri, Aug 29, 2025 at 12:07:07AM +0100, Al Viro wrote: > Clean fit for guards use; guards can't be weaker due to umount_tree() calls. > --- Did you drop my earlier RvB by accident? In any case: Reviewed-by: Christian Brauner <brauner@kernel.org> ^ permalink raw reply [flat|nested] 321+ messages in thread
* [PATCH v2 05/63] __is_local_mountpoint(): use guards 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (2 preceding siblings ...) 2025-08-28 23:07 ` [PATCH v2 04/63] __detach_mounts(): use guards Al Viro @ 2025-08-28 23:07 ` Al Viro 2025-08-28 23:07 ` [PATCH v2 06/63] do_change_type(): " Al Viro ` (57 subsequent siblings) 61 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds clean fit; namespace_shared due to iterating through ns->mounts. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 15 ++++++--------- 1 file changed, 6 insertions(+), 9 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 1ae1ab8815c9..f1460ddd1486 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -906,17 +906,14 @@ bool __is_local_mountpoint(const struct dentry *dentry) { struct mnt_namespace *ns = current->nsproxy->mnt_ns; struct mount *mnt, *n; - bool is_covered = false; - down_read(&namespace_sem); - rbtree_postorder_for_each_entry_safe(mnt, n, &ns->mounts, mnt_node) { - is_covered = (mnt->mnt_mountpoint == dentry); - if (is_covered) - break; - } - up_read(&namespace_sem); + guard(namespace_shared)(); + + rbtree_postorder_for_each_entry_safe(mnt, n, &ns->mounts, mnt_node) + if (mnt->mnt_mountpoint == dentry) + return true; - return is_covered; + return false; } struct pinned_mountpoint { -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v2 06/63] do_change_type(): use guards 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (3 preceding siblings ...) 2025-08-28 23:07 ` [PATCH v2 05/63] __is_local_mountpoint(): " Al Viro @ 2025-08-28 23:07 ` Al Viro 2025-08-28 23:07 ` [PATCH v2 07/63] do_set_group(): " Al Viro ` (56 subsequent siblings) 61 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds clean fit; namespace_excl to modify propagation graph Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index f1460ddd1486..a6a7b068770a 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2899,7 +2899,7 @@ static int do_change_type(struct path *path, int ms_flags) struct mount *mnt = real_mount(path->mnt); int recurse = ms_flags & MS_REC; int type; - int err = 0; + int err; if (!path_mounted(path)) return -EINVAL; @@ -2908,23 +2908,22 @@ static int do_change_type(struct path *path, int ms_flags) if (!type) return -EINVAL; - namespace_lock(); + guard(namespace_excl)(); + err = may_change_propagation(mnt); if (err) - goto out_unlock; + return err; if (type == MS_SHARED) { err = invent_group_ids(mnt, recurse); if (err) - goto out_unlock; + return err; } for (m = mnt; m; m = (recurse ? next_mnt(m, mnt) : NULL)) change_mnt_propagation(m, type); - out_unlock: - namespace_unlock(); - return err; + return 0; } /* may_copy_tree() - check if a mount tree can be copied -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v2 07/63] do_set_group(): use guards 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (4 preceding siblings ...) 2025-08-28 23:07 ` [PATCH v2 06/63] do_change_type(): " Al Viro @ 2025-08-28 23:07 ` Al Viro 2025-08-28 23:07 ` [PATCH v2 08/63] mark_mounts_for_expiry(): " Al Viro ` (55 subsequent siblings) 61 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds clean fit; namespace_excl to modify propagation graph Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 33 +++++++++++++-------------------- 1 file changed, 13 insertions(+), 20 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index a6a7b068770a..13e2f3837a26 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3349,47 +3349,44 @@ static inline int tree_contains_unbindable(struct mount *mnt) static int do_set_group(struct path *from_path, struct path *to_path) { - struct mount *from, *to; + struct mount *from = real_mount(from_path->mnt); + struct mount *to = real_mount(to_path->mnt); int err; - from = real_mount(from_path->mnt); - to = real_mount(to_path->mnt); - - namespace_lock(); + guard(namespace_excl)(); err = may_change_propagation(from); if (err) - goto out; + return err; err = may_change_propagation(to); if (err) - goto out; + return err; - err = -EINVAL; /* To and From paths should be mount roots */ if (!path_mounted(from_path)) - goto out; + return -EINVAL; if (!path_mounted(to_path)) - goto out; + return -EINVAL; /* Setting sharing groups is only allowed across same superblock */ if (from->mnt.mnt_sb != to->mnt.mnt_sb) - goto out; + return -EINVAL; /* From mount root should be wider than To mount root */ if (!is_subdir(to->mnt.mnt_root, from->mnt.mnt_root)) - goto out; + return -EINVAL; /* From mount should not have locked children in place of To's root */ if 
(__has_locked_children(from, to->mnt.mnt_root)) - goto out; + return -EINVAL; /* Setting sharing groups is only allowed on private mounts */ if (IS_MNT_SHARED(to) || IS_MNT_SLAVE(to)) - goto out; + return -EINVAL; /* From should not be private */ if (!IS_MNT_SHARED(from) && !IS_MNT_SLAVE(from)) - goto out; + return -EINVAL; if (IS_MNT_SLAVE(from)) { hlist_add_behind(&to->mnt_slave, &from->mnt_slave); @@ -3401,11 +3398,7 @@ static int do_set_group(struct path *from_path, struct path *to_path) list_add(&to->mnt_share, &from->mnt_share); set_mnt_shared(to); } - - err = 0; -out: - namespace_unlock(); - return err; + return 0; } /** -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v2 08/63] mark_mounts_for_expiry(): use guards 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (5 preceding siblings ...) 2025-08-28 23:07 ` [PATCH v2 07/63] do_set_group(): " Al Viro @ 2025-08-28 23:07 ` Al Viro 2025-08-28 23:07 ` [PATCH v2 09/63] put_mnt_ns(): " Al Viro ` (54 subsequent siblings) 61 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Clean fit; guards can't be weaker due to umount_tree() calls. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 13e2f3837a26..898a6b7307e4 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3886,8 +3886,8 @@ void mark_mounts_for_expiry(struct list_head *mounts) if (list_empty(mounts)) return; - namespace_lock(); - lock_mount_hash(); + guard(namespace_excl)(); + guard(mount_writer)(); /* extract from the expiration list every vfsmount that matches the * following criteria: @@ -3909,8 +3909,6 @@ void mark_mounts_for_expiry(struct list_head *mounts) touch_mnt_namespace(mnt->mnt_ns); umount_tree(mnt, UMOUNT_PROPAGATE|UMOUNT_SYNC); } - unlock_mount_hash(); - namespace_unlock(); } EXPORT_SYMBOL_GPL(mark_mounts_for_expiry); -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v2 09/63] put_mnt_ns(): use guards 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (6 preceding siblings ...) 2025-08-28 23:07 ` [PATCH v2 08/63] mark_mounts_for_expiry(): " Al Viro @ 2025-08-28 23:07 ` Al Viro 2025-08-28 23:07 ` [PATCH v2 10/63] mnt_already_visible(): " Al Viro ` (53 subsequent siblings) 61 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds clean fit; guards can't be weaker due to umount_tree() call. Setting emptied_ns requires namespace_excl, but not anything mount_lock-related. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 898a6b7307e4..86a86be2b0ef 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -6153,12 +6153,10 @@ void put_mnt_ns(struct mnt_namespace *ns) { if (!refcount_dec_and_test(&ns->ns.count)) return; - namespace_lock(); + guard(namespace_excl)(); emptied_ns = ns; - lock_mount_hash(); + guard(mount_writer)(); umount_tree(ns->root, 0); - unlock_mount_hash(); - namespace_unlock(); } struct vfsmount *kern_mount(struct file_system_type *type) -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
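The put_mnt_ns() conversion above declares two guards in one scope and depends on them being released in the reverse order of acquisition, matching the old unlock_mount_hash() followed by namespace_unlock(). A small userspace sketch of that ordering, assuming GCC/Clang's `cleanup` attribute (which runs handlers last-defined-first) and using hypothetical `demo_*` names:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical event log: records lock and unlock order as characters. */
static char demo_log[32];

static void demo_note(const char *s)
{
	if (strlen(demo_log) + strlen(s) < sizeof(demo_log))
		strcat(demo_log, s);
}

/* Cleanup handlers for the outer and inner "locks". */
static void demo_outer_exit(int *unused) { (void)unused; demo_note("o"); }
static void demo_inner_exit(int *unused) { (void)unused; demo_note("i"); }

/* Takes the outer lock, then the inner one, then does the work. */
static void demo_nested(void)
{
	int outer __attribute__((cleanup(demo_outer_exit), unused)) =
		(demo_note("O"), 0);
	int inner __attribute__((cleanup(demo_inner_exit), unused)) =
		(demo_note("I"), 0);

	demo_note("-");	/* work done with both held */
}
```

Here "O" and "I" mark acquisition and "o" and "i" mark release. Statements placed between the two guard declarations, like the `emptied_ns = ns` assignment in the patch, run with only the outer lock held.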
* [PATCH v2 10/63] mnt_already_visible(): use guards 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (7 preceding siblings ...) 2025-08-28 23:07 ` [PATCH v2 09/63] put_mnt_ns(): " Al Viro @ 2025-08-28 23:07 ` Al Viro 2025-08-28 23:07 ` [PATCH v2 11/63] check_for_nsfs_mounts(): no need to take locks Al Viro ` (52 subsequent siblings) 61 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds clean fit; namespace_shared due to iterating through ns->mounts. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 10 +++------- 1 file changed, 3 insertions(+), 7 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 86a86be2b0ef..a5d37b97088f 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -6232,9 +6232,8 @@ static bool mnt_already_visible(struct mnt_namespace *ns, { int new_flags = *new_mnt_flags; struct mount *mnt, *n; - bool visible = false; - down_read(&namespace_sem); + guard(namespace_shared)(); rbtree_postorder_for_each_entry_safe(mnt, n, &ns->mounts, mnt_node) { struct mount *child; int mnt_flags; @@ -6281,13 +6280,10 @@ static bool mnt_already_visible(struct mnt_namespace *ns, /* Preserve the locked attributes */ *new_mnt_flags |= mnt_flags & (MNT_LOCK_READONLY | \ MNT_LOCK_ATIME); - visible = true; - goto found; + return true; next: ; } -found: - up_read(&namespace_sem); - return visible; + return false; } static bool mount_too_revealing(const struct super_block *sb, int *new_mnt_flags) -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
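Editorial sketch for 10/63: the conversion above lets every return path drop the lock, which is why the 'visible' flag and the 'found' label can go. The following userspace toy shows that effect, assuming only the GCC/Clang cleanup variable attribute. All demo_* names are hypothetical and the counter-based "semaphore" is not the kernel's rwsem; this is not the kernel's guard() implementation.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy reader count standing in for namespace_sem (hypothetical). */
struct demo_rwsem {
	int readers;	/* readers currently inside */
	int releases;	/* number of automatic releases so far */
};

static struct demo_rwsem *demo_down_read(struct demo_rwsem *s)
{
	s->readers++;
	return s;
}

/* Cleanup handlers receive a pointer to the annotated variable. */
static void demo_up_read(struct demo_rwsem **s)
{
	(*s)->readers--;
	(*s)->releases++;
}

/* Take the toy lock now; release it when the enclosing scope is left. */
#define DEMO_GUARD_SHARED(s) \
	struct demo_rwsem *demo_guard_ \
		__attribute__((cleanup(demo_up_read))) = demo_down_read(s)

/* Both returns release the lock; no flag variable, no goto. */
static bool demo_contains(struct demo_rwsem *s, const int *v, size_t n, int key)
{
	DEMO_GUARD_SHARED(s);
	for (size_t i = 0; i < n; i++)
		if (v[i] == key)
			return true;
	return false;
}
```

The return value is computed before the cleanup handler runs, so the search result is still produced under the lock.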
* [PATCH v2 11/63] check_for_nsfs_mounts(): no need to take locks 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (8 preceding siblings ...) 2025-08-28 23:07 ` [PATCH v2 10/63] mnt_already_visible(): " Al Viro @ 2025-08-28 23:07 ` Al Viro 2025-08-28 23:07 ` [PATCH v2 12/63] propagate_mnt(): use scoped_guard(mount_locked_reader) for mnt_set_mountpoint() Al Viro ` (51 subsequent siblings) 61 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Currently we are taking mount_writer; what that function needs is either mount_locked_reader (we are not changing anything, we just want to iterate through the subtree) or namespace_shared and a reference held by caller on the root of subtree - that's also enough to stabilize the topology. The thing is, all callers are already holding at least namespace_shared as well as a reference to the root of subtree. Let's make the callers provide locking warranties - don't mess with mount_lock in check_for_nsfs_mounts() itself and document the locking requirements. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 16 +++++----------- 1 file changed, 5 insertions(+), 11 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index a5d37b97088f..59948cbf9c47 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2402,21 +2402,15 @@ bool has_locked_children(struct mount *mnt, struct dentry *dentry) * specified subtree. Such references can act as pins for mount namespaces * that aren't checked by the mount-cycle checking code, thereby allowing * cycles to be made. 
+ * + * locks: mount_locked_reader || namespace_shared && pinned(subtree) */ static bool check_for_nsfs_mounts(struct mount *subtree) { - struct mount *p; - bool ret = false; - - lock_mount_hash(); - for (p = subtree; p; p = next_mnt(p, subtree)) + for (struct mount *p = subtree; p; p = next_mnt(p, subtree)) if (mnt_ns_loop(p->mnt.mnt_root)) - goto out; - - ret = true; -out: - unlock_mount_hash(); - return ret; + return false; + return true; } /** -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v2 12/63] propagate_mnt(): use scoped_guard(mount_locked_reader) for mnt_set_mountpoint() 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (9 preceding siblings ...) 2025-08-28 23:07 ` [PATCH v2 11/63] check_for_nsfs_mounts(): no need to take locks Al Viro @ 2025-08-28 23:07 ` Al Viro 2025-08-28 23:07 ` [PATCH v2 13/63] has_locked_children(): use guards Al Viro ` (50 subsequent siblings) 61 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/pnode.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/fs/pnode.c b/fs/pnode.c index 6f7d02f3fa98..0702d45d856d 100644 --- a/fs/pnode.c +++ b/fs/pnode.c @@ -304,9 +304,8 @@ int propagate_mnt(struct mount *dest_mnt, struct mountpoint *dest_mp, err = PTR_ERR(this); break; } - read_seqlock_excl(&mount_lock); - mnt_set_mountpoint(n, dest_mp, this); - read_sequnlock_excl(&mount_lock); + scoped_guard(mount_locked_reader) + mnt_set_mountpoint(n, dest_mp, this); if (n->mnt_master) SET_MNT_MARK(n->mnt_master); copy = this; -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
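Editorial sketch for 12/63: scoped_guard() limits the lock to the statement that follows it instead of the rest of the function. One way to get that shape in plain C is a for loop that runs exactly once, with a cleanup-annotated variable in its init clause. The toy below assumes the GCC/Clang cleanup attribute; the demo_* names are hypothetical and the macro is an illustration, not the kernel's.

```c
#include <stddef.h>

/* Toy lock: 'held' is 1 while inside the scope (hypothetical). */
struct demo_lock {
	int held;
	int acquisitions;
};

static struct demo_lock *demo_lock_acquire(struct demo_lock *l)
{
	l->held = 1;
	l->acquisitions++;
	return l;
}

static void demo_lock_release(struct demo_lock **l)
{
	(*l)->held = 0;
}

/*
 * Run the following statement once with the lock held.  The lock is
 * dropped when the loop variable goes out of scope.
 */
#define DEMO_SCOPED(l) \
	for (struct demo_lock *scope_ \
		__attribute__((cleanup(demo_lock_release))) = \
			demo_lock_acquire(l), *once_ = scope_; \
	     once_; once_ = NULL)

/* Returns 1 if the store happened under the lock and the lock is free after. */
static int demo_set(struct demo_lock *l, int *slot, int value)
{
	int held_during = 0;

	DEMO_SCOPED(l) {
		*slot = value;
		held_during = l->held;
	}
	/* the lock has already been dropped here */
	return held_during && !l->held;
}
```

Code after the scoped statement, like the SET_MNT_MARK() call in the patch, runs without the lock.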
* [PATCH v2 13/63] has_locked_children(): use guards 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (10 preceding siblings ...) 2025-08-28 23:07 ` [PATCH v2 12/63] propagate_mnt(): use scoped_guard(mount_locked_reader) for mnt_set_mountpoint() Al Viro @ 2025-08-28 23:07 ` Al Viro 2025-08-29 9:49 ` Christian Brauner 2025-08-28 23:07 ` [PATCH v2 14/63] mnt_set_expiry(): " Al Viro ` (49 subsequent siblings) 61 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds ... and document the locking requirements of __has_locked_children() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 59948cbf9c47..2cb3cb8307ca 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2373,6 +2373,7 @@ void dissolve_on_fput(struct vfsmount *mnt) } } +/* locks: namespace_shared && pinned(mnt) || mount_locked_reader */ static bool __has_locked_children(struct mount *mnt, struct dentry *dentry) { struct mount *child; @@ -2389,12 +2390,8 @@ static bool __has_locked_children(struct mount *mnt, struct dentry *dentry) bool has_locked_children(struct mount *mnt, struct dentry *dentry) { - bool res; - - read_seqlock_excl(&mount_lock); - res = __has_locked_children(mnt, dentry); - read_sequnlock_excl(&mount_lock); - return res; + guard(mount_locked_reader)(); + return __has_locked_children(mnt, dentry); } /* -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* Re: [PATCH v2 13/63] has_locked_children(): use guards 2025-08-28 23:07 ` [PATCH v2 13/63] has_locked_children(): use guards Al Viro @ 2025-08-29 9:49 ` Christian Brauner 0 siblings, 0 replies; 321+ messages in thread From: Christian Brauner @ 2025-08-29 9:49 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, jack, torvalds On Fri, Aug 29, 2025 at 12:07:16AM +0100, Al Viro wrote: > ... and document the locking requirements of __has_locked_children() > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > --- Reviewed-by: Christian Brauner <brauner@kernel.org> ^ permalink raw reply [flat|nested] 321+ messages in thread
* [PATCH v2 14/63] mnt_set_expiry(): use guards 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (11 preceding siblings ...) 2025-08-28 23:07 ` [PATCH v2 13/63] has_locked_children(): use guards Al Viro @ 2025-08-28 23:07 ` Al Viro 2025-08-29 9:49 ` Christian Brauner 2025-08-28 23:07 ` [PATCH v2 15/63] path_is_under(): " Al Viro ` (48 subsequent siblings) 61 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds The reason why it needs only mount_locked_reader is that there's no lockless accesses of expiry lists. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 2cb3cb8307ca..db25c81d7f68 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3858,9 +3858,8 @@ int finish_automount(struct vfsmount *m, const struct path *path) */ void mnt_set_expiry(struct vfsmount *mnt, struct list_head *expiry_list) { - read_seqlock_excl(&mount_lock); + guard(mount_locked_reader)(); list_add_tail(&real_mount(mnt)->mnt_expire, expiry_list); - read_sequnlock_excl(&mount_lock); } EXPORT_SYMBOL(mnt_set_expiry); -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* Re: [PATCH v2 14/63] mnt_set_expiry(): use guards 2025-08-28 23:07 ` [PATCH v2 14/63] mnt_set_expiry(): " Al Viro @ 2025-08-29 9:49 ` Christian Brauner 0 siblings, 0 replies; 321+ messages in thread From: Christian Brauner @ 2025-08-29 9:49 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, jack, torvalds On Fri, Aug 29, 2025 at 12:07:17AM +0100, Al Viro wrote: > The reason why it needs only mount_locked_reader is that there's no lockless > accesses of expiry lists. > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > --- Reviewed-by: Christian Brauner <brauner@kernel.org> ^ permalink raw reply [flat|nested] 321+ messages in thread
* [PATCH v2 15/63] path_is_under(): use guards 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (12 preceding siblings ...) 2025-08-28 23:07 ` [PATCH v2 14/63] mnt_set_expiry(): " Al Viro @ 2025-08-28 23:07 ` Al Viro 2025-08-28 23:07 ` [PATCH v2 16/63] current_chrooted(): don't bother with follow_down_one() Al Viro ` (47 subsequent siblings) 61 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds ... and document the locking requirements for is_path_reachable(). There is one questionable caller in do_listmount() where we are not holding mount_lock *and* might not have the first argument mounted. However, in that case it will immediately return true without having to look at the ancestors. Might be cleaner to move the check into non-LSMT_ROOT case which it really belongs in - there the check is not always true and is_mounted() is guaranteed. Document the locking environments for is_path_reachable() callers: get_peer_under_root() get_dominating_id() do_statmount() do_listmount() Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 11 +++++------ fs/pnode.c | 3 ++- 2 files changed, 7 insertions(+), 7 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index db25c81d7f68..6aabf0045389 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -4592,7 +4592,7 @@ SYSCALL_DEFINE5(move_mount, /* * Return true if path is reachable from root * - * namespace_sem or mount_lock is held + * locks: mount_locked_reader || namespace_shared && is_mounted(mnt) */ bool is_path_reachable(struct mount *mnt, struct dentry *dentry, const struct path *root) @@ -4606,11 +4606,8 @@ bool is_path_reachable(struct mount *mnt, struct dentry *dentry, bool path_is_under(const struct path *path1, const struct path *path2) { - bool res; - read_seqlock_excl(&mount_lock); - res =
is_path_reachable(real_mount(path1->mnt), path1->dentry, path2); - read_sequnlock_excl(&mount_lock); - return res; + guard(mount_locked_reader)(); + return is_path_reachable(real_mount(path1->mnt), path1->dentry, path2); } EXPORT_SYMBOL(path_is_under); @@ -5689,6 +5686,7 @@ static int grab_requested_root(struct mnt_namespace *ns, struct path *root) STATMOUNT_MNT_UIDMAP | \ STATMOUNT_MNT_GIDMAP) +/* locks: namespace_shared */ static int do_statmount(struct kstatmount *s, u64 mnt_id, u64 mnt_ns_id, struct mnt_namespace *ns) { @@ -5949,6 +5947,7 @@ SYSCALL_DEFINE4(statmount, const struct mnt_id_req __user *, req, return ret; } +/* locks: namespace_shared */ static ssize_t do_listmount(struct mnt_namespace *ns, u64 mnt_parent_id, u64 last_mnt_id, u64 *mnt_ids, size_t nr_mnt_ids, bool reverse) diff --git a/fs/pnode.c b/fs/pnode.c index 0702d45d856d..edaf9d9d0eaf 100644 --- a/fs/pnode.c +++ b/fs/pnode.c @@ -29,6 +29,7 @@ static inline struct mount *next_slave(struct mount *p) return hlist_entry(p->mnt_slave.next, struct mount, mnt_slave); } +/* locks: namespace_shared && is_mounted(mnt) */ static struct mount *get_peer_under_root(struct mount *mnt, struct mnt_namespace *ns, const struct path *root) @@ -50,7 +51,7 @@ static struct mount *get_peer_under_root(struct mount *mnt, * Get ID of closest dominating peer group having a representative * under the given root. * - * Caller must hold namespace_sem + * locks: namespace_shared */ int get_dominating_id(struct mount *mnt, const struct path *root) { -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v2 16/63] current_chrooted(): don't bother with follow_down_one() 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (13 preceding siblings ...) 2025-08-28 23:07 ` [PATCH v2 15/63] path_is_under(): " Al Viro @ 2025-08-28 23:07 ` Al Viro 2025-08-28 23:07 ` [PATCH v2 17/63] current_chrooted(): use guards Al Viro ` (46 subsequent siblings) 61 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds All we need here is to follow ->overmount on root mount of namespace... Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 18 ++++++++---------- 1 file changed, 8 insertions(+), 10 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 6aabf0045389..cf680fbf015e 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -6194,24 +6194,22 @@ bool our_mnt(struct vfsmount *mnt) bool current_chrooted(void) { /* Does the current process have a non-standard root */ - struct path ns_root; + struct mount *root = current->nsproxy->mnt_ns->root; struct path fs_root; bool chrooted; + get_fs_root(current->fs, &fs_root); + /* Find the namespace root */ - ns_root.mnt = &current->nsproxy->mnt_ns->root->mnt; - ns_root.dentry = ns_root.mnt->mnt_root; - path_get(&ns_root); - while (d_mountpoint(ns_root.dentry) && follow_down_one(&ns_root)) - ; + read_seqlock_excl(&mount_lock); - get_fs_root(current->fs, &fs_root); + while (unlikely(root->overmount)) + root = root->overmount; - chrooted = !path_equal(&fs_root, &ns_root); + chrooted = fs_root.mnt != &root->mnt || !path_mounted(&fs_root); + read_sequnlock_excl(&mount_lock); path_put(&fs_root); - path_put(&ns_root); - return chrooted; } -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v2 17/63] current_chrooted(): use guards 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (14 preceding siblings ...) 2025-08-28 23:07 ` [PATCH v2 16/63] current_chrooted(): don't bother with follow_down_one() Al Viro @ 2025-08-28 23:07 ` Al Viro 2025-08-28 23:07 ` [PATCH v2 18/63] switch do_new_mount_fc() to fc_mount() Al Viro ` (45 subsequent siblings) 61 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds here a use of __free(path_put) for dropping fs_root is enough to make guard(mount_locked_reader) fit... Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 15 ++++++--------- 1 file changed, 6 insertions(+), 9 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index cf680fbf015e..0474b3a93dbf 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -6194,23 +6194,20 @@ bool our_mnt(struct vfsmount *mnt) bool current_chrooted(void) { /* Does the current process have a non-standard root */ - struct mount *root = current->nsproxy->mnt_ns->root; - struct path fs_root; - bool chrooted; + struct path fs_root __free(path_put) = {}; + struct mount *root; get_fs_root(current->fs, &fs_root); /* Find the namespace root */ - read_seqlock_excl(&mount_lock); + guard(mount_locked_reader)(); + + root = current->nsproxy->mnt_ns->root; while (unlikely(root->overmount)) root = root->overmount; - chrooted = fs_root.mnt != &root->mnt || !path_mounted(&fs_root); - - read_sequnlock_excl(&mount_lock); - path_put(&fs_root); - return chrooted; + return fs_root.mnt != &root->mnt || !path_mounted(&fs_root); } static bool mnt_already_visible(struct mnt_namespace *ns, -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
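Editorial sketch for 17/63: the reason __free(path_put) on fs_root makes the guard fit is ordering. Cleanup handlers run in reverse order of declaration, so a reference declared before the guard is dropped only after the lock has been released. The toy below records that order, assuming the GCC/Clang cleanup attribute; the demo_* names are hypothetical, with 'P' standing in for path_put() and 'U' for dropping mount_lock.

```c
#include <stddef.h>

static char demo_log[8];	/* order in which the handlers ran */
static size_t demo_len;

static void demo_note(char c)
{
	if (demo_len < sizeof(demo_log) - 1)
		demo_log[demo_len++] = c;
}

/* Stands in for path_put() on fs_root. */
static void demo_put_ref(int *ref)
{
	(void)ref;
	demo_note('P');
}

/* Stands in for dropping mount_lock. */
static void demo_unlock(int *lock)
{
	(void)lock;
	demo_note('U');
}

static int demo_is_chrooted(int fs_root, int ns_root)
{
	/* declared first, so cleaned up last */
	int ref __attribute__((cleanup(demo_put_ref))) = 0;

	ref = fs_root;		/* like get_fs_root() filling the path */

	/* declared second, so cleaned up first */
	int lock __attribute__((cleanup(demo_unlock))) = 1;

	(void)lock;
	return ref != ns_root;	/* evaluated before either handler runs */
}
```

After one call the log reads 'U' then 'P': the unlock happens before the reference is dropped, so the put never runs inside the locked scope.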
* [PATCH v2 18/63] switch do_new_mount_fc() to fc_mount() 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (15 preceding siblings ...) 2025-08-28 23:07 ` [PATCH v2 17/63] current_chrooted(): use guards Al Viro @ 2025-08-28 23:07 ` Al Viro 2025-08-29 9:53 ` Christian Brauner 2025-08-28 23:07 ` [PATCH v2 19/63] do_move_mount(): trim local variables Al Viro ` (44 subsequent siblings) 61 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Prior to the call of do_new_mount_fc() the caller has just done successful vfs_get_tree(). Then do_new_mount_fc() does several checks on resulting superblock, and either does fc_drop_locked() and returns an error or proceeds to unlock the superblock and call vfs_create_mount(). The thing is, there's no reason to delay that unlock + vfs_create_mount() - the tests do not rely upon the state of ->s_umount and fc_drop_locked() put_fs_context() is equivalent to unlock ->s_umount put_fs_context() Doing vfs_create_mount() before the checks allows us to move vfs_get_tree() from caller to do_new_mount_fc() and collapse it with vfs_create_mount() into an fc_mount() call. 
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 29 ++++++++++++----------------- 1 file changed, 12 insertions(+), 17 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 0474b3a93dbf..9b575c9eee0b 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3705,25 +3705,20 @@ static bool mount_too_revealing(const struct super_block *sb, int *new_mnt_flags static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint, unsigned int mnt_flags) { - struct vfsmount *mnt; struct pinned_mountpoint mp = {}; - struct super_block *sb = fc->root->d_sb; + struct super_block *sb; + struct vfsmount *mnt = fc_mount(fc); int error; + if (IS_ERR(mnt)) + return PTR_ERR(mnt); + + sb = fc->root->d_sb; error = security_sb_kern_mount(sb); if (!error && mount_too_revealing(sb, &mnt_flags)) error = -EPERM; - - if (unlikely(error)) { - fc_drop_locked(fc); - return error; - } - - up_write(&sb->s_umount); - - mnt = vfs_create_mount(fc); - if (IS_ERR(mnt)) - return PTR_ERR(mnt); + if (unlikely(error)) + goto out; mnt_warn_timestamp_expiry(mountpoint, mnt); @@ -3731,10 +3726,12 @@ static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint, if (!error) { error = do_add_mount(real_mount(mnt), mp.mp, mountpoint, mnt_flags); + if (!error) + mnt = NULL; // consumed on success unlock_mount(&mp); } - if (error < 0) - mntput(mnt); +out: + mntput(mnt); return error; } @@ -3788,8 +3785,6 @@ static int do_new_mount(struct path *path, const char *fstype, int sb_flags, err = parse_monolithic_mount_data(fc, data); if (!err && !mount_capable(fc)) err = -EPERM; - if (!err) - err = vfs_get_tree(fc); if (!err) err = do_new_mount_fc(fc, path, mnt_flags); -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
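Editorial sketch for 18/63: the "mnt = NULL; // consumed on success" line lets one unconditional mntput() serve both outcomes, since putting NULL does nothing. The toy below shows that ownership handoff in isolation. The demo_* names are hypothetical; demo_put() stands in for a put helper that ignores NULL and demo_attach() for an operation that takes over the reference only when it succeeds.

```c
#include <stddef.h>

static int demo_puts;	/* how many real references were dropped */

/* Put helper that ignores NULL. */
static void demo_put(int *obj)
{
	if (obj)
		demo_puts++;
}

/* Takes ownership of @obj on success (returns 0), leaves it alone on error. */
static int demo_attach(int *obj, int **slot, int fail)
{
	if (fail)
		return -1;
	*slot = obj;
	return 0;
}

static int demo_install(int *obj, int **slot, int fail)
{
	int error = demo_attach(obj, slot, fail);

	if (!error)
		obj = NULL;	/* consumed on success */
	demo_put(obj);		/* drops the reference only on failure */
	return error;
}
```

The single exit path is also what a later switch to a __free()-style annotation would need: the pointer is either still owned at scope exit or has been cleared.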
* Re: [PATCH v2 18/63] switch do_new_mount_fc() to fc_mount() 2025-08-28 23:07 ` [PATCH v2 18/63] switch do_new_mount_fc() to fc_mount() Al Viro @ 2025-08-29 9:53 ` Christian Brauner 0 siblings, 0 replies; 321+ messages in thread From: Christian Brauner @ 2025-08-29 9:53 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, jack, torvalds On Fri, Aug 29, 2025 at 12:07:21AM +0100, Al Viro wrote: > Prior to the call of do_new_mount_fc() the caller has just done successful > vfs_get_tree(). Then do_new_mount_fc() does several checks on resulting > superblock, and either does fc_drop_locked() and returns an error or > proceeds to unlock the superblock and call vfs_create_mount(). > > The thing is, there's no reason to delay that unlock + vfs_create_mount() - > the tests do not rely upon the state of ->s_umount and > fc_drop_locked() > put_fs_context() > is equivalent to > unlock ->s_umount > put_fs_context() > > Doing vfs_create_mount() before the checks allows us to move vfs_get_tree() > from caller to do_new_mount_fc() and collapse it with vfs_create_mount() > into an fc_mount() call. 
> > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > --- Reviewed-by: Christian Brauner <brauner@kernel.org> > fs/namespace.c | 29 ++++++++++++----------------- > 1 file changed, 12 insertions(+), 17 deletions(-) > > diff --git a/fs/namespace.c b/fs/namespace.c > index 0474b3a93dbf..9b575c9eee0b 100644 > --- a/fs/namespace.c > +++ b/fs/namespace.c > @@ -3705,25 +3705,20 @@ static bool mount_too_revealing(const struct super_block *sb, int *new_mnt_flags > static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint, > unsigned int mnt_flags) > { > - struct vfsmount *mnt; > struct pinned_mountpoint mp = {}; > - struct super_block *sb = fc->root->d_sb; > + struct super_block *sb; > + struct vfsmount *mnt = fc_mount(fc); > int error; > > + if (IS_ERR(mnt)) > + return PTR_ERR(mnt); Fwiw, I find this pattern where the variable is assigned by function call at declaration time in the middle of other variables and then immediately further below check for the error to be rather ugly. I'd much rather just do: + struct vfsmount *mnt; int error; mnt = fc_mount(fc) + if (IS_ERR(mnt)) + return PTR_ERR(mnt); But anyway, I acknowledge the difference in taste here is really not that important. ^ permalink raw reply [flat|nested] 321+ messages in thread
* [PATCH v2 19/63] do_move_mount(): trim local variables 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (16 preceding siblings ...) 2025-08-28 23:07 ` [PATCH v2 18/63] switch do_new_mount_fc() to fc_mount() Al Viro @ 2025-08-28 23:07 ` Al Viro 2025-08-28 23:07 ` [PATCH v2 20/63] do_move_mount(): deal with the checks on old_path early Al Viro ` (43 subsequent siblings) 61 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Both 'parent' and 'ns' are used at most once, no point precalculating those... Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 12 ++++-------- 1 file changed, 4 insertions(+), 8 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 9b575c9eee0b..ad9b5687ff15 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3564,10 +3564,8 @@ static inline bool may_use_mount(struct mount *mnt) static int do_move_mount(struct path *old_path, struct path *new_path, enum mnt_tree_flags_t flags) { - struct mnt_namespace *ns; struct mount *p; struct mount *old; - struct mount *parent; struct pinned_mountpoint mp; int err; bool beneath = flags & MNT_TREE_BENEATH; @@ -3578,8 +3576,6 @@ static int do_move_mount(struct path *old_path, old = real_mount(old_path->mnt); p = real_mount(new_path->mnt); - parent = old->mnt_parent; - ns = old->mnt_ns; err = -EINVAL; @@ -3588,12 +3584,12 @@ static int do_move_mount(struct path *old_path, /* ... it should be detachable from parent */ if (!mnt_has_parent(old) || IS_MNT_LOCKED(old)) goto out; + /* ... which should not be shared */ + if (IS_MNT_SHARED(old->mnt_parent)) + goto out; /* ... 
and the target should be in our namespace */ if (!check_mnt(p)) goto out; - /* parent of the source should not be shared */ - if (IS_MNT_SHARED(parent)) - goto out; } else { /* * otherwise the source must be the root of some anon namespace. @@ -3605,7 +3601,7 @@ static int do_move_mount(struct path *old_path, * subsequent checks would've rejected that, but they lose * some corner cases if we check it early. */ - if (ns == p->mnt_ns) + if (old->mnt_ns == p->mnt_ns) goto out; /* * Target should be either in our namespace or in an acceptable -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v2 20/63] do_move_mount(): deal with the checks on old_path early 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (17 preceding siblings ...) 2025-08-28 23:07 ` [PATCH v2 19/63] do_move_mount(): trim local variables Al Viro @ 2025-08-28 23:07 ` Al Viro 2025-08-28 23:07 ` [PATCH v2 21/63] move_mount(2): take sanity checks in 'beneath' case into do_lock_mount() Al Viro ` (42 subsequent siblings) 61 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds 1) checking that location we want to move does point to root of some mount can be done before anything else; that property is not going to change and having it already verified simplifies the analysis. 2) checking the type agreement between what we are trying to move and what we are trying to move it onto also belongs in the very beginning - do_lock_mount() might end up switching new_path to something that overmounts the original location, but... the same type agreement applies to overmounts, so we could just as well check against the original location. 3) since we know that old_path->dentry is the root of old_path->mnt, there's no point bothering with path_is_overmounted() in can_move_mount_beneath(); it's simply a check for the mount we are trying to move having non-NULL ->overmount. And with that, we can switch can_move_mount_beneath() to taking old instead of old_path, leaving no uses of old_path past the original checks. 
Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 29 +++++++++++++---------------- 1 file changed, 13 insertions(+), 16 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index ad9b5687ff15..74c67ea1b5a8 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3433,7 +3433,7 @@ static bool mount_is_ancestor(const struct mount *p1, const struct mount *p2) /** * can_move_mount_beneath - check that we can mount beneath the top mount - * @from: mount to mount beneath + * @mnt_from: mount we are trying to move * @to: mount under which to mount * @mp: mountpoint of @to * @@ -3443,7 +3443,7 @@ static bool mount_is_ancestor(const struct mount *p1, const struct mount *p2) * root or the rootfs of the namespace. * - Make sure that the caller can unmount the topmost mount ensuring * that the caller could reveal the underlying mountpoint. - * - Ensure that nothing has been mounted on top of @from before we + * - Ensure that nothing has been mounted on top of @mnt_from before we * grabbed @namespace_sem to avoid creating pointless shadow mounts. * - Prevent mounting beneath a mount if the propagation relationship * between the source mount, parent mount, and top mount would lead to @@ -3452,12 +3452,11 @@ static bool mount_is_ancestor(const struct mount *p1, const struct mount *p2) * Context: This function expects namespace_lock() to be held. * Return: On success 0, and on error a negative error code is returned. 
*/ -static int can_move_mount_beneath(const struct path *from, +static int can_move_mount_beneath(struct mount *mnt_from, const struct path *to, const struct mountpoint *mp) { - struct mount *mnt_from = real_mount(from->mnt), - *mnt_to = real_mount(to->mnt), + struct mount *mnt_to = real_mount(to->mnt), *parent_mnt_to = mnt_to->mnt_parent; if (!mnt_has_parent(mnt_to)) @@ -3470,7 +3469,7 @@ static int can_move_mount_beneath(const struct path *from, return -EINVAL; /* Avoid creating shadow mounts during mount propagation. */ - if (path_overmounted(from)) + if (mnt_from->overmount) return -EINVAL; /* @@ -3565,16 +3564,21 @@ static int do_move_mount(struct path *old_path, struct path *new_path, enum mnt_tree_flags_t flags) { struct mount *p; - struct mount *old; + struct mount *old = real_mount(old_path->mnt); struct pinned_mountpoint mp; int err; bool beneath = flags & MNT_TREE_BENEATH; + if (!path_mounted(old_path)) + return -EINVAL; + + if (d_is_dir(new_path->dentry) != d_is_dir(old_path->dentry)) + return -EINVAL; + err = do_lock_mount(new_path, &mp, beneath); if (err) return err; - old = real_mount(old_path->mnt); p = real_mount(new_path->mnt); err = -EINVAL; @@ -3611,15 +3615,8 @@ static int do_move_mount(struct path *old_path, goto out; } - if (!path_mounted(old_path)) - goto out; - - if (d_is_dir(new_path->dentry) != - d_is_dir(old_path->dentry)) - goto out; - if (beneath) { - err = can_move_mount_beneath(old_path, new_path, mp.mp); + err = can_move_mount_beneath(old, new_path, mp.mp); if (err) goto out; -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v2 21/63] move_mount(2): take sanity checks in 'beneath' case into do_lock_mount() 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (18 preceding siblings ...) 2025-08-28 23:07 ` [PATCH v2 20/63] do_move_mount(): deal with the checks on old_path early Al Viro @ 2025-08-28 23:07 ` Al Viro 2025-08-28 23:07 ` [PATCH v2 22/63] finish_automount(): simplify the ELOOP check Al Viro ` (41 subsequent siblings) 61 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds We want to mount beneath the given location. For that operation to make sense, location must be the root of some mount that has something under it. Currently we let it proceed if those requirements are not met, with rather meaningless results, and have that bogosity caught further down the road; let's fail early instead - do_lock_mount() doesn't make sense unless those conditions hold, and checking them there makes things simpler. 
Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 15 +++++++-------- 1 file changed, 7 insertions(+), 8 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 74c67ea1b5a8..86c6dd432b13 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2768,12 +2768,19 @@ static int do_lock_mount(struct path *path, struct pinned_mountpoint *pinned, bo struct path under = {}; int err = -ENOENT; + if (unlikely(beneath) && !path_mounted(path)) + return -EINVAL; + for (;;) { struct mount *m = real_mount(mnt); if (beneath) { path_put(&under); read_seqlock_excl(&mount_lock); + if (unlikely(!mnt_has_parent(m))) { + read_sequnlock_excl(&mount_lock); + return -EINVAL; + } under.mnt = mntget(&m->mnt_parent->mnt); under.dentry = dget(m->mnt_mountpoint); read_sequnlock_excl(&mount_lock); @@ -3437,8 +3444,6 @@ static bool mount_is_ancestor(const struct mount *p1, const struct mount *p2) * @to: mount under which to mount * @mp: mountpoint of @to * - * - Make sure that @to->dentry is actually the root of a mount under - * which we can mount another mount. * - Make sure that nothing can be mounted beneath the caller's current * root or the rootfs of the namespace. * - Make sure that the caller can unmount the topmost mount ensuring @@ -3459,12 +3464,6 @@ static int can_move_mount_beneath(struct mount *mnt_from, struct mount *mnt_to = real_mount(to->mnt), *parent_mnt_to = mnt_to->mnt_parent; - if (!mnt_has_parent(mnt_to)) - return -EINVAL; - - if (!path_mounted(to)) - return -EINVAL; - if (IS_MNT_LOCKED(mnt_to)) return -EINVAL; -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v2 22/63] finish_automount(): simplify the ELOOP check 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (19 preceding siblings ...) 2025-08-28 23:07 ` [PATCH v2 21/63] move_mount(2): take sanity checks in 'beneath' case into do_lock_mount() Al Viro @ 2025-08-28 23:07 ` Al Viro 2025-08-28 23:07 ` [PATCH v2 23/63] do_loopback(): use __free(path_put) to deal with old_path Al Viro ` (40 subsequent siblings) 61 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds It's enough to check that dentries match; if path->dentry is equal to m->mnt_root, superblocks will match as well. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 86c6dd432b13..bdb33270ac6e 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3798,8 +3798,7 @@ int finish_automount(struct vfsmount *m, const struct path *path) mnt = real_mount(m); - if (m->mnt_sb == path->mnt->mnt_sb && - m->mnt_root == dentry) { + if (m->mnt_root == path->dentry) { err = -ELOOP; goto discard; } -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v2 23/63] do_loopback(): use __free(path_put) to deal with old_path 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (20 preceding siblings ...) 2025-08-28 23:07 ` [PATCH v2 22/63] finish_automount(): simplify the ELOOP check Al Viro @ 2025-08-28 23:07 ` Al Viro 2025-08-28 23:07 ` [PATCH v2 24/63] pivot_root(2): use __free() to deal with struct path in it Al Viro ` (39 subsequent siblings) 61 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds preparations for making unlock_mount() a __cleanup(); can't have path_put() inside mount_lock scope. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index bdb33270ac6e..245cf2d19a6b 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3014,7 +3014,7 @@ static struct mount *__do_loopback(struct path *old_path, int recurse) static int do_loopback(struct path *path, const char *old_name, int recurse) { - struct path old_path; + struct path old_path __free(path_put) = {}; struct mount *mnt = NULL, *parent; struct pinned_mountpoint mp = {}; int err; @@ -3024,13 +3024,12 @@ static int do_loopback(struct path *path, const char *old_name, if (err) return err; - err = -EINVAL; if (mnt_ns_loop(old_path.dentry)) - goto out; + return -EINVAL; err = lock_mount(path, &mp); if (err) - goto out; + return err; parent = real_mount(path->mnt); if (!check_mnt(parent)) @@ -3050,8 +3049,6 @@ static int do_loopback(struct path *path, const char *old_name, } out2: unlock_mount(&mp); -out: - path_put(&old_path); return err; } -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
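For readers less familiar with the pattern used in this patch: __free(path_put) builds on the compiler's cleanup variable attribute, which runs a handler when the annotated variable goes out of scope. Below is a minimal userspace sketch of the same shape, assuming GCC or Clang. Everything in it (fake_path, fake_path_put(), lookup(), loopback_like(), the __free_fake_path macro) is a hypothetical stand-in for illustration, not kernel API.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct path: it only tracks one reference. */
struct fake_path {
	int *refcount;		/* NULL means "nothing held" */
};

/* Like path_put() on an empty path, this is a no-op when nothing is held. */
static void fake_path_put(struct fake_path *p)
{
	if (p->refcount)
		(*p->refcount)--;
}

/* Userspace analogue of __free(path_put): call fake_path_put() at scope exit. */
#define __free_fake_path __attribute__((cleanup(fake_path_put)))

/* Hypothetical lookup: grabs one reference and stores it in *p. */
static int lookup(struct fake_path *p, int *refcount)
{
	(*refcount)++;
	p->refcount = refcount;
	return 0;
}

/*
 * Shaped like the converted do_loopback(): every early return drops
 * the reference automatically, so no "out:" label is needed.
 */
static int loopback_like(int *refcount, int fail_check)
{
	struct fake_path old_path __free_fake_path = { NULL };
	int err;

	err = lookup(&old_path, refcount);
	if (err)
		return err;
	if (fail_check)
		return -EINVAL;	/* reference dropped on the way out */
	return 0;		/* ... and here as well */
}
```

The point of initializing the variable to an empty value is that the handler runs on every exit path, including those taken before the lookup has succeeded.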
* [PATCH v2 24/63] pivot_root(2): use __free() to deal with struct path in it 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (21 preceding siblings ...) 2025-08-28 23:07 ` [PATCH v2 23/63] do_loopback(): use __free(path_put) to deal with old_path Al Viro @ 2025-08-28 23:07 ` Al Viro 2025-08-28 23:07 ` [PATCH v2 25/63] finish_automount(): take the lock_mount() analogue into a helper Al Viro ` (38 subsequent siblings) 61 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds preparations for making unlock_mount() a __cleanup(); can't have path_put() inside mount_lock scope. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 19 +++++++------------ 1 file changed, 7 insertions(+), 12 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 245cf2d19a6b..90b62ee882da 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -4622,7 +4622,9 @@ EXPORT_SYMBOL(path_is_under); SYSCALL_DEFINE2(pivot_root, const char __user *, new_root, const char __user *, put_old) { - struct path new, old, root; + struct path new __free(path_put) = {}; + struct path old __free(path_put) = {}; + struct path root __free(path_put) = {}; struct mount *new_mnt, *root_mnt, *old_mnt, *root_parent, *ex_parent; struct pinned_mountpoint old_mp = {}; int error; @@ -4633,21 +4635,21 @@ SYSCALL_DEFINE2(pivot_root, const char __user *, new_root, error = user_path_at(AT_FDCWD, new_root, LOOKUP_FOLLOW | LOOKUP_DIRECTORY, &new); if (error) - goto out0; + return error; error = user_path_at(AT_FDCWD, put_old, LOOKUP_FOLLOW | LOOKUP_DIRECTORY, &old); if (error) - goto out1; + return error; error = security_sb_pivotroot(&old, &new); if (error) - goto out2; + return error; get_fs_root(current->fs, &root); error = lock_mount(&old, &old_mp); if (error) - goto out3; + return error; error = -EINVAL; new_mnt 
= real_mount(new.mnt); @@ -4705,13 +4707,6 @@ SYSCALL_DEFINE2(pivot_root, const char __user *, new_root, error = 0; out4: unlock_mount(&old_mp); -out3: - path_put(&root); -out2: - path_put(&old); -out1: - path_put(&new); -out0: return error; } -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v2 25/63] finish_automount(): take the lock_mount() analogue into a helper 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (22 preceding siblings ...) 2025-08-28 23:07 ` [PATCH v2 24/63] pivot_root(2): use __free() to deal with struct path in it Al Viro @ 2025-08-28 23:07 ` Al Viro 2025-08-28 23:07 ` [PATCH v2 26/63] do_new_mount_rc(): use __free() to deal with dropping mnt on failure Al Viro ` (37 subsequent siblings) 61 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds finish_automount() can't use lock_mount() - it treats finding something already mounted as "quietly drop our mount and return 0", not as "mount on top of whatever is mounted there". It's been open-coded; let's take it into a helper similar to lock_mount(). "something's already mounted" => -EBUSY, finish_automount() needs to distinguish it from the normal case and it can't happen in other failure cases.
Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 42 +++++++++++++++++++++++++----------------- 1 file changed, 25 insertions(+), 17 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 90b62ee882da..6251ee15f5f6 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3781,9 +3781,29 @@ static int do_new_mount(struct path *path, const char *fstype, int sb_flags, return err; } -int finish_automount(struct vfsmount *m, const struct path *path) +static int lock_mount_exact(const struct path *path, + struct pinned_mountpoint *mp) { struct dentry *dentry = path->dentry; + int err; + + inode_lock(dentry->d_inode); + namespace_lock(); + if (unlikely(cant_mount(dentry))) + err = -ENOENT; + else if (path_overmounted(path)) + err = -EBUSY; + else + err = get_mountpoint(dentry, mp); + if (unlikely(err)) { + namespace_unlock(); + inode_unlock(dentry->d_inode); + } + return err; +} + +int finish_automount(struct vfsmount *m, const struct path *path) +{ struct pinned_mountpoint mp = {}; struct mount *mnt; int err; @@ -3805,20 +3825,11 @@ int finish_automount(struct vfsmount *m, const struct path *path) * that overmounts our mountpoint to be means "quitely drop what we've * got", not "try to mount it on top". */ - inode_lock(dentry->d_inode); - namespace_lock(); - if (unlikely(cant_mount(dentry))) { - err = -ENOENT; - goto discard_locked; - } - if (path_overmounted(path)) { - err = 0; - goto discard_locked; + err = lock_mount_exact(path, &mp); + if (unlikely(err)) { + mntput(m); + return err == -EBUSY ? 
0 : err; } - err = get_mountpoint(dentry, &mp); - if (err) - goto discard_locked; - err = do_add_mount(mnt, mp.mp, path, path->mnt->mnt_flags | MNT_SHRINKABLE); unlock_mount(&mp); @@ -3826,9 +3837,6 @@ int finish_automount(struct vfsmount *m, const struct path *path) goto discard; return 0; -discard_locked: - namespace_unlock(); - inode_unlock(dentry->d_inode); discard: mntput(m); return err; -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
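The error-mapping convention described in this patch can be sketched in userspace as follows. This is only an illustration under stated assumptions: fake_location, lock_exact() and automount_like() are hypothetical names, the locks are modelled as a counter, and the get_mountpoint() failure path of the real helper is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the state the helper inspects. */
struct fake_location {
	bool dead;		/* models cant_mount(dentry) */
	bool overmounted;	/* models path_overmounted(path) */
	int locks_held;		/* models inode lock + namespace_sem */
};

/* Returns 0 with both locks held, or a negative errno with both dropped. */
static int lock_exact(struct fake_location *loc)
{
	int err = 0;

	loc->locks_held = 2;
	if (loc->dead)
		err = -ENOENT;
	else if (loc->overmounted)
		err = -EBUSY;	/* the only source of -EBUSY */
	if (err)
		loc->locks_held = 0;
	return err;
}

/* Caller in the style of finish_automount(): -EBUSY is not a failure. */
static int automount_like(struct fake_location *loc)
{
	int err = lock_exact(loc);

	if (err)
		return err == -EBUSY ? 0 : err;
	/* ... the new mount would be attached here ... */
	loc->locks_held = 0;	/* analogue of unlock_mount() */
	return 0;
}
```

Because -EBUSY can only come from the "already overmounted" branch, the caller can translate it to success without risking that some other failure gets swallowed.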
* [PATCH v2 26/63] do_new_mount_rc(): use __free() to deal with dropping mnt on failure 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (23 preceding siblings ...) 2025-08-28 23:07 ` [PATCH v2 25/63] finish_automount(): take the lock_mount() analogue into a helper Al Viro @ 2025-08-28 23:07 ` Al Viro 2025-09-01 11:34 ` Christian Brauner 2025-08-28 23:07 ` [PATCH v2 27/63] finish_automount(): " Al Viro ` (36 subsequent siblings) 61 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds do_add_mount() consumes vfsmount on success; just follow it with conditional retain_and_null_ptr() on success and we can switch to __free() for mnt and be done with that - unlock_mount() is in the very end. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 6251ee15f5f6..3551e51461a2 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3696,7 +3696,7 @@ static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint, { struct pinned_mountpoint mp = {}; struct super_block *sb; - struct vfsmount *mnt = fc_mount(fc); + struct vfsmount *mnt __free(mntput) = fc_mount(fc); int error; if (IS_ERR(mnt)) @@ -3704,10 +3704,11 @@ static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint, sb = fc->root->d_sb; error = security_sb_kern_mount(sb); - if (!error && mount_too_revealing(sb, &mnt_flags)) - error = -EPERM; if (unlikely(error)) - goto out; + return error; + + if (unlikely(mount_too_revealing(sb, &mnt_flags))) + return -EPERM; mnt_warn_timestamp_expiry(mountpoint, mnt); @@ -3716,11 +3717,9 @@ static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint, error = do_add_mount(real_mount(mnt), mp.mp, mountpoint, mnt_flags); if (!error) - mnt = NULL; // consumed on success + 
retain_and_null_ptr(mnt); // consumed on success unlock_mount(&mp); } -out: - mntput(mnt); return error; } -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
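The "consumed on success" idiom from this patch, reduced to a userspace sketch (GCC or Clang assumed). The names fake_mnt, fake_mntput(), cleanup_fake_mnt(), add_mount_like(), new_mount_like() and the retain_and_null() macro are hypothetical; the macro only imitates the effect of the kernel's retain_and_null_ptr() and is not its definition.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical refcounted object standing in for a vfsmount. */
struct fake_mnt {
	int refs;
};

static void fake_mntput(struct fake_mnt *m)
{
	if (m)
		m->refs--;
}

/* Cleanup handlers receive a pointer to the annotated variable. */
static void cleanup_fake_mnt(struct fake_mnt **p)
{
	fake_mntput(*p);
}

/* Imitation of retain_and_null_ptr(): keep the object, disarm the cleanup. */
#define retain_and_null(p) ((void)((p) = NULL))

/* On success the reference is handed over to *tree. */
static int add_mount_like(struct fake_mnt *m, struct fake_mnt **tree, int fail)
{
	if (fail)
		return -EINVAL;
	*tree = m;
	return 0;
}

static int new_mount_like(struct fake_mnt *fresh, struct fake_mnt **tree,
			  int fail)
{
	struct fake_mnt *mnt __attribute__((cleanup(cleanup_fake_mnt))) = fresh;
	int error;

	error = add_mount_like(mnt, tree, fail);
	if (!error)
		retain_and_null(mnt);	/* consumed on success */
	return error;			/* on failure the cleanup drops mnt */
}
```

On the failure path the handler drops the reference; on the success path it sees NULL and does nothing, so the reference stays with whoever consumed it.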
* Re: [PATCH v2 26/63] do_new_mount_rc(): use __free() to deal with dropping mnt on failure 2025-08-28 23:07 ` [PATCH v2 26/63] do_new_mount_rc(): use __free() to deal with dropping mnt on failure Al Viro @ 2025-09-01 11:34 ` Christian Brauner 0 siblings, 0 replies; 321+ messages in thread From: Christian Brauner @ 2025-09-01 11:34 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, jack, torvalds On Fri, Aug 29, 2025 at 12:07:29AM +0100, Al Viro wrote: > do_add_mount() consumes vfsmount on success; just follow it with > conditional retain_and_null_ptr() on success and we can switch > to __free() for mnt and be done with that - unlock_mount() is > in the very end. > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > --- Reviewed-by: Christian Brauner <brauner@kernel.org> ^ permalink raw reply [flat|nested] 321+ messages in thread
* [PATCH v2 27/63] finish_automount(): use __free() to deal with dropping mnt on failure 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (24 preceding siblings ...) 2025-08-28 23:07 ` [PATCH v2 26/63] do_new_mount_rc(): use __free() to deal with dropping mnt on failure Al Viro @ 2025-08-28 23:07 ` Al Viro 2025-08-28 23:07 ` [PATCH v2 28/63] change calling conventions for lock_mount() et.al Al Viro ` (35 subsequent siblings) 61 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds same story as with do_new_mount_fc(). Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 22 ++++++++-------------- 1 file changed, 8 insertions(+), 14 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 3551e51461a2..779cfed04291 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3801,8 +3801,9 @@ static int lock_mount_exact(const struct path *path, return err; } -int finish_automount(struct vfsmount *m, const struct path *path) +int finish_automount(struct vfsmount *__m, const struct path *path) { + struct vfsmount *m __free(mntput) = __m; struct pinned_mountpoint mp = {}; struct mount *mnt; int err; @@ -3814,10 +3815,8 @@ int finish_automount(struct vfsmount *m, const struct path *path) mnt = real_mount(m); - if (m->mnt_root == path->dentry) { - err = -ELOOP; - goto discard; - } + if (m->mnt_root == path->dentry) + return -ELOOP; /* * we don't want to use lock_mount() - in this case finding something @@ -3825,19 +3824,14 @@ int finish_automount(struct vfsmount *m, const struct path *path) * got", not "try to mount it on top". */ err = lock_mount_exact(path, &mp); - if (unlikely(err)) { - mntput(m); + if (unlikely(err)) return err == -EBUSY ? 
0 : err; - } + err = do_add_mount(mnt, mp.mp, path, path->mnt->mnt_flags | MNT_SHRINKABLE); + if (likely(!err)) + retain_and_null_ptr(m); unlock_mount(&mp); - if (unlikely(err)) - goto discard; - return 0; - -discard: - mntput(m); return err; } -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v2 28/63] change calling conventions for lock_mount() et.al. 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (25 preceding siblings ...) 2025-08-28 23:07 ` [PATCH v2 27/63] finish_automount(): " Al Viro @ 2025-08-28 23:07 ` Al Viro 2025-09-01 11:37 ` Christian Brauner 2025-08-28 23:07 ` [PATCH v2 29/63] do_move_mount(): use the parent mount returned by do_lock_mount() Al Viro ` (34 subsequent siblings) 61 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds

1) pinned_mountpoint gets a new member - struct mount *parent.
Set only if we locked the sucker; ERR_PTR() - on failed attempt.

2) do_lock_mount() et.al. return void and set ->parent to
	* on success with !beneath - mount corresponding to path->mnt
	* on success with beneath - the parent of mount corresponding to path->mnt
	* in case of error - ERR_PTR(-E...).
IOW, we get the mount we will be actually mounting upon or ERR_PTR().

3) we can't use CLASS, since the pinned_mountpoint is placed on
hlist during initialization, so we define local macros:
	LOCK_MOUNT(mp, path)
	LOCK_MOUNT_MAYBE_BENEATH(mp, path, beneath)
	LOCK_MOUNT_EXACT(mp, path)
All of them declare and initialize struct pinned_mountpoint mp,
with unlock_mount done via __cleanup().

Users converted.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 219 ++++++++++++++++++++++++------------------------- 1 file changed, 108 insertions(+), 111 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 779cfed04291..952e66bdb9bb 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -919,6 +919,7 @@ bool __is_local_mountpoint(const struct dentry *dentry) struct pinned_mountpoint { struct hlist_node node; struct mountpoint *mp; + struct mount *parent; }; static bool lookup_mountpoint(struct dentry *dentry, struct pinned_mountpoint *m) @@ -2728,48 +2729,47 @@ static int attach_recursive_mnt(struct mount *source_mnt, } /** - * do_lock_mount - lock mount and mountpoint - * @path: target path - * @beneath: whether the intention is to mount beneath @path + * do_lock_mount - acquire environment for mounting + * @path: target path + * @res: context to set up + * @beneath: whether the intention is to mount beneath @path * - * Follow the mount stack on @path until the top mount @mnt is found. If - * the initial @path->{mnt,dentry} is a mountpoint lookup the first - * mount stacked on top of it. Then simply follow @{mnt,mnt->mnt_root} - * until nothing is stacked on top of it anymore. + * To mount something at given location, we need + * namespace_sem locked exclusive + * inode of dentry we are mounting on locked exclusive + * struct mountpoint for that dentry + * struct mount we are mounting on * - * Acquire the inode_lock() on the top mount's ->mnt_root to protect - * against concurrent removal of the new mountpoint from another mount - * namespace. + * Results are stored in caller-supplied context (pinned_mountpoint); + * on success we have res->parent and res->mp pointing to parent and + * mountpoint respectively and res->node inserted into the ->m_list + * of the mountpoint, making sure the mountpoint won't disappear. + * On failure we have res->parent set to ERR_PTR(-E...), res->mp + * left NULL, res->node - empty. 
+ * In case of success do_lock_mount returns with locks acquired (in + * proper order - inode lock nests outside of namespace_sem). * - * If @beneath is requested, acquire inode_lock() on @mnt's mountpoint - * @mp on @mnt->mnt_parent must be acquired. This protects against a - * concurrent unlink of @mp->mnt_dentry from another mount namespace - * where @mnt doesn't have a child mount mounted @mp. A concurrent - * removal of @mnt->mnt_root doesn't matter as nothing will be mounted - * on top of it for @beneath. + * Request to mount on overmounted location is treated as "mount on + * top of whatever's overmounting it"; request to mount beneath + * a location - "mount immediately beneath the topmost mount at that + * place". * - * In addition, @beneath needs to make sure that @mnt hasn't been - * unmounted or moved from its current mountpoint in between dropping - * @mount_lock and acquiring @namespace_sem. For the !@beneath case @mnt - * being unmounted would be detected later by e.g., calling - * check_mnt(mnt) in the function it's called from. For the @beneath - * case however, it's useful to detect it directly in do_lock_mount(). - * If @mnt hasn't been unmounted then @mnt->mnt_mountpoint still points - * to @mnt->mnt_mp->m_dentry. But if @mnt has been unmounted it will - * point to @mnt->mnt_root and @mnt->mnt_mp will be NULL. - * - * Return: Either the target mountpoint on the top mount or the top - * mount's mountpoint. + * In all cases the location must not have been unmounted and the + * chosen mountpoint must be allowed to be mounted on. For "beneath" + * case we also require the location to be at the root of a mount + * that has a parent (i.e. is not a root of some namespace). 
*/ -static int do_lock_mount(struct path *path, struct pinned_mountpoint *pinned, bool beneath) +static void do_lock_mount(struct path *path, struct pinned_mountpoint *res, bool beneath) { struct vfsmount *mnt = path->mnt; struct dentry *dentry; struct path under = {}; int err = -ENOENT; - if (unlikely(beneath) && !path_mounted(path)) - return -EINVAL; + if (unlikely(beneath) && !path_mounted(path)) { + res->parent = ERR_PTR(-EINVAL); + return; + } for (;;) { struct mount *m = real_mount(mnt); @@ -2779,7 +2779,8 @@ static int do_lock_mount(struct path *path, struct pinned_mountpoint *pinned, bo read_seqlock_excl(&mount_lock); if (unlikely(!mnt_has_parent(m))) { read_sequnlock_excl(&mount_lock); - return -EINVAL; + res->parent = ERR_PTR(-EINVAL); + return; } under.mnt = mntget(&m->mnt_parent->mnt); under.dentry = dget(m->mnt_mountpoint); @@ -2811,7 +2812,7 @@ static int do_lock_mount(struct path *path, struct pinned_mountpoint *pinned, bo path->dentry = dget(mnt->mnt_root); continue; // got overmounted } - err = get_mountpoint(dentry, pinned); + err = get_mountpoint(dentry, res); if (err) break; if (beneath) { @@ -2822,22 +2823,25 @@ static int do_lock_mount(struct path *path, struct pinned_mountpoint *pinned, bo * we are not dropping the final references here). 
*/ path_put(&under); + res->parent = real_mount(path->mnt)->mnt_parent; + return; } - return 0; + res->parent = real_mount(path->mnt); + return; } namespace_unlock(); inode_unlock(dentry->d_inode); if (beneath) path_put(&under); - return err; + res->parent = ERR_PTR(err); } -static inline int lock_mount(struct path *path, struct pinned_mountpoint *m) +static inline void lock_mount(struct path *path, struct pinned_mountpoint *m) { - return do_lock_mount(path, m, false); + do_lock_mount(path, m, false); } -static void unlock_mount(struct pinned_mountpoint *m) +static void __unlock_mount(struct pinned_mountpoint *m) { inode_unlock(m->mp->m_dentry->d_inode); read_seqlock_excl(&mount_lock); @@ -2846,6 +2850,20 @@ static void unlock_mount(struct pinned_mountpoint *m) namespace_unlock(); } +static inline void unlock_mount(struct pinned_mountpoint *m) +{ + if (!IS_ERR(m->parent)) + __unlock_mount(m); +} + +#define LOCK_MOUNT_MAYBE_BENEATH(mp, path, beneath) \ + struct pinned_mountpoint mp __cleanup(unlock_mount) = {}; \ + do_lock_mount((path), &mp, (beneath)) +#define LOCK_MOUNT(mp, path) LOCK_MOUNT_MAYBE_BENEATH(mp, (path), false) +#define LOCK_MOUNT_EXACT(mp, path) \ + struct pinned_mountpoint mp __cleanup(unlock_mount) = {}; \ + lock_mount_exact((path), &mp) + static int graft_tree(struct mount *mnt, struct mount *p, struct mountpoint *mp) { if (mnt->mnt.mnt_sb->s_flags & SB_NOUSER) @@ -3015,8 +3033,7 @@ static int do_loopback(struct path *path, const char *old_name, int recurse) { struct path old_path __free(path_put) = {}; - struct mount *mnt = NULL, *parent; - struct pinned_mountpoint mp = {}; + struct mount *mnt = NULL; int err; if (!old_name || !*old_name) return -EINVAL; @@ -3027,28 +3044,23 @@ static int do_loopback(struct path *path, const char *old_name, if (mnt_ns_loop(old_path.dentry)) return -EINVAL; - err = lock_mount(path, &mp); - if (err) - return err; + LOCK_MOUNT(mp, path); + if (IS_ERR(mp.parent)) + return PTR_ERR(mp.parent); - parent = 
real_mount(path->mnt); - if (!check_mnt(parent)) - goto out2; + if (!check_mnt(mp.parent)) + return -EINVAL; mnt = __do_loopback(&old_path, recurse); - if (IS_ERR(mnt)) { - err = PTR_ERR(mnt); - goto out2; - } + if (IS_ERR(mnt)) + return PTR_ERR(mnt); - err = graft_tree(mnt, parent, mp.mp); + err = graft_tree(mnt, mp.parent, mp.mp); if (err) { lock_mount_hash(); umount_tree(mnt, UMOUNT_SYNC); unlock_mount_hash(); } -out2: - unlock_mount(&mp); return err; } @@ -3561,7 +3573,6 @@ static int do_move_mount(struct path *old_path, { struct mount *p; struct mount *old = real_mount(old_path->mnt); - struct pinned_mountpoint mp; int err; bool beneath = flags & MNT_TREE_BENEATH; @@ -3571,52 +3582,49 @@ static int do_move_mount(struct path *old_path, if (d_is_dir(new_path->dentry) != d_is_dir(old_path->dentry)) return -EINVAL; - err = do_lock_mount(new_path, &mp, beneath); - if (err) - return err; + LOCK_MOUNT_MAYBE_BENEATH(mp, new_path, beneath); + if (IS_ERR(mp.parent)) + return PTR_ERR(mp.parent); p = real_mount(new_path->mnt); - err = -EINVAL; - if (check_mnt(old)) { /* if the source is in our namespace... */ /* ... it should be detachable from parent */ if (!mnt_has_parent(old) || IS_MNT_LOCKED(old)) - goto out; + return -EINVAL; /* ... which should not be shared */ if (IS_MNT_SHARED(old->mnt_parent)) - goto out; + return -EINVAL; /* ... and the target should be in our namespace */ if (!check_mnt(p)) - goto out; + return -EINVAL; } else { /* * otherwise the source must be the root of some anon namespace. */ if (!anon_ns_root(old)) - goto out; + return -EINVAL; /* * Bail out early if the target is within the same namespace - * subsequent checks would've rejected that, but they lose * some corner cases if we check it early. */ if (old->mnt_ns == p->mnt_ns) - goto out; + return -EINVAL; /* * Target should be either in our namespace or in an acceptable * anon namespace, sensu check_anonymous_mnt(). 
*/ if (!may_use_mount(p)) - goto out; + return -EINVAL; } if (beneath) { err = can_move_mount_beneath(old, new_path, mp.mp); if (err) - goto out; + return err; - err = -EINVAL; p = p->mnt_parent; } @@ -3625,17 +3633,13 @@ static int do_move_mount(struct path *old_path, * mount which is shared. */ if (IS_MNT_SHARED(p) && tree_contains_unbindable(old)) - goto out; - err = -ELOOP; + return -EINVAL; if (!check_for_nsfs_mounts(old)) - goto out; + return -ELOOP; if (mount_is_ancestor(old, p)) - goto out; + return -ELOOP; - err = attach_recursive_mnt(old, p, mp.mp); -out: - unlock_mount(&mp); - return err; + return attach_recursive_mnt(old, p, mp.mp); } static int do_move_mount_old(struct path *path, const char *old_name) @@ -3694,7 +3698,6 @@ static bool mount_too_revealing(const struct super_block *sb, int *new_mnt_flags static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint, unsigned int mnt_flags) { - struct pinned_mountpoint mp = {}; struct super_block *sb; struct vfsmount *mnt __free(mntput) = fc_mount(fc); int error; @@ -3712,13 +3715,14 @@ static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint, mnt_warn_timestamp_expiry(mountpoint, mnt); - error = lock_mount(mountpoint, &mp); - if (!error) { + LOCK_MOUNT(mp, mountpoint); + if (IS_ERR(mp.parent)) { + return PTR_ERR(mp.parent); + } else { error = do_add_mount(real_mount(mnt), mp.mp, mountpoint, mnt_flags); if (!error) retain_and_null_ptr(mnt); // consumed on success - unlock_mount(&mp); } return error; } @@ -3780,8 +3784,8 @@ static int do_new_mount(struct path *path, const char *fstype, int sb_flags, return err; } -static int lock_mount_exact(const struct path *path, - struct pinned_mountpoint *mp) +static void lock_mount_exact(const struct path *path, + struct pinned_mountpoint *mp) { struct dentry *dentry = path->dentry; int err; @@ -3797,14 +3801,15 @@ static int lock_mount_exact(const struct path *path, if (unlikely(err)) { namespace_unlock(); inode_unlock(dentry->d_inode); 
+ mp->parent = ERR_PTR(err); + } else { + mp->parent = real_mount(path->mnt); } - return err; } int finish_automount(struct vfsmount *__m, const struct path *path) { struct vfsmount *m __free(mntput) = __m; - struct pinned_mountpoint mp = {}; struct mount *mnt; int err; @@ -3823,15 +3828,14 @@ int finish_automount(struct vfsmount *__m, const struct path *path) * that overmounts our mountpoint to be means "quitely drop what we've * got", not "try to mount it on top". */ - err = lock_mount_exact(path, &mp); - if (unlikely(err)) - return err == -EBUSY ? 0 : err; + LOCK_MOUNT_EXACT(mp, path); + if (IS_ERR(mp.parent)) + return mp.parent == ERR_PTR(-EBUSY) ? 0 : PTR_ERR(mp.parent); err = do_add_mount(mnt, mp.mp, path, path->mnt->mnt_flags | MNT_SHRINKABLE); if (likely(!err)) retain_and_null_ptr(m); - unlock_mount(&mp); return err; } @@ -4627,7 +4631,6 @@ SYSCALL_DEFINE2(pivot_root, const char __user *, new_root, struct path old __free(path_put) = {}; struct path root __free(path_put) = {}; struct mount *new_mnt, *root_mnt, *old_mnt, *root_parent, *ex_parent; - struct pinned_mountpoint old_mp = {}; int error; if (!may_mount()) @@ -4648,45 +4651,42 @@ SYSCALL_DEFINE2(pivot_root, const char __user *, new_root, return error; get_fs_root(current->fs, &root); - error = lock_mount(&old, &old_mp); - if (error) - return error; - error = -EINVAL; + LOCK_MOUNT(old_mp, &old); + old_mnt = old_mp.parent; + if (IS_ERR(old_mnt)) + return PTR_ERR(old_mnt); + new_mnt = real_mount(new.mnt); root_mnt = real_mount(root.mnt); - old_mnt = real_mount(old.mnt); ex_parent = new_mnt->mnt_parent; root_parent = root_mnt->mnt_parent; if (IS_MNT_SHARED(old_mnt) || IS_MNT_SHARED(ex_parent) || IS_MNT_SHARED(root_parent)) - goto out4; + return -EINVAL; if (!check_mnt(root_mnt) || !check_mnt(new_mnt)) - goto out4; + return -EINVAL; if (new_mnt->mnt.mnt_flags & MNT_LOCKED) - goto out4; - error = -ENOENT; + return -EINVAL; if (d_unlinked(new.dentry)) - goto out4; - error = -EBUSY; + return -ENOENT; if 
(new_mnt == root_mnt || old_mnt == root_mnt) - goto out4; /* loop, on the same file system */ - error = -EINVAL; + return -EBUSY; /* loop, on the same file system */ if (!path_mounted(&root)) - goto out4; /* not a mountpoint */ + return -EINVAL; /* not a mountpoint */ if (!mnt_has_parent(root_mnt)) - goto out4; /* absolute root */ + return -EINVAL; /* absolute root */ if (!path_mounted(&new)) - goto out4; /* not a mountpoint */ + return -EINVAL; /* not a mountpoint */ if (!mnt_has_parent(new_mnt)) - goto out4; /* absolute root */ + return -EINVAL; /* absolute root */ /* make sure we can reach put_old from new_root */ if (!is_path_reachable(old_mnt, old.dentry, &new)) - goto out4; + return -EINVAL; /* make certain new is below the root */ if (!is_path_reachable(new_mnt, new.dentry, &root)) - goto out4; + return -EINVAL; lock_mount_hash(); umount_mnt(new_mnt); if (root_mnt->mnt.mnt_flags & MNT_LOCKED) { @@ -4705,10 +4705,7 @@ SYSCALL_DEFINE2(pivot_root, const char __user *, new_root, mnt_notify_add(root_mnt); mnt_notify_add(new_mnt); chroot_fs_refs(&root, &new); - error = 0; -out4: - unlock_mount(&old_mp); - return error; + return 0; } static unsigned int recalc_flags(struct mount_kattr *kattr, struct mount *mnt) -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
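The calling convention introduced by this patch (lock function returns void, the result or an ERR_PTR() lands in the context, unlock runs from a cleanup handler and skips contexts that never got locked) can be modelled in userspace roughly as below. This is a sketch under assumptions, for GCC or Clang: ERR_PTR(), PTR_ERR() and IS_ERR() are local re-implementations that mimic the kernel helpers of the same name, and fake_mount, pinned, do_lock(), unlock(), LOCK_IT() and mount_like() are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Local imitations of the kernel's error-pointer helpers. */
#define MAX_ERRNO 4095
static inline void *ERR_PTR(long err) { return (void *)err; }
static inline long PTR_ERR(const void *p) { return (long)p; }
static inline int IS_ERR(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

struct fake_mount {
	int id;
};

static int lock_depth;	/* models the locks taken on success */

/* Context filled in by do_lock(); parent is a mount or an ERR_PTR(). */
struct pinned {
	struct fake_mount *parent;
};

static void do_lock(struct fake_mount *target, struct pinned *res)
{
	if (!target) {
		res->parent = ERR_PTR(-ENOENT);	/* failure: nothing locked */
		return;
	}
	lock_depth++;
	res->parent = target;
}

/* Cleanup handler: undo the locking only if it actually happened. */
static void unlock(struct pinned *res)
{
	if (!IS_ERR(res->parent))
		lock_depth--;
}

/* Declares the context in place, since it must not be copied around. */
#define LOCK_IT(mp, target) \
	struct pinned mp __attribute__((cleanup(unlock))) = { NULL }; \
	do_lock((target), &mp)

static int mount_like(struct fake_mount *target, int fail_late)
{
	LOCK_IT(mp, target);
	if (IS_ERR(mp.parent))
		return (int)PTR_ERR(mp.parent);
	if (fail_late)
		return -EINVAL;	/* unlocked by the cleanup handler */
	return mp.parent->id;	/* same for the success path */
}
```

The macro declares the variable at the point of use rather than returning it by value, which matches the stated reason for not using CLASS(): the context is linked into a list while it is being set up, so it has to stay at a fixed address.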
* Re: [PATCH v2 28/63] change calling conventions for lock_mount() et.al. 2025-08-28 23:07 ` [PATCH v2 28/63] change calling conventions for lock_mount() et.al Al Viro @ 2025-09-01 11:37 ` Christian Brauner 0 siblings, 0 replies; 321+ messages in thread From: Christian Brauner @ 2025-09-01 11:37 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, jack, torvalds On Fri, Aug 29, 2025 at 12:07:31AM +0100, Al Viro wrote: > 1) pinned_mountpoint gets a new member - struct mount *parent. > Set only if we locked the sucker; ERR_PTR() - on failed attempt. > > 2) do_lock_mount() et.al. return void and set ->parent to > * on success with !beneath - mount corresponding to path->mnt > * on success with beneath - the parent of mount corresponding > to path->mnt > * in case of error - ERR_PTR(-E...). > IOW, we get the mount we will be actually mounting upon or ERR_PTR(). > > 3) we can't use CLASS, since the pinned_mountpoint is placed on > hlist during initialization, so we define local macros: > LOCK_MOUNT(mp, path) > LOCK_MOUNT_MAYBE_BENEATH(mp, path, beneath) > LOCK_MOUNT_EXACT(mp, path) > All of them declare and initialize struct pinned_mountpoint mp, > with unlock_mount done via __cleanup(). > > Users converted. > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > --- This is nice! Thanks! Reviewed-by: Christian Brauner <brauner@kernel.org> ^ permalink raw reply [flat|nested] 321+ messages in thread
* [PATCH v2 29/63] do_move_mount(): use the parent mount returned by do_lock_mount() 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (26 preceding siblings ...) 2025-08-28 23:07 ` [PATCH v2 28/63] change calling conventions for lock_mount() et.al Al Viro @ 2025-08-28 23:07 ` Al Viro 2025-09-01 11:38 ` Christian Brauner 2025-08-28 23:07 ` [PATCH v2 30/63] do_add_mount(): switch to passing pinned_mountpoint instead of mountpoint + path Al Viro ` (33 subsequent siblings) 61 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds After successful do_lock_mount() call, mp.parent is set to either real_mount(path->mnt) (for !beneath case) or to ->mnt_parent of that (for beneath). p is set to real_mount(path->mnt) and after several uses it's made equal to mp.parent. All uses prior to that care only about p->mnt_ns and since p->mnt_ns == parent->mnt_ns, we might as well use mp.parent all along. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 17 ++++++----------- 1 file changed, 6 insertions(+), 11 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 952e66bdb9bb..d57e727962da 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3571,7 +3571,6 @@ static inline bool may_use_mount(struct mount *mnt) static int do_move_mount(struct path *old_path, struct path *new_path, enum mnt_tree_flags_t flags) { - struct mount *p; struct mount *old = real_mount(old_path->mnt); int err; bool beneath = flags & MNT_TREE_BENEATH; @@ -3586,8 +3585,6 @@ static int do_move_mount(struct path *old_path, if (IS_ERR(mp.parent)) return PTR_ERR(mp.parent); - p = real_mount(new_path->mnt); - if (check_mnt(old)) { /* if the source is in our namespace... */ /* ... it should be detachable from parent */ @@ -3597,7 +3594,7 @@ static int do_move_mount(struct path *old_path, if (IS_MNT_SHARED(old->mnt_parent)) return -EINVAL; /* ... 
and the target should be in our namespace */ - if (!check_mnt(p)) + if (!check_mnt(mp.parent)) return -EINVAL; } else { /* @@ -3610,13 +3607,13 @@ static int do_move_mount(struct path *old_path, * subsequent checks would've rejected that, but they lose * some corner cases if we check it early. */ - if (old->mnt_ns == p->mnt_ns) + if (old->mnt_ns == mp.parent->mnt_ns) return -EINVAL; /* * Target should be either in our namespace or in an acceptable * anon namespace, sensu check_anonymous_mnt(). */ - if (!may_use_mount(p)) + if (!may_use_mount(mp.parent)) return -EINVAL; } @@ -3624,22 +3621,20 @@ static int do_move_mount(struct path *old_path, err = can_move_mount_beneath(old, new_path, mp.mp); if (err) return err; - - p = p->mnt_parent; } /* * Don't move a mount tree containing unbindable mounts to a destination * mount which is shared. */ - if (IS_MNT_SHARED(p) && tree_contains_unbindable(old)) + if (IS_MNT_SHARED(mp.parent) && tree_contains_unbindable(old)) return -EINVAL; if (!check_for_nsfs_mounts(old)) return -ELOOP; - if (mount_is_ancestor(old, p)) + if (mount_is_ancestor(old, mp.parent)) return -ELOOP; - return attach_recursive_mnt(old, p, mp.mp); + return attach_recursive_mnt(old, mp.parent, mp.mp); } static int do_move_mount_old(struct path *path, const char *old_name) -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* Re: [PATCH v2 29/63] do_move_mount(): use the parent mount returned by do_lock_mount()
  2025-08-28 23:07 ` [PATCH v2 29/63] do_move_mount(): use the parent mount returned by do_lock_mount() Al Viro
@ 2025-09-01 11:38 ` Christian Brauner
  0 siblings, 0 replies; 321+ messages in thread
From: Christian Brauner @ 2025-09-01 11:38 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, jack, torvalds

On Fri, Aug 29, 2025 at 12:07:32AM +0100, Al Viro wrote:
> After successful do_lock_mount() call, mp.parent is set to either
> real_mount(path->mnt) (for !beneath case) or to ->mnt_parent of that
> (for beneath). p is set to real_mount(path->mnt) and after
> several uses it's made equal to mp.parent. All uses prior to that
> care only about p->mnt_ns and since p->mnt_ns == parent->mnt_ns,
> we might as well use mp.parent all along.
>
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---

Reviewed-by: Christian Brauner <brauner@kernel.org>

^ permalink raw reply [flat|nested] 321+ messages in thread
* [PATCH v2 30/63] do_add_mount(): switch to passing pinned_mountpoint instead of mountpoint + path 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (27 preceding siblings ...) 2025-08-28 23:07 ` [PATCH v2 29/63] do_move_mount(): use the parent mount returned by do_lock_mount() Al Viro @ 2025-08-28 23:07 ` Al Viro 2025-09-01 11:40 ` Christian Brauner 2025-08-28 23:07 ` [PATCH v2 31/63] graft_tree(), attach_recursive_mnt() - pass pinned_mountpoint Al Viro ` (32 subsequent siblings) 61 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Both callers pass it a mountpoint reference picked from pinned_mountpoint and path it corresponds to. First of all, path->dentry is equal to mp.mp->m_dentry. Furthermore, path->mnt is &mp.parent->mnt, making struct path contents redundant. Pass it the address of that pinned_mountpoint instead; what's more, if we teach it to treat ERR_PTR(error) in ->parent as "bail out with that error" we can simplify the callers even more - do_add_mount() will do the right thing even when called after lock_mount() failure. 
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 32 +++++++++++++++----------------- 1 file changed, 15 insertions(+), 17 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index d57e727962da..b236536bbbc9 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3657,10 +3657,13 @@ static int do_move_mount_old(struct path *path, const char *old_name) /* * add a mount into a namespace's mount tree */ -static int do_add_mount(struct mount *newmnt, struct mountpoint *mp, - const struct path *path, int mnt_flags) +static int do_add_mount(struct mount *newmnt, const struct pinned_mountpoint *mp, + int mnt_flags) { - struct mount *parent = real_mount(path->mnt); + struct mount *parent = mp->parent; + + if (IS_ERR(parent)) + return PTR_ERR(parent); mnt_flags &= ~MNT_INTERNAL_FLAGS; @@ -3674,14 +3677,15 @@ static int do_add_mount(struct mount *newmnt, struct mountpoint *mp, } /* Refuse the same filesystem on the same mount point */ - if (path->mnt->mnt_sb == newmnt->mnt.mnt_sb && path_mounted(path)) + if (parent->mnt.mnt_sb == newmnt->mnt.mnt_sb && + parent->mnt.mnt_root == mp->mp->m_dentry) return -EBUSY; if (d_is_symlink(newmnt->mnt.mnt_root)) return -EINVAL; newmnt->mnt.mnt_flags = mnt_flags; - return graft_tree(newmnt, parent, mp); + return graft_tree(newmnt, parent, mp->mp); } static bool mount_too_revealing(const struct super_block *sb, int *new_mnt_flags); @@ -3711,14 +3715,9 @@ static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint, mnt_warn_timestamp_expiry(mountpoint, mnt); LOCK_MOUNT(mp, mountpoint); - if (IS_ERR(mp.parent)) { - return PTR_ERR(mp.parent); - } else { - error = do_add_mount(real_mount(mnt), mp.mp, - mountpoint, mnt_flags); - if (!error) - retain_and_null_ptr(mnt); // consumed on success - } + error = do_add_mount(real_mount(mnt), &mp, mnt_flags); + if (!error) + retain_and_null_ptr(mnt); // consumed on success return error; } @@ -3824,11 +3823,10 @@ int finish_automount(struct vfsmount *__m, const struct 
path *path) * got", not "try to mount it on top". */ LOCK_MOUNT_EXACT(mp, path); - if (IS_ERR(mp.parent)) - return mp.parent == ERR_PTR(-EBUSY) ? 0 : PTR_ERR(mp.parent); + if (mp.parent == ERR_PTR(-EBUSY)) + return 0; - err = do_add_mount(mnt, mp.mp, path, - path->mnt->mnt_flags | MNT_SHRINKABLE); + err = do_add_mount(mnt, &mp, path->mnt->mnt_flags | MNT_SHRINKABLE); if (likely(!err)) retain_and_null_ptr(m); return err; -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* Re: [PATCH v2 30/63] do_add_mount(): switch to passing pinned_mountpoint instead of mountpoint + path
  2025-08-28 23:07 ` [PATCH v2 30/63] do_add_mount(): switch to passing pinned_mountpoint instead of mountpoint + path Al Viro
@ 2025-09-01 11:40 ` Christian Brauner
  0 siblings, 0 replies; 321+ messages in thread
From: Christian Brauner @ 2025-09-01 11:40 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, jack, torvalds

On Fri, Aug 29, 2025 at 12:07:33AM +0100, Al Viro wrote:
> Both callers pass it a mountpoint reference picked from pinned_mountpoint
> and path it corresponds to.
>
> First of all, path->dentry is equal to mp.mp->m_dentry. Furthermore, path->mnt
> is &mp.parent->mnt, making struct path contents redundant.
>
> Pass it the address of that pinned_mountpoint instead; what's more, if we
> teach it to treat ERR_PTR(error) in ->parent as "bail out with that error"
> we can simplify the callers even more - do_add_mount() will do the right
> thing even when called after lock_mount() failure.
>
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---

Reviewed-by: Christian Brauner <brauner@kernel.org>

^ permalink raw reply [flat|nested] 321+ messages in thread
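The "treat ERR_PTR(error) in ->parent as bail out with that error" idea from the quoted commit message can be shown in a few lines of userspace C. This is only a sketch under stated assumptions: ERR_PTR(), PTR_ERR() and IS_ERR() are reimplemented here in the spirit of the kernel helpers of the same name, while struct parent, struct pinned, lock_it(), add_it() and caller() are hypothetical stand-ins, not the functions touched by the patch.

```c
#include <errno.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Userspace imitations of the kernel's error-pointer helpers. */
static inline void *ERR_PTR(long err) { return (void *)(intptr_t)err; }
static inline long PTR_ERR(const void *p) { return (long)(intptr_t)p; }
static inline int IS_ERR(const void *p)
{
	/* Error pointers occupy the top MAX_ERRNO values of the address space. */
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

struct parent { int id; };			/* hypothetical */
struct pinned { struct parent *parent; };	/* hypothetical */

static struct parent the_parent = { .id = 7 };

/* Returns void; success or failure is encoded in res->parent. */
static void lock_it(struct pinned *res, int fail)
{
	res->parent = fail ? ERR_PTR(-ENOENT) : &the_parent;
}

/* The callee checks for the error itself, so callers need no branch. */
static int add_it(const struct pinned *mp)
{
	if (IS_ERR(mp->parent))
		return (int)PTR_ERR(mp->parent);
	return mp->parent->id;
}

/* The caller collapses to "lock, then call", with no IS_ERR() of its own. */
int caller(int fail)
{
	struct pinned mp;

	lock_it(&mp, fail);
	return add_it(&mp);
}
```

The cost of this design is that the callee has to tolerate being handed a failed lock attempt; the benefit is that each caller loses an if/else, which is the simplification visible in the do_new_mount_fc() hunk of the patch.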
* [PATCH v2 31/63] graft_tree(), attach_recursive_mnt() - pass pinned_mountpoint 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (28 preceding siblings ...) 2025-08-28 23:07 ` [PATCH v2 30/63] do_add_mount(): switch to passing pinned_mountpoint instead of mountpoint + path Al Viro @ 2025-08-28 23:07 ` Al Viro 2025-09-01 11:41 ` Christian Brauner 2025-08-28 23:07 ` [PATCH v2 32/63] pivot_root(2): use old_mp.mp->m_dentry instead of old.dentry Al Viro ` (31 subsequent siblings) 61 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds parent and mountpoint always come from the same struct pinned_mountpoint now. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index b236536bbbc9..18d6ad0f4f76 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2549,8 +2549,7 @@ enum mnt_tree_flags_t { /** * attach_recursive_mnt - attach a source mount tree * @source_mnt: mount tree to be attached - * @dest_mnt: mount that @source_mnt will be mounted on - * @dest_mp: the mountpoint @source_mnt will be mounted at + * @dest: the context for mounting at the place where the tree should go * * NOTE: in the table below explains the semantics when a source mount * of a given type is attached to a destination mount of a given type. @@ -2613,10 +2612,11 @@ enum mnt_tree_flags_t { * Otherwise a negative error code is returned. 
*/ static int attach_recursive_mnt(struct mount *source_mnt, - struct mount *dest_mnt, - struct mountpoint *dest_mp) + const struct pinned_mountpoint *dest) { struct user_namespace *user_ns = current->nsproxy->mnt_ns->user_ns; + struct mount *dest_mnt = dest->parent; + struct mountpoint *dest_mp = dest->mp; HLIST_HEAD(tree_list); struct mnt_namespace *ns = dest_mnt->mnt_ns; struct pinned_mountpoint root = {}; @@ -2864,16 +2864,16 @@ static inline void unlock_mount(struct pinned_mountpoint *m) struct pinned_mountpoint mp __cleanup(unlock_mount) = {}; \ lock_mount_exact((path), &mp) -static int graft_tree(struct mount *mnt, struct mount *p, struct mountpoint *mp) +static int graft_tree(struct mount *mnt, const struct pinned_mountpoint *mp) { if (mnt->mnt.mnt_sb->s_flags & SB_NOUSER) return -EINVAL; - if (d_is_dir(mp->m_dentry) != + if (d_is_dir(mp->mp->m_dentry) != d_is_dir(mnt->mnt.mnt_root)) return -ENOTDIR; - return attach_recursive_mnt(mnt, p, mp); + return attach_recursive_mnt(mnt, mp); } static int may_change_propagation(const struct mount *m) @@ -3055,7 +3055,7 @@ static int do_loopback(struct path *path, const char *old_name, if (IS_ERR(mnt)) return PTR_ERR(mnt); - err = graft_tree(mnt, mp.parent, mp.mp); + err = graft_tree(mnt, &mp); if (err) { lock_mount_hash(); umount_tree(mnt, UMOUNT_SYNC); @@ -3634,7 +3634,7 @@ static int do_move_mount(struct path *old_path, if (mount_is_ancestor(old, mp.parent)) return -ELOOP; - return attach_recursive_mnt(old, mp.parent, mp.mp); + return attach_recursive_mnt(old, &mp); } static int do_move_mount_old(struct path *path, const char *old_name) @@ -3685,7 +3685,7 @@ static int do_add_mount(struct mount *newmnt, const struct pinned_mountpoint *mp return -EINVAL; newmnt->mnt.mnt_flags = mnt_flags; - return graft_tree(newmnt, parent, mp->mp); + return graft_tree(newmnt, mp); } static bool mount_too_revealing(const struct super_block *sb, int *new_mnt_flags); -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages 
in thread
* Re: [PATCH v2 31/63] graft_tree(), attach_recursive_mnt() - pass pinned_mountpoint
  2025-08-28 23:07 ` [PATCH v2 31/63] graft_tree(), attach_recursive_mnt() - pass pinned_mountpoint Al Viro
@ 2025-09-01 11:41 ` Christian Brauner
  0 siblings, 0 replies; 321+ messages in thread
From: Christian Brauner @ 2025-09-01 11:41 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, jack, torvalds

On Fri, Aug 29, 2025 at 12:07:34AM +0100, Al Viro wrote:
> parent and mountpoint always come from the same struct pinned_mountpoint
> now.
>
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---

Reviewed-by: Christian Brauner <brauner@kernel.org>

^ permalink raw reply [flat|nested] 321+ messages in thread
* [PATCH v2 32/63] pivot_root(2): use old_mp.mp->m_dentry instead of old.dentry
  2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro
  ` (29 preceding siblings ...)
  2025-08-28 23:07 ` [PATCH v2 31/63] graft_tree(), attach_recursive_mnt() - pass pinned_mountpoint Al Viro
@ 2025-08-28 23:07 ` Al Viro
  2025-08-28 23:07 ` [PATCH v2 33/63] don't bother passing new_path->dentry to can_move_mount_beneath() Al Viro
  ` (30 subsequent siblings)
  61 siblings, 0 replies; 321+ messages in thread
From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: brauner, jack, torvalds

That kills the last place where callers of lock_mount(path, &mp)
used path->dentry.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/namespace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 18d6ad0f4f76..02bc5294071a 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -4675,7 +4675,7 @@ SYSCALL_DEFINE2(pivot_root, const char __user *, new_root,
 	if (!mnt_has_parent(new_mnt))
 		return -EINVAL; /* absolute root */
 	/* make sure we can reach put_old from new_root */
-	if (!is_path_reachable(old_mnt, old.dentry, &new))
+	if (!is_path_reachable(old_mnt, old_mp.mp->m_dentry, &new))
 		return -EINVAL;
 	/* make certain new is below the root */
 	if (!is_path_reachable(new_mnt, new.dentry, &root))
-- 
2.47.2

^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v2 33/63] don't bother passing new_path->dentry to can_move_mount_beneath()
  2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro
  ` (30 preceding siblings ...)
  2025-08-28 23:07 ` [PATCH v2 32/63] pivot_root(2): use old_mp.mp->m_dentry instead of old.dentry Al Viro
@ 2025-08-28 23:07 ` Al Viro
  2025-08-28 23:20 ` Linus Torvalds
  2025-08-28 23:07 ` [PATCH v2 34/63] new helper: topmost_overmount() Al Viro
  ` (29 subsequent siblings)
  61 siblings, 1 reply; 321+ messages in thread
From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: brauner, jack, torvalds

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/namespace.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 02bc5294071a..085877bfaa5e 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3450,8 +3450,8 @@ static bool mount_is_ancestor(const struct mount *p1, const struct mount *p2)
 /**
  * can_move_mount_beneath - check that we can mount beneath the top mount
  * @mnt_from: mount we are trying to move
- * @to: mount under which to mount
- * @mp: mountpoint of @to
+ * @mnt_to: mount under which to mount
+ * @mp: mountpoint of @mnt_to
  *
  * - Make sure that nothing can be mounted beneath the caller's current
  *   root or the rootfs of the namespace.
@@ -3467,11 +3467,10 @@ static bool mount_is_ancestor(const struct mount *p1, const struct mount *p2)
  * Return: On success 0, and on error a negative error code is returned.
  */
 static int can_move_mount_beneath(struct mount *mnt_from,
-				  const struct path *to,
+				  struct mount *mnt_to,
 				  const struct mountpoint *mp)
 {
-	struct mount *mnt_to = real_mount(to->mnt),
-		     *parent_mnt_to = mnt_to->mnt_parent;
+	struct mount *parent_mnt_to = mnt_to->mnt_parent;
 
 	if (IS_MNT_LOCKED(mnt_to))
 		return -EINVAL;
@@ -3618,7 +3617,7 @@ static int do_move_mount(struct path *old_path,
 	}
 
 	if (beneath) {
-		err = can_move_mount_beneath(old, new_path, mp.mp);
+		err = can_move_mount_beneath(old, real_mount(new_path->mnt), mp.mp);
 		if (err)
 			return err;
 	}
-- 
2.47.2

^ permalink raw reply related [flat|nested] 321+ messages in thread
* Re: [PATCH v2 33/63] don't bother passing new_path->dentry to can_move_mount_beneath()
  2025-08-28 23:07 ` [PATCH v2 33/63] don't bother passing new_path->dentry to can_move_mount_beneath() Al Viro
@ 2025-08-28 23:20 ` Linus Torvalds
  2025-08-28 23:39 ` Al Viro
  0 siblings, 1 reply; 321+ messages in thread
From: Linus Torvalds @ 2025-08-28 23:20 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, brauner, jack

On Thu, 28 Aug 2025 at 16:08, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
>         if (beneath) {
> -               err = can_move_mount_beneath(old, new_path, mp.mp);
> +               err = can_move_mount_beneath(old, real_mount(new_path->mnt), mp.mp);
>                 if (err)
>                         return err;
>         }

Going through the patches, this is one that I think made things
uglier... Most of them make me go "nice simplification".

(I'll have a separate comment on 61/63)

I certainly agree with the intent of the patch, but that
can_move_mount_beneath() call line is now rather hard to read. It
looked simpler before.

Maybe you could just split it into two lines, and write it as

        if (beneath) {
                struct mount *new_mnt = real_mount(new_path->mnt);
                err = can_move_mount_beneath(old, new_mnt, mp.mp);
                if (err)
                        return err;
        }

which makes slightly less happen in that one line (and it fits in 80
columns too - not a requirement, but still "good taste")

Long lines are better than randomly splitting lines unreadably into
multiple lines, but short lines that are logically split are still
preferred, I would say..

          Linus

^ permalink raw reply [flat|nested] 321+ messages in thread
* Re: [PATCH v2 33/63] don't bother passing new_path->dentry to can_move_mount_beneath()
  2025-08-28 23:20 ` Linus Torvalds
@ 2025-08-28 23:39 ` Al Viro
  0 siblings, 0 replies; 321+ messages in thread
From: Al Viro @ 2025-08-28 23:39 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-fsdevel, brauner, jack

On Thu, Aug 28, 2025 at 04:20:56PM -0700, Linus Torvalds wrote:

> (I'll have a separate comment on 61/63)
>
> I certainly agree with the intent of the patch, but that
> can_move_mount_beneath() call line is now rather hard to read. It
> looked simpler before.
>
> Maybe you could just split it into two lines, and write it as
>
>         if (beneath) {
>                 struct mount *new_mnt = real_mount(new_path->mnt);
>                 err = can_move_mount_beneath(old, new_mnt, mp.mp);
>                 if (err)
>                         return err;
>         }
>
> which makes slightly less happen in that one line (and it fits in 80
> columns too - not a requirement, but still "good taste")
>
> Long lines are better than randomly splitting lines unreadably into
> multiple lines, but short lines that are logically split are still
> preferred, I would say..

FWIW, if you look at #35, you'll see this:

-		err = can_move_mount_beneath(old, real_mount(new_path->mnt), mp.mp);
+		struct mount *over = real_mount(new_path->mnt);
+
+		if (mp.parent != over->mnt_parent)
+			over = mp.parent->overmount;
+		err = can_move_mount_beneath(old, over, mp.mp);

So... might as well introduce the variable in this one. Then this
chunk becomes

@@ -3618,7 +3617,9 @@ static int do_move_mount(struct path *old_path,
 	}
 
 	if (beneath) {
-		err = can_move_mount_beneath(old, new_path, mp.mp);
+		struct mount *over = real_mount(new_path->mnt);
+
+		err = can_move_mount_beneath(old, over, mp.mp);
 		if (err)
 			return err;
 	}

and the corresponding one in #35

@@ -3618,6 +3624,8 @@ static int do_move_mount(struct path *old_path,
 	if (beneath) {
 		struct mount *over = real_mount(new_path->mnt);
 
+		if (mp.parent != over->mnt_parent)
+			over = mp.parent->overmount;
 		err = can_move_mount_beneath(old, over, mp.mp);
 		if (err)
 			return err;

OK, done - both certainly look better that way.

^ permalink raw reply [flat|nested] 321+ messages in thread
* [PATCH v2 34/63] new helper: topmost_overmount()
  2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro
  ` (31 preceding siblings ...)
  2025-08-28 23:07 ` [PATCH v2 33/63] don't bother passing new_path->dentry to can_move_mount_beneath() Al Viro
@ 2025-08-28 23:07 ` Al Viro
  2025-08-28 23:07 ` [PATCH v2 35/63] do_lock_mount(): don't modify path Al Viro
  ` (28 subsequent siblings)
  61 siblings, 0 replies; 321+ messages in thread
From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: brauner, jack, torvalds

Returns the final (topmost) mount in the chain of overmounts
starting at given mount. Same locking rules as for any mount
tree traversal - either the spinlock side of mount_lock, or
rcu + sample the seqcount side of mount_lock before the call
and recheck afterwards.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/mount.h     | 7 +++++++
 fs/namespace.c | 9 +++------
 2 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/fs/mount.h b/fs/mount.h
index ed8c83ba836a..04d0eadc4c10 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -235,4 +235,11 @@ static inline void mnt_notify_add(struct mount *m)
 }
 #endif
 
+static inline struct mount *topmost_overmount(struct mount *m)
+{
+	while (m->overmount)
+		m = m->overmount;
+	return m;
+}
+
 struct mnt_namespace *mnt_ns_from_dentry(struct dentry *dentry);
diff --git a/fs/namespace.c b/fs/namespace.c
index 085877bfaa5e..ebecb03972c5 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2696,10 +2696,9 @@ static int attach_recursive_mnt(struct mount *source_mnt,
 				 child->mnt_mountpoint);
 		commit_tree(child);
 		if (q) {
+			struct mount *r = topmost_overmount(child);
 			struct mountpoint *mp = root.mp;
-			struct mount *r = child;
-			while (unlikely(r->overmount))
-				r = r->overmount;
+
 			if (unlikely(shorter) && child != source_mnt)
 				mp = shorter;
 			mnt_change_mountpoint(r, mp, q);
@@ -6171,9 +6170,7 @@ bool current_chrooted(void)
 
 	guard(mount_locked_reader)();
 
-	root = current->nsproxy->mnt_ns->root;
-	while (unlikely(root->overmount))
-		root = root->overmount;
+	root = topmost_overmount(current->nsproxy->mnt_ns->root);
 
 	return fs_root.mnt != &root->mnt || !path_mounted(&fs_root);
 }
-- 
2.47.2

^ permalink raw reply related [flat|nested] 321+ messages in thread
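A toy userspace model of the helper, for anyone who wants to poke at the traversal outside the kernel. The loop is the same as in the patch; the two-field struct mount below is a hypothetical reduction (the id member exists only for the illustration), and the locking requirements stated in the commit message are not modeled at all.

```c
#include <stddef.h>

/* Hypothetical, stripped-down mount: only the overmount link matters here. */
struct mount {
	struct mount *overmount;	/* mount stacked on this one's root, or NULL */
	int id;				/* illustration only */
};

/* Follow the overmount links until we reach a mount with nothing on top. */
static inline struct mount *topmost_overmount(struct mount *m)
{
	while (m->overmount)
		m = m->overmount;
	return m;
}
```

A mount with nothing stacked on it is its own topmost overmount, so callers do not need a separate "is it overmounted at all?" check before calling.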
* [PATCH v2 35/63] do_lock_mount(): don't modify path. 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (32 preceding siblings ...) 2025-08-28 23:07 ` [PATCH v2 34/63] new helper: topmost_overmount() Al Viro @ 2025-08-28 23:07 ` Al Viro 2025-09-02 10:55 ` Christian Brauner 2025-08-28 23:07 ` [PATCH v2 36/63] constify check_mnt() Al Viro ` (27 subsequent siblings) 61 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Currently do_lock_mount() has the target path switched to whatever might be overmounting it. We _do_ want to have the parent mount/mountpoint chosen on top of the overmounting pile; however, the way it's done has unpleasant races - if umount propagation removes the overmount while we'd been trying to set the environment up, we might end up failing if our target path strays into that overmount just before the overmount gets kicked out. Users of do_lock_mount() do not need the target path changed - they have all information in res->{parent,mp}; only one place (in do_move_mount()) currently uses the resulting path->mnt, and that value is trivial to reconstruct by the original value of path->mnt + chosen parent mount. 

Let's keep the target path unchanged; it avoids a bunch of subtle races
and it's not hard to do:
	do
		as mount_locked_reader
			find the prospective parent mount/mountpoint dentry
			grab references if it's not the original target
		lock the prospective mountpoint dentry
		take namespace_sem exclusive
		if prospective parent/mountpoint would be different now
			err = -EAGAIN
		else if location has been unmounted
			err = -ENOENT
		else if mountpoint dentry is not allowed to be mounted on
			err = -ENOENT
		else if beneath and the top of the pile was the absolute root
			err = -EINVAL
		else
			try to get struct mountpoint (by dentry), set
			err to 0 on success and -ENO{MEM,ENT} on failure
		if err != 0
			res->parent = ERR_PTR(err)
			drop locks
		else
			res->parent = prospective parent
		drop temporary references
	while err == -EAGAIN

A somewhat subtle part is that dropping temporary references is allowed.
Neither mounts nor dentries should be evicted by a thread that holds
namespace_sem. On success we are dropping those references under
namespace_sem, so we need to be sure that these are not the last
references remaining. However, on success we'd already verified (under
namespace_sem) that original target is still mounted and that mount
and dentry we are about to drop are still reachable from it via the
mount tree. That guarantees that we are not about to drop the last
remaining references.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 126 ++++++++++++++++++++++++++----------------------- 1 file changed, 68 insertions(+), 58 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index ebecb03972c5..b77d2df606a1 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2727,6 +2727,27 @@ static int attach_recursive_mnt(struct mount *source_mnt, return err; } +static inline struct mount *where_to_mount(const struct path *path, + struct dentry **dentry, + bool beneath) +{ + struct mount *m; + + if (unlikely(beneath)) { + m = topmost_overmount(real_mount(path->mnt)); + *dentry = m->mnt_mountpoint; + return m->mnt_parent; + } else { + m = __lookup_mnt(path->mnt, *dentry = path->dentry); + if (unlikely(m)) { + m = topmost_overmount(m); + *dentry = m->mnt.mnt_root; + return m; + } + return real_mount(path->mnt); + } +} + /** * do_lock_mount - acquire environment for mounting * @path: target path @@ -2758,84 +2779,69 @@ static int attach_recursive_mnt(struct mount *source_mnt, * case we also require the location to be at the root of a mount * that has a parent (i.e. is not a root of some namespace). 
*/ -static void do_lock_mount(struct path *path, struct pinned_mountpoint *res, bool beneath) +static void do_lock_mount(const struct path *path, + struct pinned_mountpoint *res, + bool beneath) { - struct vfsmount *mnt = path->mnt; - struct dentry *dentry; - struct path under = {}; - int err = -ENOENT; + int err; if (unlikely(beneath) && !path_mounted(path)) { res->parent = ERR_PTR(-EINVAL); return; } - for (;;) { - struct mount *m = real_mount(mnt); - - if (beneath) { - path_put(&under); - read_seqlock_excl(&mount_lock); - if (unlikely(!mnt_has_parent(m))) { - read_sequnlock_excl(&mount_lock); - res->parent = ERR_PTR(-EINVAL); - return; + do { + struct dentry *dentry, *d; + struct mount *m, *n; + + scoped_guard(mount_locked_reader) { + m = where_to_mount(path, &dentry, beneath); + if (&m->mnt != path->mnt) { + mntget(&m->mnt); + dget(dentry); } - under.mnt = mntget(&m->mnt_parent->mnt); - under.dentry = dget(m->mnt_mountpoint); - read_sequnlock_excl(&mount_lock); - dentry = under.dentry; - } else { - dentry = path->dentry; } inode_lock(dentry->d_inode); namespace_lock(); - if (unlikely(cant_mount(dentry) || !is_mounted(mnt))) - break; // not to be mounted on + // check if the chain of mounts (if any) has changed. 
+ scoped_guard(mount_locked_reader) + n = where_to_mount(path, &d, beneath); - if (beneath && unlikely(m->mnt_mountpoint != dentry || - &m->mnt_parent->mnt != under.mnt)) { - namespace_unlock(); - inode_unlock(dentry->d_inode); - continue; // got moved - } + if (unlikely(n != m || dentry != d)) + err = -EAGAIN; // something moved, retry + else if (unlikely(cant_mount(dentry) || !is_mounted(path->mnt))) + err = -ENOENT; // not to be mounted on + else if (beneath && &m->mnt == path->mnt && !m->overmount) + err = -EINVAL; + else + err = get_mountpoint(dentry, res); - mnt = lookup_mnt(path); - if (unlikely(mnt)) { + if (unlikely(err)) { + res->parent = ERR_PTR(err); namespace_unlock(); inode_unlock(dentry->d_inode); - path_put(path); - path->mnt = mnt; - path->dentry = dget(mnt->mnt_root); - continue; // got overmounted + } else { + res->parent = m; } - err = get_mountpoint(dentry, res); - if (err) - break; - if (beneath) { - /* - * @under duplicates the references that will stay - * at least until namespace_unlock(), so the path_put() - * below is safe (and OK to do under namespace_lock - - * we are not dropping the final references here). - */ - path_put(&under); - res->parent = real_mount(path->mnt)->mnt_parent; - return; + /* + * Drop the temporary references. This is subtle - on success + * we are doing that under namespace_sem, which would normally + * be forbidden. However, in that case we are guaranteed that + * refcounts won't reach zero, since we know that path->mnt + * is mounted and thus all mounts reachable from it are pinned + * and stable, along with their mountpoints and roots. 
+ */ + if (&m->mnt != path->mnt) { + dput(dentry); + mntput(&m->mnt); } - res->parent = real_mount(path->mnt); - return; - } - namespace_unlock(); - inode_unlock(dentry->d_inode); - if (beneath) - path_put(&under); - res->parent = ERR_PTR(err); + } while (err == -EAGAIN); } -static inline void lock_mount(struct path *path, struct pinned_mountpoint *m) +static inline void lock_mount(const struct path *path, + struct pinned_mountpoint *m) { do_lock_mount(path, m, false); } @@ -3616,7 +3622,11 @@ static int do_move_mount(struct path *old_path, } if (beneath) { - err = can_move_mount_beneath(old, real_mount(new_path->mnt), mp.mp); + struct mount *over = real_mount(new_path->mnt); + + if (mp.parent != over->mnt_parent) + over = mp.parent->overmount; + err = can_move_mount_beneath(old, over, mp.mp); if (err) return err; } -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
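The control flow of the new do_lock_mount() is a "sample, take the blocking locks, sample again, retry on mismatch" loop. Below is a single-threaded userspace sketch of just that shape. All names are hypothetical (struct world, sample_target(), take_locks(), drop_locks(), lock_target()), the racing change is simulated by a counter rather than by another thread, and reference counting and the real locks are left out entirely, so treat it as an illustration of the loop and nothing more.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical world state: the target can move while we wait for locks. */
struct world {
	int target;		/* what a lookup currently resolves to */
	int changes_left;	/* simulated racing changes still to come */
	int locked;		/* 1 while the "blocking locks" are held */
};

/* Cheap lookup, done before and after taking the blocking locks. */
static int sample_target(const struct world *w)
{
	return w->target;
}

/* Taking the locks may sleep; simulate somebody moving the target meanwhile. */
static void take_locks(struct world *w)
{
	if (w->changes_left > 0) {
		w->changes_left--;
		w->target++;
	}
	w->locked = 1;
}

static void drop_locks(struct world *w)
{
	assert(w->locked);
	w->locked = 0;
}

/*
 * Returns 0 with the locks held and *out set, or a negative errno with
 * the locks dropped. *attempts counts the trips through the loop.
 */
int lock_target(struct world *w, int *out, int *attempts)
{
	int err;

	do {
		int before = sample_target(w);
		int after;

		take_locks(w);
		after = sample_target(w);
		++*attempts;

		if (after != before)
			err = -EAGAIN;	/* something moved, retry */
		else if (after < 0)
			err = -ENOENT;	/* stand-in for "not to be mounted on" */
		else
			err = 0;

		if (err)
			drop_locks(w);
		else
			*out = after;
	} while (err == -EAGAIN);

	return err;
}
```

Because both samples go through the same lookup function, "did anything change?" reduces to comparing two results, which is the role where_to_mount() plays in the patch.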
* Re: [PATCH v2 35/63] do_lock_mount(): don't modify path. 2025-08-28 23:07 ` [PATCH v2 35/63] do_lock_mount(): don't modify path Al Viro @ 2025-09-02 10:55 ` Christian Brauner 0 siblings, 0 replies; 321+ messages in thread From: Christian Brauner @ 2025-09-02 10:55 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, jack, torvalds On Fri, Aug 29, 2025 at 12:07:38AM +0100, Al Viro wrote: > Currently do_lock_mount() has the target path switched to whatever > might be overmounting it. We _do_ want to have the parent > mount/mountpoint chosen on top of the overmounting pile; however, > the way it's done has unpleasant races - if umount propagation > removes the overmount while we'd been trying to set the environment > up, we might end up failing if our target path strays into that overmount > just before the overmount gets kicked out. > > Users of do_lock_mount() do not need the target path changed - they > have all information in res->{parent,mp}; only one place (in > do_move_mount()) currently uses the resulting path->mnt, and that value > is trivial to reconstruct by the original value of path->mnt + chosen > parent mount. 
> > Let's keep the target path unchanged; it avoids a bunch of subtle races > and it's not hard to do: > do > as mount_locked_reader > find the prospective parent mount/mountpoint dentry > grab references if it's not the original target > lock the prospective mountpoint dentry > take namespace_sem exclusive > if prospective parent/mountpoint would be different now > err = -EAGAIN > else if location has been unmounted > err = -ENOENT > else if mountpoint dentry is not allowed to be mounted on > err = -ENOENT > else if beneath and the top of the pile was the absolute root > err = -EINVAL > else > try to get struct mountpoint (by dentry), set > err to 0 on success and -ENO{MEM,ENT} on failure > if err != 0 > res->parent = ERR_PTR(err) > drop locks > else > res->parent = prospective parent > drop temporary references > while err == -EAGAIN > > A somewhat subtle part is that dropping temporary references is allowed. > Neither mounts nor dentries should be evicted by a thread that holds > namespace_sem. On success we are dropping those references under > namespace_sem, so we need to be sure that these are not the last > references remaining. However, on success we'd already verified (under > namespace_sem) that original target is still mounted and that mount > and dentry we are about to drop are still reachable from it via the > mount tree. That guarantees that we are not about to drop the last > remaining references. 
> > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > --- > fs/namespace.c | 126 ++++++++++++++++++++++++++----------------------- > 1 file changed, 68 insertions(+), 58 deletions(-) > > diff --git a/fs/namespace.c b/fs/namespace.c > index ebecb03972c5..b77d2df606a1 100644 > --- a/fs/namespace.c > +++ b/fs/namespace.c > @@ -2727,6 +2727,27 @@ static int attach_recursive_mnt(struct mount *source_mnt, > return err; > } > > +static inline struct mount *where_to_mount(const struct path *path, > + struct dentry **dentry, > + bool beneath) > +{ > + struct mount *m; > + > + if (unlikely(beneath)) { > + m = topmost_overmount(real_mount(path->mnt)); > + *dentry = m->mnt_mountpoint; > + return m->mnt_parent; No need for that else. This can just be: if (unlikely(beneath)) { m = topmost_overmount(real_mount(path->mnt)); *dentry = m->mnt_mountpoint; return m->mnt_parent; } m = __lookup_mnt(path->mnt, *dentry = path->dentry); if (unlikely(m)) { m = topmost_overmount(m); *dentry = m->mnt.mnt_root; return m; } return real_mount(path->mnt); > + } else { > + m = __lookup_mnt(path->mnt, *dentry = path->dentry); The assignment to *dentry during argument passing looks really weird. I would prefer if we didn't do that. > + if (unlikely(m)) { > + m = topmost_overmount(m); > + *dentry = m->mnt.mnt_root; > + return m; > + } > + return real_mount(path->mnt); > + } > +} > + > /** > * do_lock_mount - acquire environment for mounting > * @path: target path > @@ -2758,84 +2779,69 @@ static int attach_recursive_mnt(struct mount *source_mnt, > * case we also require the location to be at the root of a mount > * that has a parent (i.e. is not a root of some namespace). 
>   */
> -static void do_lock_mount(struct path *path, struct pinned_mountpoint *res, bool beneath)
> +static void do_lock_mount(const struct path *path,
> +			  struct pinned_mountpoint *res,
> +			  bool beneath)
>  {
> -	struct vfsmount *mnt = path->mnt;
> -	struct dentry *dentry;
> -	struct path under = {};
> -	int err = -ENOENT;
> +	int err;
>
>  	if (unlikely(beneath) && !path_mounted(path)) {
>  		res->parent = ERR_PTR(-EINVAL);
>  		return;
>  	}
>
> -	for (;;) {
> -		struct mount *m = real_mount(mnt);
> -
> -		if (beneath) {
> -			path_put(&under);
> -			read_seqlock_excl(&mount_lock);
> -			if (unlikely(!mnt_has_parent(m))) {
> -				read_sequnlock_excl(&mount_lock);
> -				res->parent = ERR_PTR(-EINVAL);
> -				return;
> +	do {
> +		struct dentry *dentry, *d;
> +		struct mount *m, *n;
> +
> +		scoped_guard(mount_locked_reader) {
> +			m = where_to_mount(path, &dentry, beneath);
> +			if (&m->mnt != path->mnt) {
> +				mntget(&m->mnt);
> +				dget(dentry);
>  			}
> -			under.mnt = mntget(&m->mnt_parent->mnt);
> -			under.dentry = dget(m->mnt_mountpoint);
> -			read_sequnlock_excl(&mount_lock);
> -			dentry = under.dentry;
> -		} else {
> -			dentry = path->dentry;
>  		}
>
>  		inode_lock(dentry->d_inode);
>  		namespace_lock();
>
> -		if (unlikely(cant_mount(dentry) || !is_mounted(mnt)))
> -			break;		// not to be mounted on
> +		// check if the chain of mounts (if any) has changed.
> +		scoped_guard(mount_locked_reader)
> +			n = where_to_mount(path, &d, beneath);
>
> -		if (beneath && unlikely(m->mnt_mountpoint != dentry ||
> -				        &m->mnt_parent->mnt != under.mnt)) {
> -			namespace_unlock();
> -			inode_unlock(dentry->d_inode);
> -			continue;	// got moved
> -		}
> +		if (unlikely(n != m || dentry != d))
> +			err = -EAGAIN;		// something moved, retry
> +		else if (unlikely(cant_mount(dentry) || !is_mounted(path->mnt)))
> +			err = -ENOENT;		// not to be mounted on
> +		else if (beneath && &m->mnt == path->mnt && !m->overmount)
> +			err = -EINVAL;
> +		else
> +			err = get_mountpoint(dentry, res);
>
> -		mnt = lookup_mnt(path);
> -		if (unlikely(mnt)) {
> +		if (unlikely(err)) {
> +			res->parent = ERR_PTR(err);
>  			namespace_unlock();
>  			inode_unlock(dentry->d_inode);
> -			path_put(path);
> -			path->mnt = mnt;
> -			path->dentry = dget(mnt->mnt_root);
> -			continue;	// got overmounted
> +		} else {
> +			res->parent = m;
>  		}
> -		err = get_mountpoint(dentry, res);
> -		if (err)
> -			break;
> -		if (beneath) {
> -			/*
> -			 * @under duplicates the references that will stay
> -			 * at least until namespace_unlock(), so the path_put()
> -			 * below is safe (and OK to do under namespace_lock -
> -			 * we are not dropping the final references here).
> -			 */
> -			path_put(&under);
> -			res->parent = real_mount(path->mnt)->mnt_parent;
> -			return;
> +		/*
> +		 * Drop the temporary references.  This is subtle - on success
> +		 * we are doing that under namespace_sem, which would normally
> +		 * be forbidden.  However, in that case we are guaranteed that
> +		 * refcounts won't reach zero, since we know that path->mnt
> +		 * is mounted and thus all mounts reachable from it are pinned

"is mounted and we hold the namespace semaphore and thus all mounts reachable [...]"

With these things fixed:

Reviewed-by: Christian Brauner <brauner@kernel.org>

Unless I forgot something this means I should've gone through the whole series.
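For readers following the pseudocode in the quoted commit message, here is a
minimal user-space sketch of the "sample, lock, recheck, retry" shape of the
loop. It is not the kernel code: struct topology, race_hook, lock_target() and
move_on_first_attempt() are hypothetical names made up for illustration, the
locks are only notional, and the only thing modelled is that a change between
the first sample and the recheck yields -EAGAIN and another pass.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the mount tree at the target location. */
struct topology {
	int top;	/* id of whatever currently sits on top of the pile */
	int mounted;	/* nonzero while the original target is still attached */
};

/* Called between the first sample and the recheck; lets a caller simulate
 * a concurrent change.  May be NULL. */
typedef void (*race_hook)(struct topology *t, int attempt);

/* Example hook: something gets moved during the first attempt only. */
static void move_on_first_attempt(struct topology *t, int attempt)
{
	if (attempt == 0)
		t->top = 7;
}

/* Returns 0 and stores the chosen id in *chosen, or a negative errno.
 * *attempts reports how many passes through the loop were needed. */
static int lock_target(struct topology *t, race_hook hook,
		       int *chosen, int *attempts)
{
	int err, n = 0;

	do {
		/* cheap sample (the patch does this as mount_locked_reader) */
		int sampled = t->top;

		if (hook)
			hook(t, n);	/* window in which things may move */
		n++;

		/* the blocking locks would be taken here, then we recheck */
		if (t->top != sampled)
			err = -EAGAIN;	/* something moved, retry */
		else if (!t->mounted)
			err = -ENOENT;	/* location has been unmounted */
		else {
			*chosen = sampled;
			err = 0;
		}
		/* on failure the (notional) locks are dropped before looping */
	} while (err == -EAGAIN);

	*attempts = n;
	return err;
}
```

The point of the shape is that the caller's target is never rewritten; only
the sampled value is compared again once the heavy locks are held.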
* [PATCH v2 36/63] constify check_mnt() 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (33 preceding siblings ...) 2025-08-28 23:07 ` [PATCH v2 35/63] do_lock_mount(): don't modify path Al Viro @ 2025-08-28 23:07 ` Al Viro 2025-08-28 23:07 ` [PATCH v2 37/63] do_mount_setattr(): constify path argument Al Viro ` (26 subsequent siblings) 61 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/namespace.c b/fs/namespace.c index b77d2df606a1..de894f96d9c2 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -1010,7 +1010,7 @@ static void unpin_mountpoint(struct pinned_mountpoint *m) } } -static inline int check_mnt(struct mount *mnt) +static inline int check_mnt(const struct mount *mnt) { return mnt->mnt_ns == current->nsproxy->mnt_ns; } -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v2 37/63] do_mount_setattr(): constify path argument 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (34 preceding siblings ...) 2025-08-28 23:07 ` [PATCH v2 36/63] constify check_mnt() Al Viro @ 2025-08-28 23:07 ` Al Viro 2025-08-28 23:07 ` [PATCH v2 38/63] do_set_group(): constify path arguments Al Viro ` (25 subsequent siblings) 61 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/namespace.c b/fs/namespace.c index de894f96d9c2..5766d6a3a279 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -4865,7 +4865,7 @@ static void mount_setattr_commit(struct mount_kattr *kattr, struct mount *mnt) touch_mnt_namespace(mnt->mnt_ns); } -static int do_mount_setattr(struct path *path, struct mount_kattr *kattr) +static int do_mount_setattr(const struct path *path, struct mount_kattr *kattr) { struct mount *mnt = real_mount(path->mnt); int err = 0; -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v2 38/63] do_set_group(): constify path arguments 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (35 preceding siblings ...) 2025-08-28 23:07 ` [PATCH v2 37/63] do_mount_setattr(): constify path argument Al Viro @ 2025-08-28 23:07 ` Al Viro 2025-08-28 23:07 ` [PATCH v2 39/63] drop_collected_paths(): constify arguments Al Viro ` (24 subsequent siblings) 61 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/namespace.c b/fs/namespace.c index 5766d6a3a279..e4ca76091bd7 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3359,7 +3359,7 @@ static inline int tree_contains_unbindable(struct mount *mnt) return 0; } -static int do_set_group(struct path *from_path, struct path *to_path) +static int do_set_group(const struct path *from_path, const struct path *to_path) { struct mount *from = real_mount(from_path->mnt); struct mount *to = real_mount(to_path->mnt); -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v2 39/63] drop_collected_paths(): constify arguments 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (36 preceding siblings ...) 2025-08-28 23:07 ` [PATCH v2 38/63] do_set_group(): constify path arguments Al Viro @ 2025-08-28 23:07 ` Al Viro 2025-08-28 23:07 ` [PATCH v2 40/63] collect_paths(): constify the return value Al Viro ` (23 subsequent siblings) 61 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds ... and use that to constify the pointers in callers Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 4 ++-- include/linux/mount.h | 2 +- kernel/audit_tree.c | 12 ++++++------ 3 files changed, 9 insertions(+), 9 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index e4ca76091bd7..61dfa899bd57 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2334,9 +2334,9 @@ struct path *collect_paths(const struct path *path, return res; } -void drop_collected_paths(struct path *paths, struct path *prealloc) +void drop_collected_paths(const struct path *paths, struct path *prealloc) { - for (struct path *p = paths; p->mnt; p++) + for (const struct path *p = paths; p->mnt; p++) path_put(p); if (paths != prealloc) kfree(paths); diff --git a/include/linux/mount.h b/include/linux/mount.h index 5f9c053b0897..c09032463b36 100644 --- a/include/linux/mount.h +++ b/include/linux/mount.h @@ -105,7 +105,7 @@ extern int may_umount(struct vfsmount *); int do_mount(const char *, const char __user *, const char *, unsigned long, void *); extern struct path *collect_paths(const struct path *, struct path *, unsigned); -extern void drop_collected_paths(struct path *, struct path *); +extern void drop_collected_paths(const struct path *, struct path *); extern void kern_unmount_array(struct vfsmount *mnt[], unsigned int num); extern int cifs_root_data(char **dev, 
char **opts); diff --git a/kernel/audit_tree.c b/kernel/audit_tree.c index b0eae2a3c895..32007edf0e55 100644 --- a/kernel/audit_tree.c +++ b/kernel/audit_tree.c @@ -678,7 +678,7 @@ void audit_trim_trees(void) struct audit_tree *tree; struct path path; struct audit_node *node; - struct path *paths; + const struct path *paths; struct path array[16]; int err; @@ -701,7 +701,7 @@ void audit_trim_trees(void) struct audit_chunk *chunk = find_chunk(node); /* this could be NULL if the watch is dying else where... */ node->index |= 1U<<31; - for (struct path *p = paths; p->dentry; p++) { + for (const struct path *p = paths; p->dentry; p++) { struct inode *inode = p->dentry->d_inode; if (inode_to_key(inode) == chunk->key) { node->index &= ~(1U<<31); @@ -740,9 +740,9 @@ void audit_put_tree(struct audit_tree *tree) put_tree(tree); } -static int tag_mounts(struct path *paths, struct audit_tree *tree) +static int tag_mounts(const struct path *paths, struct audit_tree *tree) { - for (struct path *p = paths; p->dentry; p++) { + for (const struct path *p = paths; p->dentry; p++) { int err = tag_chunk(p->dentry->d_inode, tree); if (err) return err; @@ -805,7 +805,7 @@ int audit_add_tree_rule(struct audit_krule *rule) struct audit_tree *seed = rule->tree, *tree; struct path path; struct path array[16]; - struct path *paths; + const struct path *paths; int err; rule->tree = NULL; @@ -877,7 +877,7 @@ int audit_tag_tree(char *old, char *new) int failed = 0; struct path path1, path2; struct path array[16]; - struct path *paths; + const struct path *paths; int err; err = kern_path(new, 0, &path2); -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v2 40/63] collect_paths(): constify the return value 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (37 preceding siblings ...) 2025-08-28 23:07 ` [PATCH v2 39/63] drop_collected_paths(): constify arguments Al Viro @ 2025-08-28 23:07 ` Al Viro 2025-08-28 23:07 ` [PATCH v2 41/63] do_move_mount(), vfs_move_mount(), do_move_mount_old(): constify struct path argument(s) Al Viro ` (22 subsequent siblings) 61 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds callers have no business modifying the paths they get Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 4 ++-- include/linux/mount.h | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 61dfa899bd57..43f46d9e84fe 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2300,7 +2300,7 @@ static inline bool extend_array(struct path **res, struct path **to_free, return p; } -struct path *collect_paths(const struct path *path, +const struct path *collect_paths(const struct path *path, struct path *prealloc, unsigned count) { struct mount *root = real_mount(path->mnt); @@ -2334,7 +2334,7 @@ struct path *collect_paths(const struct path *path, return res; } -void drop_collected_paths(const struct path *paths, struct path *prealloc) +void drop_collected_paths(const struct path *paths, const struct path *prealloc) { for (const struct path *p = paths; p->mnt; p++) path_put(p); diff --git a/include/linux/mount.h b/include/linux/mount.h index c09032463b36..18e4b97f8a98 100644 --- a/include/linux/mount.h +++ b/include/linux/mount.h @@ -104,8 +104,8 @@ extern int may_umount_tree(struct vfsmount *); extern int may_umount(struct vfsmount *); int do_mount(const char *, const char __user *, const char *, unsigned long, void *); -extern struct path 
*collect_paths(const struct path *, struct path *, unsigned); -extern void drop_collected_paths(const struct path *, struct path *); +extern const struct path *collect_paths(const struct path *, struct path *, unsigned); +extern void drop_collected_paths(const struct path *, const struct path *); extern void kern_unmount_array(struct vfsmount *mnt[], unsigned int num); extern int cifs_root_data(char **dev, char **opts); -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v2 41/63] do_move_mount(), vfs_move_mount(), do_move_mount_old(): constify struct path argument(s)
  2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro
  ` (38 preceding siblings ...)
  2025-08-28 23:07 ` [PATCH v2 40/63] collect_paths(): constify the return value Al Viro
@ 2025-08-28 23:07 ` Al Viro
  2025-08-28 23:07 ` [PATCH v2 42/63] mnt_warn_timestamp_expiry(): constify struct path argument Al Viro
  ` (21 subsequent siblings)
  61 siblings, 0 replies; 321+ messages in thread
From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: brauner, jack, torvalds

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/namespace.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 43f46d9e84fe..70ae769ecf11 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3572,8 +3572,9 @@ static inline bool may_use_mount(struct mount *mnt)
 	return check_anonymous_mnt(mnt);
 }
 
-static int do_move_mount(struct path *old_path,
-			 struct path *new_path, enum mnt_tree_flags_t flags)
+static int do_move_mount(const struct path *old_path,
+			 const struct path *new_path,
+			 enum mnt_tree_flags_t flags)
 {
 	struct mount *old = real_mount(old_path->mnt);
 	int err;
@@ -3645,7 +3646,7 @@ static int do_move_mount(struct path *old_path,
 	return attach_recursive_mnt(old, &mp);
 }
 
-static int do_move_mount_old(struct path *path, const char *old_name)
+static int do_move_mount_old(const struct path *path, const char *old_name)
 {
 	struct path old_path;
 	int err;
@@ -4475,7 +4476,8 @@ SYSCALL_DEFINE3(fsmount, int, fs_fd, unsigned int, flags,
 	return ret;
 }
 
-static inline int vfs_move_mount(struct path *from_path, struct path *to_path,
+static inline int vfs_move_mount(const struct path *from_path,
+				 const struct path *to_path,
 				 enum mnt_tree_flags_t mflags)
 {
 	int ret;
-- 
2.47.2
* [PATCH v2 42/63] mnt_warn_timestamp_expiry(): constify struct path argument 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (39 preceding siblings ...) 2025-08-28 23:07 ` [PATCH v2 41/63] do_move_mount(), vfs_move_mount(), do_move_mount_old(): constify struct path argument(s) Al Viro @ 2025-08-28 23:07 ` Al Viro 2025-08-28 23:07 ` [PATCH v2 43/63] do_new_mount{,_fc}(): " Al Viro ` (20 subsequent siblings) 61 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/fs/namespace.c b/fs/namespace.c index 70ae769ecf11..a7c840371a7f 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3230,7 +3230,8 @@ static void set_mount_attributes(struct mount *mnt, unsigned int mnt_flags) touch_mnt_namespace(mnt->mnt_ns); } -static void mnt_warn_timestamp_expiry(struct path *mountpoint, struct vfsmount *mnt) +static void mnt_warn_timestamp_expiry(const struct path *mountpoint, + struct vfsmount *mnt) { struct super_block *sb = mnt->mnt_sb; -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v2 43/63] do_new_mount{,_fc}(): constify struct path argument 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (40 preceding siblings ...) 2025-08-28 23:07 ` [PATCH v2 42/63] mnt_warn_timestamp_expiry(): constify struct path argument Al Viro @ 2025-08-28 23:07 ` Al Viro 2025-08-28 23:07 ` [PATCH v2 44/63] do_{loopback,change_type,remount,reconfigure_mnt}(): " Al Viro ` (19 subsequent siblings) 61 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index a7c840371a7f..8ff54e0da446 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3704,7 +3704,7 @@ static bool mount_too_revealing(const struct super_block *sb, int *new_mnt_flags * Create a new mount using a superblock configuration and request it * be added to the namespace tree. */ -static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint, +static int do_new_mount_fc(struct fs_context *fc, const struct path *mountpoint, unsigned int mnt_flags) { struct super_block *sb; @@ -3735,8 +3735,9 @@ static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint, * create a new mount for userspace and request it to be added into the * namespace's tree */ -static int do_new_mount(struct path *path, const char *fstype, int sb_flags, - int mnt_flags, const char *name, void *data) +static int do_new_mount(const struct path *path, const char *fstype, + int sb_flags, int mnt_flags, + const char *name, void *data) { struct file_system_type *type; struct fs_context *fc; -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v2 44/63] do_{loopback,change_type,remount,reconfigure_mnt}(): constify struct path argument 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (41 preceding siblings ...) 2025-08-28 23:07 ` [PATCH v2 43/63] do_new_mount{,_fc}(): " Al Viro @ 2025-08-28 23:07 ` Al Viro 2025-08-28 23:07 ` [PATCH v2 45/63] path_mount(): " Al Viro ` (18 subsequent siblings) 61 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 8ff54e0da446..6ae42f3a9f10 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2914,7 +2914,7 @@ static int flags_to_propagation_type(int ms_flags) /* * recursively change the type of the mountpoint. */ -static int do_change_type(struct path *path, int ms_flags) +static int do_change_type(const struct path *path, int ms_flags) { struct mount *m; struct mount *mnt = real_mount(path->mnt); @@ -3034,8 +3034,8 @@ static struct mount *__do_loopback(struct path *old_path, int recurse) /* * do loopback mount. */ -static int do_loopback(struct path *path, const char *old_name, - int recurse) +static int do_loopback(const struct path *path, const char *old_name, + int recurse) { struct path old_path __free(path_put) = {}; struct mount *mnt = NULL; @@ -3265,7 +3265,7 @@ static void mnt_warn_timestamp_expiry(const struct path *mountpoint, * superblock it refers to. This is triggered by specifying MS_REMOUNT|MS_BIND * to mount(2). 
*/ -static int do_reconfigure_mnt(struct path *path, unsigned int mnt_flags) +static int do_reconfigure_mnt(const struct path *path, unsigned int mnt_flags) { struct super_block *sb = path->mnt->mnt_sb; struct mount *mnt = real_mount(path->mnt); @@ -3302,7 +3302,7 @@ static int do_reconfigure_mnt(struct path *path, unsigned int mnt_flags) * If you've mounted a non-root directory somewhere and want to do remount * on it - tough luck. */ -static int do_remount(struct path *path, int ms_flags, int sb_flags, +static int do_remount(const struct path *path, int ms_flags, int sb_flags, int mnt_flags, void *data) { int err; -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v2 45/63] path_mount(): constify struct path argument 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (42 preceding siblings ...) 2025-08-28 23:07 ` [PATCH v2 44/63] do_{loopback,change_type,remount,reconfigure_mnt}(): " Al Viro @ 2025-08-28 23:07 ` Al Viro 2025-08-28 23:07 ` [PATCH v2 46/63] may_copy_tree(), __do_loopback(): " Al Viro ` (17 subsequent siblings) 61 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds now it finally can be done. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/internal.h | 2 +- fs/namespace.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/internal.h b/fs/internal.h index 38e8aab27bbd..fe88563b4822 100644 --- a/fs/internal.h +++ b/fs/internal.h @@ -84,7 +84,7 @@ void mnt_put_write_access_file(struct file *file); extern void dissolve_on_fput(struct vfsmount *); extern bool may_mount(void); -int path_mount(const char *dev_name, struct path *path, +int path_mount(const char *dev_name, const struct path *path, const char *type_page, unsigned long flags, void *data_page); int path_umount(struct path *path, int flags); diff --git a/fs/namespace.c b/fs/namespace.c index 6ae42f3a9f10..34a71d5cdf88 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -4018,7 +4018,7 @@ static char *copy_mount_string(const void __user *data) * Therefore, if this magic number is present, it carries no information * and must be discarded. */ -int path_mount(const char *dev_name, struct path *path, +int path_mount(const char *dev_name, const struct path *path, const char *type_page, unsigned long flags, void *data_page) { unsigned int mnt_flags = 0, sb_flags; -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v2 46/63] may_copy_tree(), __do_loopback(): constify struct path argument 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (43 preceding siblings ...) 2025-08-28 23:07 ` [PATCH v2 45/63] path_mount(): " Al Viro @ 2025-08-28 23:07 ` Al Viro 2025-08-28 23:07 ` [PATCH v2 47/63] path_umount(): " Al Viro ` (16 subsequent siblings) 61 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 34a71d5cdf88..b15632b70223 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2990,7 +2990,7 @@ static int do_change_type(const struct path *path, int ms_flags) * * Returns true if the mount tree can be copied, false otherwise. */ -static inline bool may_copy_tree(struct path *path) +static inline bool may_copy_tree(const struct path *path) { struct mount *mnt = real_mount(path->mnt); const struct dentry_operations *d_op; @@ -3012,7 +3012,7 @@ static inline bool may_copy_tree(struct path *path) } -static struct mount *__do_loopback(struct path *old_path, int recurse) +static struct mount *__do_loopback(const struct path *old_path, int recurse) { struct mount *old = real_mount(old_path->mnt); -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v2 47/63] path_umount(): constify struct path argument 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (44 preceding siblings ...) 2025-08-28 23:07 ` [PATCH v2 46/63] may_copy_tree(), __do_loopback(): " Al Viro @ 2025-08-28 23:07 ` Al Viro 2025-08-28 23:07 ` [PATCH v2 48/63] constify can_move_mount_beneath() arguments Al Viro ` (15 subsequent siblings) 61 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/internal.h | 2 +- fs/namespace.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/internal.h b/fs/internal.h index fe88563b4822..549e6bd453b0 100644 --- a/fs/internal.h +++ b/fs/internal.h @@ -86,7 +86,7 @@ extern bool may_mount(void); int path_mount(const char *dev_name, const struct path *path, const char *type_page, unsigned long flags, void *data_page); -int path_umount(struct path *path, int flags); +int path_umount(const struct path *path, int flags); int show_path(struct seq_file *m, struct dentry *root); diff --git a/fs/namespace.c b/fs/namespace.c index b15632b70223..a14cb2cabc1a 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2084,7 +2084,7 @@ static int can_umount(const struct path *path, int flags) } // caller is responsible for flags being sane -int path_umount(struct path *path, int flags) +int path_umount(const struct path *path, int flags) { struct mount *mnt = real_mount(path->mnt); int ret; -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v2 48/63] constify can_move_mount_beneath() arguments 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (45 preceding siblings ...) 2025-08-28 23:07 ` [PATCH v2 47/63] path_umount(): " Al Viro @ 2025-08-28 23:07 ` Al Viro 2025-08-28 23:07 ` [PATCH v2 49/63] do_move_mount_old(): use __free(path_put) Al Viro ` (14 subsequent siblings) 61 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index a14cb2cabc1a..daca5e3bec38 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3472,8 +3472,8 @@ static bool mount_is_ancestor(const struct mount *p1, const struct mount *p2) * Context: This function expects namespace_lock() to be held. * Return: On success 0, and on error a negative error code is returned. */ -static int can_move_mount_beneath(struct mount *mnt_from, - struct mount *mnt_to, +static int can_move_mount_beneath(const struct mount *mnt_from, + const struct mount *mnt_to, const struct mountpoint *mp) { struct mount *parent_mnt_to = mnt_to->mnt_parent; -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v2 49/63] do_move_mount_old(): use __free(path_put) 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (46 preceding siblings ...) 2025-08-28 23:07 ` [PATCH v2 48/63] constify can_move_mount_beneath() arguments Al Viro @ 2025-08-28 23:07 ` Al Viro 2025-08-28 23:07 ` [PATCH v2 50/63] do_mount(): " Al Viro ` (13 subsequent siblings) 61 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index daca5e3bec38..a57598ec422a 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3649,7 +3649,7 @@ static int do_move_mount(const struct path *old_path, static int do_move_mount_old(const struct path *path, const char *old_name) { - struct path old_path; + struct path old_path __free(path_put) = {}; int err; if (!old_name || !*old_name) @@ -3659,9 +3659,7 @@ static int do_move_mount_old(const struct path *path, const char *old_name) if (err) return err; - err = do_move_mount(&old_path, path, 0); - path_put(&old_path); - return err; + return do_move_mount(&old_path, path, 0); } /* -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v2 50/63] do_mount(): use __free(path_put) 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (47 preceding siblings ...) 2025-08-28 23:07 ` [PATCH v2 49/63] do_move_mount_old(): use __free(path_put) Al Viro @ 2025-08-28 23:07 ` Al Viro 2025-08-28 23:07 ` [PATCH v2 51/63] umount_tree(): take all victims out of propagation graph at once Al Viro ` (12 subsequent siblings) 61 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index a57598ec422a..b290e2b3bcfb 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -4098,15 +4098,13 @@ int path_mount(const char *dev_name, const struct path *path, int do_mount(const char *dev_name, const char __user *dir_name, const char *type_page, unsigned long flags, void *data_page) { - struct path path; + struct path path __free(path_put) = {}; int ret; ret = user_path_at(AT_FDCWD, dir_name, LOOKUP_FOLLOW, &path); if (ret) return ret; - ret = path_mount(dev_name, &path, type_page, flags, data_page); - path_put(&path); - return ret; + return path_mount(dev_name, &path, type_page, flags, data_page); } static struct ucounts *inc_mnt_namespaces(struct user_namespace *ns) -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
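The two __free(path_put) conversions above rely on scope-based cleanup. As a
rough user-space illustration (GCC/Clang only), the sketch below uses the
compiler's cleanup variable attribute, which is the mechanism this style of
annotation builds on. struct res, res_get(), res_put(), use(), do_op() and
the auto_put macro are hypothetical stand-ins, not kernel interfaces; the
zero initializer mirrors the "= {}" in the patches so that the early-return
path hands the cleanup function an object it can safely ignore.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a reference-holding object such as struct path. */
struct res {
	int *live;	/* counter of outstanding references */
	int held;	/* nonzero once a reference has been taken */
};

/* Drop the reference, if any; a zeroed struct res is a no-op. */
static void res_put(struct res *r)
{
	if (r->held) {
		r->held = 0;
		(*r->live)--;
	}
}

/* Run res_put() on the variable when it goes out of scope. */
#define auto_put __attribute__((cleanup(res_put)))

/* Take a reference, or fail without touching *r. */
static int res_get(struct res *r, int *live, int fail)
{
	if (fail)
		return -1;
	r->live = live;
	r->held = 1;
	(*live)++;
	return 0;
}

static int use(const struct res *r)
{
	return r->held ? 42 : -2;
}

static int do_op(int *live, int fail)
{
	struct res r auto_put = {0};
	int err = res_get(&r, live, fail);

	if (err)
		return err;	/* cleanup sees the zeroed r and does nothing */
	/* the return value is computed first, then res_put(&r) runs */
	return use(&r);
}
```

This is why the converted functions can return the result of the call
directly instead of stashing it in a local around an explicit put.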
* [PATCH v2 51/63] umount_tree(): take all victims out of propagation graph at once 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (48 preceding siblings ...) 2025-08-28 23:07 ` [PATCH v2 50/63] do_mount(): " Al Viro @ 2025-08-28 23:07 ` Al Viro 2025-09-01 11:50 ` Christian Brauner 2025-08-28 23:07 ` [PATCH v2 52/63] ecryptfs: get rid of pointless mount references in ecryptfs dentries Al Viro ` (11 subsequent siblings) 61 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds For each removed mount we need to calculate where the slaves will end up. To avoid duplicating that work, do it for all mounts to be removed at once, taking the mounts themselves out of propagation graph as we go, then do all transfers; the duplicate work on finding destinations is avoided since if we run into a mount that already had destination found, we don't need to trace the rest of the way. That's guaranteed O(removed mounts) for finding destinations and removing from propagation graph and O(surviving mounts that have master removed) for transfers. 
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 3 ++- fs/pnode.c | 67 +++++++++++++++++++++++++++++++++++++++----------- fs/pnode.h | 1 + 3 files changed, 55 insertions(+), 16 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index b290e2b3bcfb..de9a88f45dc1 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -1846,6 +1846,8 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how) if (how & UMOUNT_PROPAGATE) propagate_umount(&tmp_list); + bulk_make_private(&tmp_list); + while (!list_empty(&tmp_list)) { struct mnt_namespace *ns; bool disconnect; @@ -1870,7 +1872,6 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how) umount_mnt(p); } } - change_mnt_propagation(p, MS_PRIVATE); if (disconnect) hlist_add_head(&p->mnt_umount, &unmounted); diff --git a/fs/pnode.c b/fs/pnode.c index edaf9d9d0eaf..5d91c3e58d2a 100644 --- a/fs/pnode.c +++ b/fs/pnode.c @@ -71,19 +71,6 @@ static inline bool will_be_unmounted(struct mount *m) return m->mnt.mnt_flags & MNT_UMOUNT; } -static struct mount *propagation_source(struct mount *mnt) -{ - do { - struct mount *m; - for (m = next_peer(mnt); m != mnt; m = next_peer(m)) { - if (!will_be_unmounted(m)) - return m; - } - mnt = mnt->mnt_master; - } while (mnt && will_be_unmounted(mnt)); - return mnt; -} - static void transfer_propagation(struct mount *mnt, struct mount *to) { struct hlist_node *p = NULL, *n; @@ -112,11 +99,10 @@ void change_mnt_propagation(struct mount *mnt, int type) return; } if (IS_MNT_SHARED(mnt)) { - if (type == MS_SLAVE || !hlist_empty(&mnt->mnt_slave_list)) - m = propagation_source(mnt); if (list_empty(&mnt->mnt_share)) { mnt_release_group_id(mnt); } else { + m = next_peer(mnt); list_del_init(&mnt->mnt_share); mnt->mnt_group_id = 0; } @@ -137,6 +123,57 @@ void change_mnt_propagation(struct mount *mnt, int type) } } +static struct mount *trace_transfers(struct mount *m) +{ + while (1) { + struct mount *next = next_peer(m); + + if (next != m) { + 
list_del_init(&m->mnt_share); + m->mnt_group_id = 0; + m->mnt_master = next; + } else { + if (IS_MNT_SHARED(m)) + mnt_release_group_id(m); + next = m->mnt_master; + } + hlist_del_init(&m->mnt_slave); + CLEAR_MNT_SHARED(m); + SET_MNT_MARK(m); + + if (!next || !will_be_unmounted(next)) + return next; + if (IS_MNT_MARKED(next)) + return next->mnt_master; + m = next; + } +} + +static void set_destinations(struct mount *m, struct mount *master) +{ + struct mount *next; + + while ((next = m->mnt_master) != master) { + m->mnt_master = master; + m = next; + } +} + +void bulk_make_private(struct list_head *set) +{ + struct mount *m; + + list_for_each_entry(m, set, mnt_list) + if (!IS_MNT_MARKED(m)) + set_destinations(m, trace_transfers(m)); + + list_for_each_entry(m, set, mnt_list) { + transfer_propagation(m, m->mnt_master); + m->mnt_master = NULL; + CLEAR_MNT_MARK(m); + } +} + static struct mount *__propagation_next(struct mount *m, struct mount *origin) { diff --git a/fs/pnode.h b/fs/pnode.h index 00ab153e3e9d..b029db225f33 100644 --- a/fs/pnode.h +++ b/fs/pnode.h @@ -42,6 +42,7 @@ static inline bool peers(const struct mount *m1, const struct mount *m2) } void change_mnt_propagation(struct mount *, int); +void bulk_make_private(struct list_head *); int propagate_mnt(struct mount *, struct mountpoint *, struct mount *, struct hlist_head *); void propagate_umount(struct list_head *); -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* Re: [PATCH v2 51/63] umount_tree(): take all victims out of propagation graph at once
  2025-08-28 23:07 ` [PATCH v2 51/63] umount_tree(): take all victims out of propagation graph at once Al Viro
@ 2025-09-01 11:50 ` Christian Brauner
  0 siblings, 0 replies; 321+ messages in thread
From: Christian Brauner @ 2025-09-01 11:50 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, jack, torvalds

On Fri, Aug 29, 2025 at 12:07:54AM +0100, Al Viro wrote:
> For each removed mount we need to calculate where the slaves will end up.
> To avoid duplicating that work, do it for all mounts to be removed
> at once, taking the mounts themselves out of propagation graph as
> we go, then do all transfers; the duplicate work on finding destinations
> is avoided since if we run into a mount that already had destination found,
> we don't need to trace the rest of the way. That's guaranteed
> O(removed mounts) for finding destinations and removing from propagation
> graph and O(surviving mounts that have master removed) for transfers.
>
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---

Reviewed-by: Christian Brauner <brauner@kernel.org>

^ permalink raw reply [flat|nested] 321+ messages in thread
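For anyone trying to follow the trace-then-fix-up idea in 51/63, a stripped-down userspace model may help. This is only a sketch under loud assumptions: peer groups, slave lists and group IDs are ignored entirely, and struct toy_node, toy_trace(), toy_set_dest() and toy_resolve_all() are hypothetical names made up for illustration, not anything from fs/pnode.c. What it does show is why the destination search is linear in the number of doomed nodes: each one is marked the first time it is walked, and a later walk that runs into a marked node simply reuses the destination already stored there.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct mount: a master link and two flags (hypothetical). */
struct toy_node {
	struct toy_node *master;	/* NULL when there is no master */
	bool doomed;			/* plays the role of will_be_unmounted() */
	bool marked;			/* plays the role of IS_MNT_MARKED() */
};

/*
 * Walk up from @n through doomed nodes that have not been visited yet,
 * marking each one.  Returns the destination: the first surviving
 * master, NULL, or the destination already stored in a marked node.
 */
static struct toy_node *toy_trace(struct toy_node *n)
{
	for (;;) {
		struct toy_node *next = n->master;

		n->marked = true;
		if (!next || !next->doomed)
			return next;
		if (next->marked)
			return next->master;	/* already resolved earlier */
		n = next;
	}
}

/* Point every node on the chain just walked directly at @dest. */
static void toy_set_dest(struct toy_node *n, struct toy_node *dest)
{
	struct toy_node *next;

	while ((next = n->master) != dest) {
		n->master = dest;
		n = next;
	}
}

/* First pass of the bulk operation: resolve destinations for the whole set. */
static void toy_resolve_all(struct toy_node **set, size_t count)
{
	for (size_t i = 0; i < count; i++)
		if (!set[i]->marked)
			toy_set_dest(set[i], toy_trace(set[i]));
}
```

With a chain a -> b -> c -> s where only s survives, processing b first resolves b and c; processing a afterwards stops as soon as it hits the marked b, so no part of the chain is walked twice.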
* [PATCH v2 52/63] ecryptfs: get rid of pointless mount references in ecryptfs dentries 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (49 preceding siblings ...) 2025-08-28 23:07 ` [PATCH v2 51/63] umount_tree(): take all victims out of propagation graph at once Al Viro @ 2025-08-28 23:07 ` Al Viro 2025-08-28 23:07 ` [PATCH v2 53/63] fs/namespace.c: sanitize descriptions for {__,}lookup_mnt() Al Viro ` (10 subsequent siblings) 61 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds ->lower_path.mnt has the same value for all dentries on given ecryptfs instance and if somebody goes for mountpoint-crossing variant where that would not be true, we can deal with that when it happens (and _not_ with duplicating these reference into each dentry). As it is, we are better off just sticking a reference into ecryptfs-private part of superblock and keeping it pinned until ->kill_sb(). That way we can stick a reference to underlying dentry right into ->d_fsdata of ecryptfs one, getting rid of indirection through struct ecryptfs_dentry_info, along with the entire struct ecryptfs_dentry_info machinery. 
Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/ecryptfs/dentry.c | 14 +------------- fs/ecryptfs/ecryptfs_kernel.h | 27 +++++++++++---------------- fs/ecryptfs/file.c | 15 +++++++-------- fs/ecryptfs/inode.c | 19 +++++-------------- fs/ecryptfs/main.c | 24 ++++++------------------ 5 files changed, 30 insertions(+), 69 deletions(-) diff --git a/fs/ecryptfs/dentry.c b/fs/ecryptfs/dentry.c index 1dfd5b81d831..6648a924e31a 100644 --- a/fs/ecryptfs/dentry.c +++ b/fs/ecryptfs/dentry.c @@ -59,14 +59,6 @@ static int ecryptfs_d_revalidate(struct inode *dir, const struct qstr *name, return rc; } -struct kmem_cache *ecryptfs_dentry_info_cache; - -static void ecryptfs_dentry_free_rcu(struct rcu_head *head) -{ - kmem_cache_free(ecryptfs_dentry_info_cache, - container_of(head, struct ecryptfs_dentry_info, rcu)); -} - /** * ecryptfs_d_release * @dentry: The ecryptfs dentry @@ -75,11 +67,7 @@ static void ecryptfs_dentry_free_rcu(struct rcu_head *head) */ static void ecryptfs_d_release(struct dentry *dentry) { - struct ecryptfs_dentry_info *p = dentry->d_fsdata; - if (p) { - path_put(&p->lower_path); - call_rcu(&p->rcu, ecryptfs_dentry_free_rcu); - } + dput(dentry->d_fsdata); } const struct dentry_operations ecryptfs_dops = { diff --git a/fs/ecryptfs/ecryptfs_kernel.h b/fs/ecryptfs/ecryptfs_kernel.h index 1f562e75d0e4..9e6ab0b41337 100644 --- a/fs/ecryptfs/ecryptfs_kernel.h +++ b/fs/ecryptfs/ecryptfs_kernel.h @@ -258,13 +258,6 @@ struct ecryptfs_inode_info { struct ecryptfs_crypt_stat crypt_stat; }; -/* dentry private data. Each dentry must keep track of a lower - * vfsmount too. */ -struct ecryptfs_dentry_info { - struct path lower_path; - struct rcu_head rcu; -}; - /** * ecryptfs_global_auth_tok - A key used to encrypt all new files under the mountpoint * @flags: Status flags @@ -348,6 +341,7 @@ struct ecryptfs_mount_crypt_stat { /* superblock private data. 
*/ struct ecryptfs_sb_info { struct super_block *wsi_sb; + struct vfsmount *lower_mnt; struct ecryptfs_mount_crypt_stat mount_crypt_stat; }; @@ -494,22 +488,25 @@ ecryptfs_set_superblock_lower(struct super_block *sb, } static inline void -ecryptfs_set_dentry_private(struct dentry *dentry, - struct ecryptfs_dentry_info *dentry_info) +ecryptfs_set_dentry_lower(struct dentry *dentry, + struct dentry *lower_dentry) { - dentry->d_fsdata = dentry_info; + dentry->d_fsdata = lower_dentry; } static inline struct dentry * ecryptfs_dentry_to_lower(struct dentry *dentry) { - return ((struct ecryptfs_dentry_info *)dentry->d_fsdata)->lower_path.dentry; + return dentry->d_fsdata; } -static inline const struct path * -ecryptfs_dentry_to_lower_path(struct dentry *dentry) +static inline struct path +ecryptfs_lower_path(struct dentry *dentry) { - return &((struct ecryptfs_dentry_info *)dentry->d_fsdata)->lower_path; + return (struct path){ + .mnt = ecryptfs_superblock_to_private(dentry->d_sb)->lower_mnt, + .dentry = ecryptfs_dentry_to_lower(dentry) + }; } #define ecryptfs_printk(type, fmt, arg...) 
\ @@ -532,7 +529,6 @@ extern unsigned int ecryptfs_number_of_users; extern struct kmem_cache *ecryptfs_auth_tok_list_item_cache; extern struct kmem_cache *ecryptfs_file_info_cache; -extern struct kmem_cache *ecryptfs_dentry_info_cache; extern struct kmem_cache *ecryptfs_inode_info_cache; extern struct kmem_cache *ecryptfs_sb_info_cache; extern struct kmem_cache *ecryptfs_header_cache; @@ -557,7 +553,6 @@ int ecryptfs_encrypt_and_encode_filename( size_t *encoded_name_size, struct ecryptfs_mount_crypt_stat *mount_crypt_stat, const char *name, size_t name_size); -struct dentry *ecryptfs_lower_dentry(struct dentry *this_dentry); void ecryptfs_dump_hex(char *data, int bytes); int virt_to_scatterlist(const void *addr, int size, struct scatterlist *sg, int sg_size); diff --git a/fs/ecryptfs/file.c b/fs/ecryptfs/file.c index 5f8f96da09fe..7929411837cf 100644 --- a/fs/ecryptfs/file.c +++ b/fs/ecryptfs/file.c @@ -33,13 +33,12 @@ static ssize_t ecryptfs_read_update_atime(struct kiocb *iocb, struct iov_iter *to) { ssize_t rc; - const struct path *path; struct file *file = iocb->ki_filp; rc = generic_file_read_iter(iocb, to); if (rc >= 0) { - path = ecryptfs_dentry_to_lower_path(file->f_path.dentry); - touch_atime(path); + struct path path = ecryptfs_lower_path(file->f_path.dentry); + touch_atime(&path); } return rc; } @@ -59,12 +58,11 @@ static ssize_t ecryptfs_splice_read_update_atime(struct file *in, loff_t *ppos, size_t len, unsigned int flags) { ssize_t rc; - const struct path *path; rc = filemap_splice_read(in, ppos, pipe, len, flags); if (rc >= 0) { - path = ecryptfs_dentry_to_lower_path(in->f_path.dentry); - touch_atime(path); + struct path path = ecryptfs_lower_path(in->f_path.dentry); + touch_atime(&path); } return rc; } @@ -283,6 +281,7 @@ static int ecryptfs_dir_open(struct inode *inode, struct file *file) * ecryptfs_lookup() */ struct ecryptfs_file_info *file_info; struct file *lower_file; + struct path path; /* Released in ecryptfs_release or end of function if 
failure */ file_info = kmem_cache_zalloc(ecryptfs_file_info_cache, GFP_KERNEL); @@ -292,8 +291,8 @@ static int ecryptfs_dir_open(struct inode *inode, struct file *file) "Error attempting to allocate memory\n"); return -ENOMEM; } - lower_file = dentry_open(ecryptfs_dentry_to_lower_path(ecryptfs_dentry), - file->f_flags, current_cred()); + path = ecryptfs_lower_path(ecryptfs_dentry); + lower_file = dentry_open(&path, file->f_flags, current_cred()); if (IS_ERR(lower_file)) { printk(KERN_ERR "%s: Error attempting to initialize " "the lower file for the dentry with name " diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c index 72fbe1316ab8..d2b262dc485d 100644 --- a/fs/ecryptfs/inode.c +++ b/fs/ecryptfs/inode.c @@ -327,24 +327,15 @@ static int ecryptfs_i_size_read(struct dentry *dentry, struct inode *inode) static struct dentry *ecryptfs_lookup_interpose(struct dentry *dentry, struct dentry *lower_dentry) { - const struct path *path = ecryptfs_dentry_to_lower_path(dentry->d_parent); + struct dentry *lower_parent = ecryptfs_dentry_to_lower(dentry->d_parent); struct inode *inode, *lower_inode; - struct ecryptfs_dentry_info *dentry_info; int rc = 0; - dentry_info = kmem_cache_alloc(ecryptfs_dentry_info_cache, GFP_KERNEL); - if (!dentry_info) { - dput(lower_dentry); - return ERR_PTR(-ENOMEM); - } - fsstack_copy_attr_atime(d_inode(dentry->d_parent), - d_inode(path->dentry)); + d_inode(lower_parent)); BUG_ON(!d_count(lower_dentry)); - ecryptfs_set_dentry_private(dentry, dentry_info); - dentry_info->lower_path.mnt = mntget(path->mnt); - dentry_info->lower_path.dentry = lower_dentry; + ecryptfs_set_dentry_lower(dentry, lower_dentry); /* * negative dentry can go positive under us here - its parent is not @@ -1022,10 +1013,10 @@ static int ecryptfs_getattr(struct mnt_idmap *idmap, { struct dentry *dentry = path->dentry; struct kstat lower_stat; + struct path lower_path = ecryptfs_lower_path(dentry); int rc; - rc = vfs_getattr_nosec(ecryptfs_dentry_to_lower_path(dentry), - 
&lower_stat, request_mask, flags); + rc = vfs_getattr_nosec(&lower_path, &lower_stat, request_mask, flags); if (!rc) { fsstack_copy_attr_all(d_inode(dentry), ecryptfs_inode_to_lower(d_inode(dentry))); diff --git a/fs/ecryptfs/main.c b/fs/ecryptfs/main.c index eab1beb846d3..2afbcbbd9546 100644 --- a/fs/ecryptfs/main.c +++ b/fs/ecryptfs/main.c @@ -106,15 +106,14 @@ static int ecryptfs_init_lower_file(struct dentry *dentry, struct file **lower_file) { const struct cred *cred = current_cred(); - const struct path *path = ecryptfs_dentry_to_lower_path(dentry); + struct path path = ecryptfs_lower_path(dentry); int rc; - rc = ecryptfs_privileged_open(lower_file, path->dentry, path->mnt, - cred); + rc = ecryptfs_privileged_open(lower_file, path.dentry, path.mnt, cred); if (rc) { printk(KERN_ERR "Error opening lower file " "for lower_dentry [0x%p] and lower_mnt [0x%p]; " - "rc = [%d]\n", path->dentry, path->mnt, rc); + "rc = [%d]\n", path.dentry, path.mnt, rc); (*lower_file) = NULL; } return rc; @@ -437,7 +436,6 @@ static int ecryptfs_get_tree(struct fs_context *fc) struct ecryptfs_fs_context *ctx = fc->fs_private; struct ecryptfs_sb_info *sbi = fc->s_fs_info; struct ecryptfs_mount_crypt_stat *mount_crypt_stat; - struct ecryptfs_dentry_info *root_info; const char *err = "Getting sb failed"; struct inode *inode; struct path path; @@ -543,14 +541,8 @@ static int ecryptfs_get_tree(struct fs_context *fc) goto out_free; } - rc = -ENOMEM; - root_info = kmem_cache_zalloc(ecryptfs_dentry_info_cache, GFP_KERNEL); - if (!root_info) - goto out_free; - - /* ->kill_sb() will take care of root_info */ - ecryptfs_set_dentry_private(s->s_root, root_info); - root_info->lower_path = path; + ecryptfs_set_dentry_lower(s->s_root, path.dentry); + sbi->lower_mnt = path.mnt; s->s_flags |= SB_ACTIVE; fc->root = dget(s->s_root); @@ -580,6 +572,7 @@ static void ecryptfs_kill_block_super(struct super_block *sb) kill_anon_super(sb); if (!sb_info) return; + mntput(sb_info->lower_mnt); 
ecryptfs_destroy_mount_crypt_stat(&sb_info->mount_crypt_stat); kmem_cache_free(ecryptfs_sb_info_cache, sb_info); } @@ -667,11 +660,6 @@ static struct ecryptfs_cache_info { .name = "ecryptfs_file_cache", .size = sizeof(struct ecryptfs_file_info), }, - { - .cache = &ecryptfs_dentry_info_cache, - .name = "ecryptfs_dentry_info_cache", - .size = sizeof(struct ecryptfs_dentry_info), - }, { .cache = &ecryptfs_inode_info_cache, .name = "ecryptfs_inode_cache", -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
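The shape of the ecryptfs change in 52/63 can be illustrated with a small userspace model. This is a sketch only; the toy_* types below are hypothetical stand-ins, not the kernel structures, and no reference counting is modelled. The point it tries to show is that once the lower mount is stored once per superblock, a per-dentry path no longer needs its own allocation: it can be assembled on demand and returned by value.

```c
#include <stddef.h>

struct toy_mnt { int id; };

/* One lower mount per filesystem instance (hypothetical stand-in). */
struct toy_sb { struct toy_mnt *lower_mnt; };

/* Each upper dentry keeps only a pointer to its lower dentry. */
struct toy_dentry {
	struct toy_sb *sb;
	void *fsdata;		/* lower dentry, as with ->d_fsdata */
};

struct toy_path {
	struct toy_mnt *mnt;
	struct toy_dentry *dentry;
};

/* Build the lower path on the fly instead of storing one per dentry. */
static inline struct toy_path toy_lower_path(const struct toy_dentry *d)
{
	return (struct toy_path){
		.mnt = d->sb->lower_mnt,
		.dentry = d->fsdata,
	};
}
```

Callers that used to receive a pointer into per-dentry storage now hold a local struct and pass its address, which is the pattern visible in the touch_atime() and dentry_open() call sites of the patch.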
* [PATCH v2 53/63] fs/namespace.c: sanitize descriptions for {__,}lookup_mnt() 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (50 preceding siblings ...) 2025-08-28 23:07 ` [PATCH v2 52/63] ecryptfs: get rid of pointless mount references in ecryptfs dentries Al Viro @ 2025-08-28 23:07 ` Al Viro 2025-08-28 23:07 ` [PATCH v2 54/63] open_detached_copy(): don't bother with mount_lock_hash() Al Viro ` (9 subsequent siblings) 61 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Comments regarding "shadow mounts" were stale - no such thing anymore. Document the locking requirements for __lookup_mnt(). Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 41 ++++++++++++----------------------------- 1 file changed, 12 insertions(+), 29 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index de9a88f45dc1..2e35f5eb4f81 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -825,24 +825,16 @@ static bool legitimize_mnt(struct vfsmount *bastard, unsigned seq) } /** - * __lookup_mnt - find first child mount + * __lookup_mnt - mount hash lookup * @mnt: parent mount - * @dentry: mountpoint + * @dentry: dentry of mountpoint * - * If @mnt has a child mount @c mounted @dentry find and return it. + * If @mnt has a child mount @c mounted on @dentry find and return it. + * Caller must either hold the spinlock component of @mount_lock or + * hold rcu_read_lock(), sample the seqcount component before the call + * and recheck it afterwards. * - * Note that the child mount @c need not be unique. There are cases - * where shadow mounts are created. For example, during mount - * propagation when a source mount @mnt whose root got overmounted by a - * mount @o after path lookup but before @namespace_sem could be - * acquired gets copied and propagated. 
So @mnt gets copied including - * @o. When @mnt is propagated to a destination mount @d that already - * has another mount @n mounted at the same mountpoint then the source - * mount @mnt will be tucked beneath @n, i.e., @n will be mounted on - * @mnt and @mnt mounted on @d. Now both @n and @o are mounted at @mnt - * on @dentry. - * - * Return: The first child of @mnt mounted @dentry or NULL. + * Return: The child of @mnt mounted on @dentry or %NULL. */ struct mount *__lookup_mnt(struct vfsmount *mnt, struct dentry *dentry) { @@ -855,21 +847,12 @@ struct mount *__lookup_mnt(struct vfsmount *mnt, struct dentry *dentry) return NULL; } -/* - * lookup_mnt - Return the first child mount mounted at path - * - * "First" means first mounted chronologically. If you create the - * following mounts: - * - * mount /dev/sda1 /mnt - * mount /dev/sda2 /mnt - * mount /dev/sda3 /mnt - * - * Then lookup_mnt() on the base /mnt dentry in the root mount will - * return successively the root dentry and vfsmount of /dev/sda1, then - * /dev/sda2, then /dev/sda3, then NULL. +/** + * lookup_mnt - Return the child mount mounted at given location + * @path: location in the namespace * - * lookup_mnt takes a reference to the found vfsmount. + * Acquires and returns a new reference to mount at given location + * or %NULL if nothing is mounted there. */ struct vfsmount *lookup_mnt(const struct path *path) { -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v2 54/63] open_detached_copy(): don't bother with mount_lock_hash() 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (51 preceding siblings ...) 2025-08-28 23:07 ` [PATCH v2 53/63] fs/namespace.c: sanitize descriptions for {__,}lookup_mnt() Al Viro @ 2025-08-28 23:07 ` Al Viro 2025-09-01 11:29 ` Christian Brauner 2025-08-28 23:07 ` [PATCH v2 55/63] open_detached_copy(): separate creation of namespace into helper Al Viro ` (8 subsequent siblings) 61 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds we are holding namespace_sem and a reference to root of tree; iterating through that tree does not need mount_lock. Neither does the insertion into the rbtree of new namespace or incrementing the mount count of that namespace. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 2e35f5eb4f81..425c33377770 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3086,14 +3086,12 @@ static struct file *open_detached_copy(struct path *path, bool recursive) return ERR_CAST(mnt); } - lock_mount_hash(); for (p = mnt; p; p = next_mnt(p, mnt)) { mnt_add_to_ns(ns, p); ns->nr_mounts++; } ns->root = mnt; mntget(&mnt->mnt); - unlock_mount_hash(); namespace_unlock(); mntput(path->mnt); -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* Re: [PATCH v2 54/63] open_detached_copy(): don't bother with mount_lock_hash()
  2025-08-28 23:07 ` [PATCH v2 54/63] open_detached_copy(): don't bother with mount_lock_hash() Al Viro
@ 2025-09-01 11:29 ` Christian Brauner
  0 siblings, 0 replies; 321+ messages in thread
From: Christian Brauner @ 2025-09-01 11:29 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, jack, torvalds

On Fri, Aug 29, 2025 at 12:07:57AM +0100, Al Viro wrote:
> we are holding namespace_sem and a reference to root of tree;
> iterating through that tree does not need mount_lock. Neither
> does the insertion into the rbtree of new namespace or incrementing
> the mount count of that namespace.
>
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---

Reviewed-by: Christian Brauner <brauner@kernel.org>

^ permalink raw reply [flat|nested] 321+ messages in thread
* [PATCH v2 55/63] open_detached_copy(): separate creation of namespace into helper 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (52 preceding siblings ...) 2025-08-28 23:07 ` [PATCH v2 54/63] open_detached_copy(): don't bother with mount_lock_hash() Al Viro @ 2025-08-28 23:07 ` Al Viro 2025-08-29 9:54 ` Christian Brauner 2025-08-28 23:07 ` [PATCH v2 56/63] mnt_ns_tree_remove(): DTRT if mnt_ns had never been added to mnt_ns_list Al Viro ` (7 subsequent siblings) 61 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds ... and convert the helper to use of a guard(namespace_excl) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 24 +++++++++++++++--------- 1 file changed, 15 insertions(+), 9 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 425c33377770..c324800e770c 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3053,18 +3053,17 @@ static int do_loopback(const struct path *path, const char *old_name, return err; } -static struct file *open_detached_copy(struct path *path, bool recursive) +static struct mnt_namespace *get_detached_copy(const struct path *path, bool recursive) { struct mnt_namespace *ns, *mnt_ns = current->nsproxy->mnt_ns, *src_mnt_ns; struct user_namespace *user_ns = mnt_ns->user_ns; struct mount *mnt, *p; - struct file *file; ns = alloc_mnt_ns(user_ns, true); if (IS_ERR(ns)) - return ERR_CAST(ns); + return ns; - namespace_lock(); + guard(namespace_excl)(); /* * Record the sequence number of the source mount namespace. 
@@ -3081,8 +3080,7 @@ static struct file *open_detached_copy(struct path *path, bool recursive) mnt = __do_loopback(path, recursive); if (IS_ERR(mnt)) { - namespace_unlock(); - free_mnt_ns(ns); + emptied_ns = ns; return ERR_CAST(mnt); } @@ -3091,11 +3089,19 @@ static struct file *open_detached_copy(struct path *path, bool recursive) ns->nr_mounts++; } ns->root = mnt; - mntget(&mnt->mnt); - namespace_unlock(); + return ns; +} + +static struct file *open_detached_copy(struct path *path, bool recursive) +{ + struct mnt_namespace *ns = get_detached_copy(path, recursive); + struct file *file; + + if (IS_ERR(ns)) + return ERR_CAST(ns); mntput(path->mnt); - path->mnt = &mnt->mnt; + path->mnt = mntget(&ns->root->mnt); file = dentry_open(path, O_PATH, current_cred()); if (IS_ERR(file)) dissolve_on_fput(path->mnt); -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* Re: [PATCH v2 55/63] open_detached_copy(): separate creation of namespace into helper
  2025-08-28 23:07 ` [PATCH v2 55/63] open_detached_copy(): separate creation of namespace into helper Al Viro
@ 2025-08-29 9:54 ` Christian Brauner
  0 siblings, 0 replies; 321+ messages in thread
From: Christian Brauner @ 2025-08-29 9:54 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, jack, torvalds

On Fri, Aug 29, 2025 at 12:07:58AM +0100, Al Viro wrote:
> ... and convert the helper to use of a guard(namespace_excl)
>
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---

Reviewed-by: Christian Brauner <brauner@kernel.org>

^ permalink raw reply [flat|nested] 321+ messages in thread
* [PATCH v2 56/63] mnt_ns_tree_remove(): DTRT if mnt_ns had never been added to mnt_ns_list 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (53 preceding siblings ...) 2025-08-28 23:07 ` [PATCH v2 55/63] open_detached_copy(): separate creation of namespace into helper Al Viro @ 2025-08-28 23:07 ` Al Viro 2025-08-29 9:57 ` Christian Brauner 2025-08-28 23:08 ` [PATCH v2 57/63] copy_mnt_ns(): use the regular mechanism for freeing empty mnt_ns on failure Al Viro ` (6 subsequent siblings) 61 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-28 23:07 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Actual removal is done under the lock, but for checking if need to bother the lockless list_empty() is safe - either that namespace never had never been added to mnt_ns_tree, in which case the list will stay empty, or whoever had allocated it has called mnt_ns_tree_add() and it has already run to completion. After that point list_empty() will become false and will remain false, no matter what we do with the neighbors in mnt_ns_list. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/namespace.c b/fs/namespace.c index c324800e770c..daa72292ea58 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -195,7 +195,7 @@ static void mnt_ns_release_rcu(struct rcu_head *rcu) static void mnt_ns_tree_remove(struct mnt_namespace *ns) { /* remove from global mount namespace list */ - if (!is_anon_ns(ns)) { + if (!list_empty(&ns->mnt_ns_list)) { mnt_ns_tree_write_lock(); rb_erase(&ns->mnt_ns_tree_node, &mnt_ns_tree); list_bidir_del_rcu(&ns->mnt_ns_list); -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* Re: [PATCH v2 56/63] mnt_ns_tree_remove(): DTRT if mnt_ns had never been added to mnt_ns_list
  2025-08-28 23:07 ` [PATCH v2 56/63] mnt_ns_tree_remove(): DTRT if mnt_ns had never been added to mnt_ns_list Al Viro
@ 2025-08-29 9:57 ` Christian Brauner
  0 siblings, 0 replies; 321+ messages in thread
From: Christian Brauner @ 2025-08-29 9:57 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, jack, torvalds

On Fri, Aug 29, 2025 at 12:07:59AM +0100, Al Viro wrote:
> Actual removal is done under the lock, but for checking if need to bother
> the lockless list_empty() is safe - either that namespace never had never

nit: two "never"s

> been added to mnt_ns_tree, in which case the list will stay empty, or
> whoever had allocated it has called mnt_ns_tree_add() and it has already
> run to completion. After that point list_empty() will become false and
> will remain false, no matter what we do with the neighbors in mnt_ns_list.
>
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---

Reviewed-by: Christian Brauner <brauner@kernel.org>

>  fs/namespace.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/namespace.c b/fs/namespace.c
> index c324800e770c..daa72292ea58 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -195,7 +195,7 @@ static void mnt_ns_release_rcu(struct rcu_head *rcu)
>  static void mnt_ns_tree_remove(struct mnt_namespace *ns)
>  {
>  	/* remove from global mount namespace list */
> -	if (!is_anon_ns(ns)) {
> +	if (!list_empty(&ns->mnt_ns_list)) {
>  		mnt_ns_tree_write_lock();
>  		rb_erase(&ns->mnt_ns_tree_node, &mnt_ns_tree);
>  		list_bidir_del_rcu(&ns->mnt_ns_list);
> --
> 2.47.2
>

^ permalink raw reply [flat|nested] 321+ messages in thread
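The argument in 56/63 leans on a property of circular doubly linked lists: an entry that was initialized but never added points at itself, and once it has been linked in, removing its neighbors never makes it point at itself again. A minimal single-threaded model of that property follows. It is a sketch with hypothetical toy_* helpers, not the kernel's list implementation, and it deliberately says nothing about the memory ordering that the real lockless check relies on.

```c
#include <stdbool.h>

struct toy_list { struct toy_list *next, *prev; };

/* A freshly initialized entry is self-linked, i.e. "empty". */
static void toy_list_init(struct toy_list *h)
{
	h->next = h->prev = h;
}

static bool toy_list_empty(const struct toy_list *h)
{
	return h->next == h;
}

static void toy_list_add_tail(struct toy_list *new, struct toy_list *head)
{
	new->prev = head->prev;
	new->next = head;
	head->prev->next = new;
	head->prev = new;
}

/* Unlink @e from its neighbors without reinitializing @e itself. */
static void toy_list_del(struct toy_list *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
}
```

In this model, an entry that went through toy_list_add_tail() keeps reporting non-empty while other entries come and go around it, whereas one that was only initialized keeps reporting empty; that is the distinction the patch uses in place of is_anon_ns().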
* [PATCH v2 57/63] copy_mnt_ns(): use the regular mechanism for freeing empty mnt_ns on failure 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (54 preceding siblings ...) 2025-08-28 23:07 ` [PATCH v2 56/63] mnt_ns_tree_remove(): DTRT if mnt_ns had never been added to mnt_ns_list Al Viro @ 2025-08-28 23:08 ` Al Viro 2025-08-29 9:56 ` Christian Brauner 2025-08-28 23:08 ` [PATCH v2 58/63] copy_mnt_ns(): use guards Al Viro ` (5 subsequent siblings) 61 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-28 23:08 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Now that free_mnt_ns() works prior to mnt_ns_tree_add(), there's no need for an open-coded analogue free_mnt_ns() there - yes, we do avoid one call_rcu() use per failing call of clone() or unshare(), if they fail due to OOM in that particular spot, but it's not really worth bothering. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index daa72292ea58..a418555586ef 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -4190,10 +4190,8 @@ struct mnt_namespace *copy_mnt_ns(unsigned long flags, struct mnt_namespace *ns, copy_flags |= CL_SLAVE; new = copy_tree(old, old->mnt.mnt_root, copy_flags); if (IS_ERR(new)) { + emptied_ns = new_ns; namespace_unlock(); - ns_free_inum(&new_ns->ns); - dec_mnt_namespaces(new_ns->ucounts); - mnt_ns_release(new_ns); return ERR_CAST(new); } if (user_ns != ns->user_ns) { -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* Re: [PATCH v2 57/63] copy_mnt_ns(): use the regular mechanism for freeing empty mnt_ns on failure
  2025-08-28 23:08 ` [PATCH v2 57/63] copy_mnt_ns(): use the regular mechanism for freeing empty mnt_ns on failure Al Viro
@ 2025-08-29 9:56 ` Christian Brauner
  0 siblings, 0 replies; 321+ messages in thread
From: Christian Brauner @ 2025-08-29 9:56 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, jack, torvalds

On Fri, Aug 29, 2025 at 12:08:00AM +0100, Al Viro wrote:
> Now that free_mnt_ns() works prior to mnt_ns_tree_add(), there's no need for
> an open-coded analogue free_mnt_ns() there - yes, we do avoid one call_rcu()
> use per failing call of clone() or unshare(), if they fail due to OOM in that
> particular spot, but it's not really worth bothering.
>
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---

Reviewed-by: Christian Brauner <brauner@kernel.org>

^ permalink raw reply [flat|nested] 321+ messages in thread
* [PATCH v2 58/63] copy_mnt_ns(): use guards 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (55 preceding siblings ...) 2025-08-28 23:08 ` [PATCH v2 57/63] copy_mnt_ns(): use the regular mechanism for freeing empty mnt_ns on failure Al Viro @ 2025-08-28 23:08 ` Al Viro 2025-09-01 11:43 ` Christian Brauner 2025-08-28 23:08 ` [PATCH v2 59/63] setup_mnt(): primitive for connecting a mount to filesystem Al Viro ` (4 subsequent siblings) 61 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-28 23:08 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds * mntput() of rootmnt and pwdmnt done via __free(mntput) * mnt_ns_tree_add() can be done within namespace_excl scope. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 17 ++++------------- 1 file changed, 4 insertions(+), 13 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index a418555586ef..9e16231d4561 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -4164,7 +4164,8 @@ struct mnt_namespace *copy_mnt_ns(unsigned long flags, struct mnt_namespace *ns, struct user_namespace *user_ns, struct fs_struct *new_fs) { struct mnt_namespace *new_ns; - struct vfsmount *rootmnt = NULL, *pwdmnt = NULL; + struct vfsmount *rootmnt __free(mntput) = NULL; + struct vfsmount *pwdmnt __free(mntput) = NULL; struct mount *p, *q; struct mount *old; struct mount *new; @@ -4183,7 +4184,7 @@ struct mnt_namespace *copy_mnt_ns(unsigned long flags, struct mnt_namespace *ns, if (IS_ERR(new_ns)) return new_ns; - namespace_lock(); + guard(namespace_excl)(); /* First pass: copy the tree topology */ copy_flags = CL_COPY_UNBINDABLE | CL_EXPIRE; if (user_ns != ns->user_ns) @@ -4191,13 +4192,11 @@ struct mnt_namespace *copy_mnt_ns(unsigned long flags, struct mnt_namespace *ns, new = copy_tree(old, old->mnt.mnt_root, copy_flags); if (IS_ERR(new)) { emptied_ns = new_ns; - namespace_unlock(); return ERR_CAST(new); } if (user_ns != ns->user_ns) { - 
lock_mount_hash(); + guard(mount_writer)(); lock_mnt_tree(new); - unlock_mount_hash(); } new_ns->root = new; @@ -4229,14 +4228,6 @@ struct mnt_namespace *copy_mnt_ns(unsigned long flags, struct mnt_namespace *ns, while (p->mnt.mnt_root != q->mnt.mnt_root) p = next_mnt(skip_mnt_tree(p), old); } - namespace_unlock(); - - if (rootmnt) - mntput(rootmnt); - if (pwdmnt) - mntput(pwdmnt); - - mnt_ns_tree_add(new_ns); return new_ns; } -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* Re: [PATCH v2 58/63] copy_mnt_ns(): use guards 2025-08-28 23:08 ` [PATCH v2 58/63] copy_mnt_ns(): use guards Al Viro @ 2025-09-01 11:43 ` Christian Brauner 0 siblings, 0 replies; 321+ messages in thread From: Christian Brauner @ 2025-09-01 11:43 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, jack, torvalds On Fri, Aug 29, 2025 at 12:08:01AM +0100, Al Viro wrote: > * mntput() of rootmnt and pwdmnt done via __free(mntput) > * mnt_ns_tree_add() can be done within namespace_excl scope. > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > --- > fs/namespace.c | 17 ++++------------- > 1 file changed, 4 insertions(+), 13 deletions(-) > > diff --git a/fs/namespace.c b/fs/namespace.c > index a418555586ef..9e16231d4561 100644 > --- a/fs/namespace.c > +++ b/fs/namespace.c > @@ -4164,7 +4164,8 @@ struct mnt_namespace *copy_mnt_ns(unsigned long flags, struct mnt_namespace *ns, > struct user_namespace *user_ns, struct fs_struct *new_fs) > { > struct mnt_namespace *new_ns; > - struct vfsmount *rootmnt = NULL, *pwdmnt = NULL; > + struct vfsmount *rootmnt __free(mntput) = NULL; > + struct vfsmount *pwdmnt __free(mntput) = NULL; > struct mount *p, *q; > struct mount *old; > struct mount *new; > @@ -4183,7 +4184,7 @@ struct mnt_namespace *copy_mnt_ns(unsigned long flags, struct mnt_namespace *ns, > if (IS_ERR(new_ns)) > return new_ns; > > - namespace_lock(); > + guard(namespace_excl)(); > /* First pass: copy the tree topology */ > copy_flags = CL_COPY_UNBINDABLE | CL_EXPIRE; > if (user_ns != ns->user_ns) > @@ -4191,13 +4192,11 @@ struct mnt_namespace *copy_mnt_ns(unsigned long flags, struct mnt_namespace *ns, > new = copy_tree(old, old->mnt.mnt_root, copy_flags); > if (IS_ERR(new)) { > emptied_ns = new_ns; > - namespace_unlock(); > return ERR_CAST(new); > } > if (user_ns != ns->user_ns) { > - lock_mount_hash(); > + guard(mount_writer)(); > lock_mnt_tree(new); > - unlock_mount_hash(); > } > new_ns->root = new; > > @@ -4229,14 +4228,6 @@ struct mnt_namespace 
*copy_mnt_ns(unsigned long flags, struct mnt_namespace *ns, > while (p->mnt.mnt_root != q->mnt.mnt_root) > p = next_mnt(skip_mnt_tree(p), old); > } > - namespace_unlock(); > - > - if (rootmnt) > - mntput(rootmnt); > - if (pwdmnt) > - mntput(pwdmnt); > - > - mnt_ns_tree_add(new_ns); The commit message states that "mnt_ns_tree_add() can be done within namespace_excl scope" suggesting that all this does is to widen the scope of the lock. But this change also removes the call to mnt_ns_tree_add() completely? Intentional? ^ permalink raw reply [flat|nested] 321+ messages in thread
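For readers less familiar with the cleanup.h-style helpers this patch converts to, here is a minimal userspace sketch of the ordering they depend on. It uses the compiler's cleanup attribute directly (GCC and clang); fake_lock, fake_ref, scoped_demo() and the event log are made up for illustration and are not the kernel's guard() or __free() implementations. What it shows: cleanups run in reverse order of declaration, so a reference declared before the lock guard is dropped only after the lock has been released. It says nothing about the mnt_ns_tree_add() question raised above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Event log so the order of cleanups can be inspected afterwards. */
static char events[64];

static void log_event(const char *s)
{
	strcat(events, s);
}

/* Hypothetical stand-ins for a lock and a counted reference. */
struct fake_lock { int held; };
struct fake_ref { int count; };

static struct fake_lock the_lock;

/* Cleanup handler: drop the reference if one was stashed. */
static void drop_ref(struct fake_ref **r)
{
	if (*r) {
		(*r)->count--;
		log_event("put;");
	}
}

/* Cleanup handler: release the lock taken by lock(). */
static void unlock(struct fake_lock **l)
{
	(*l)->held = 0;
	log_event("unlock;");
}

static struct fake_lock *lock(struct fake_lock *l)
{
	l->held = 1;
	log_event("lock;");
	return l;
}

/* On return the lock is dropped first, then the reference is released. */
static void scoped_demo(struct fake_ref *ref)
{
	/* Declared first, so cleaned up last (after the unlock). */
	struct fake_ref *held_ref __attribute__((cleanup(drop_ref))) = NULL;
	/* Declared second, so cleaned up first. */
	struct fake_lock *guard __attribute__((cleanup(unlock))) = lock(&the_lock);

	ref->count++;		/* grab a reference inside the locked scope */
	held_ref = ref;		/* hand it to the cleanup handler */
	log_event("work;");
}
```

Calling scoped_demo() on a reference with count 1 leaves the count at 1 and the lock released. The order of declarations in the function, not the order of the old explicit calls, decides what gets dropped inside or outside the locked region.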
* [PATCH v2 59/63] setup_mnt(): primitive for connecting a mount to filesystem 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (56 preceding siblings ...) 2025-08-28 23:08 ` [PATCH v2 58/63] copy_mnt_ns(): use guards Al Viro @ 2025-08-28 23:08 ` Al Viro 2025-08-28 23:08 ` [PATCH v2 60/63] preparations to taking MNT_WRITE_HOLD out of ->mnt_flags Al Viro ` (3 subsequent siblings) 61 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-08-28 23:08 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Take the identical logics in vfs_create_mount() and clone_mnt() into a new helper that takes an empty struct mount and attaches it to given dentry (sub)tree. Should be called once in the lifetime of every mount, prior to making it visible in any data structures. After that point ->mnt_root and ->mnt_sb never change; ->mnt_root is a counting reference to dentry and ->mnt_sb - an active reference to superblock. Mount remains associated with that dentry tree all the way until the call of cleanup_mnt(), when the refcount eventually drops to zero. 
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 34 +++++++++++++++++----------------- 1 file changed, 17 insertions(+), 17 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 9e16231d4561..5af609ff43bc 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -1195,6 +1195,21 @@ static void commit_tree(struct mount *mnt) touch_mnt_namespace(n); } +static void setup_mnt(struct mount *m, struct dentry *root) +{ + struct super_block *s = root->d_sb; + + atomic_inc(&s->s_active); + m->mnt.mnt_sb = s; + m->mnt.mnt_root = dget(root); + m->mnt_mountpoint = m->mnt.mnt_root; + m->mnt_parent = m; + + lock_mount_hash(); + list_add_tail(&m->mnt_instance, &s->s_mounts); + unlock_mount_hash(); +} + /** * vfs_create_mount - Create a mount for a configured superblock * @fc: The configuration context with the superblock attached @@ -1218,15 +1233,8 @@ struct vfsmount *vfs_create_mount(struct fs_context *fc) if (fc->sb_flags & SB_KERNMOUNT) mnt->mnt.mnt_flags = MNT_INTERNAL; - atomic_inc(&fc->root->d_sb->s_active); - mnt->mnt.mnt_sb = fc->root->d_sb; - mnt->mnt.mnt_root = dget(fc->root); - mnt->mnt_mountpoint = mnt->mnt.mnt_root; - mnt->mnt_parent = mnt; + setup_mnt(mnt, fc->root); - lock_mount_hash(); - list_add_tail(&mnt->mnt_instance, &mnt->mnt.mnt_sb->s_mounts); - unlock_mount_hash(); return &mnt->mnt; } EXPORT_SYMBOL(vfs_create_mount); @@ -1284,7 +1292,6 @@ EXPORT_SYMBOL_GPL(vfs_kern_mount); static struct mount *clone_mnt(struct mount *old, struct dentry *root, int flag) { - struct super_block *sb = old->mnt.mnt_sb; struct mount *mnt; int err; @@ -1309,16 +1316,9 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root, if (mnt->mnt_group_id) set_mnt_shared(mnt); - atomic_inc(&sb->s_active); mnt->mnt.mnt_idmap = mnt_idmap_get(mnt_idmap(&old->mnt)); - mnt->mnt.mnt_sb = sb; - mnt->mnt.mnt_root = dget(root); - mnt->mnt_mountpoint = mnt->mnt.mnt_root; - mnt->mnt_parent = mnt; - lock_mount_hash(); - list_add_tail(&mnt->mnt_instance, 
&sb->s_mounts); - unlock_mount_hash(); + setup_mnt(mnt, root); if (flag & CL_PRIVATE) // we are done with it return mnt; -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v2 60/63] preparations to taking MNT_WRITE_HOLD out of ->mnt_flags 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (57 preceding siblings ...) 2025-08-28 23:08 ` [PATCH v2 59/63] setup_mnt(): primitive for connecting a mount to filesystem Al Viro @ 2025-08-28 23:08 ` Al Viro 2025-08-28 23:08 ` [PATCH v2 61/63] struct mount: relocate MNT_WRITE_HOLD bit Al Viro ` (2 subsequent siblings) 61 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-08-28 23:08 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds We have an unpleasant wart in accessibility rules for struct mount. There are per-superblock lists of mounts, used by sb_prepare_remount_readonly() to check if any of those is currently claimed for write access and to block further attempts to get write access on those until we are done. As soon as it is attached to a filesystem, mount becomes reachable via that list. Only sb_prepare_remount_readonly() traverses it and it only accesses a few members of struct mount. Unfortunately, ->mnt_flags is one of those and it is modified - MNT_WRITE_HOLD set and then cleared. It is done under mount_lock, so from the locking rules POV everything's fine. However, it has easily overlooked implications - once mount has been attached to a filesystem, it has to be treated as globally visible. In particular, initializing ->mnt_flags *must* be done either prior to that point or under mount_lock. All other members are still private at that point. Life gets simpler if we move that bit (and that's *all* that can get touched by access via this list) out of ->mnt_flags. It's not even hard to do - currently the list is implemented as list_head one, anchored in super_block->s_mounts and linked via mount->mnt_instance. 
As the first step, switch it to hlist-like open-coded structure - address of the first mount in the set is stored in ->s_mounts and ->mnt_instance replaced with ->mnt_next_for_sb and ->mnt_pprev_for_sb - the former either NULL or pointing to the next mount in set, the latter - address of either ->s_mounts or ->mnt_next_for_sb in the previous element of the set. In the next commit we'll steal the LSB of ->mnt_pprev_for_sb as replacement for MNT_WRITE_HOLD. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/mount.h | 3 ++- fs/namespace.c | 38 +++++++++++++++++++++++++++++--------- fs/super.c | 3 +-- include/linux/fs.h | 4 +++- 4 files changed, 35 insertions(+), 13 deletions(-) diff --git a/fs/mount.h b/fs/mount.h index 04d0eadc4c10..5c2ddcff810c 100644 --- a/fs/mount.h +++ b/fs/mount.h @@ -64,7 +64,8 @@ struct mount { #endif struct list_head mnt_mounts; /* list of children, anchored here */ struct list_head mnt_child; /* and going through their mnt_child */ - struct list_head mnt_instance; /* mount instance on sb->s_mounts */ + struct mount *mnt_next_for_sb; /* the next two fields are hlist_node, */ + struct mount **mnt_pprev_for_sb;/* except that LSB of pprev will be stolen */ const char *mnt_devname; /* Name of device e.g. 
/dev/dsk/hda1 */ struct list_head mnt_list; struct list_head mnt_expire; /* link in fs-specific expiry list */ diff --git a/fs/namespace.c b/fs/namespace.c index 5af609ff43bc..120854639dd2 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -729,6 +729,27 @@ static inline void mnt_unhold_writers(struct mount *mnt) mnt->mnt.mnt_flags &= ~MNT_WRITE_HOLD; } +static inline void mnt_del_instance(struct mount *m) +{ + struct mount **p = m->mnt_pprev_for_sb; + struct mount *next = m->mnt_next_for_sb; + + if (next) + next->mnt_pprev_for_sb = p; + *p = next; +} + +static inline void mnt_add_instance(struct mount *m, struct super_block *s) +{ + struct mount *first = s->s_mounts; + + if (first) + first->mnt_pprev_for_sb = &m->mnt_next_for_sb; + m->mnt_next_for_sb = first; + m->mnt_pprev_for_sb = &s->s_mounts; + s->s_mounts = m; +} + static int mnt_make_readonly(struct mount *mnt) { int ret; @@ -742,7 +763,6 @@ static int mnt_make_readonly(struct mount *mnt) int sb_prepare_remount_readonly(struct super_block *sb) { - struct mount *mnt; int err = 0; /* Racy optimization. 
Recheck the counter under MNT_WRITE_HOLD */ @@ -750,9 +770,9 @@ int sb_prepare_remount_readonly(struct super_block *sb) return -EBUSY; lock_mount_hash(); - list_for_each_entry(mnt, &sb->s_mounts, mnt_instance) { - if (!(mnt->mnt.mnt_flags & MNT_READONLY)) { - err = mnt_hold_writers(mnt); + for (struct mount *m = sb->s_mounts; m; m = m->mnt_next_for_sb) { + if (!(m->mnt.mnt_flags & MNT_READONLY)) { + err = mnt_hold_writers(m); if (err) break; } @@ -762,9 +782,9 @@ int sb_prepare_remount_readonly(struct super_block *sb) if (!err) sb_start_ro_state_change(sb); - list_for_each_entry(mnt, &sb->s_mounts, mnt_instance) { - if (mnt->mnt.mnt_flags & MNT_WRITE_HOLD) - mnt->mnt.mnt_flags &= ~MNT_WRITE_HOLD; + for (struct mount *m = sb->s_mounts; m; m = m->mnt_next_for_sb) { + if (m->mnt.mnt_flags & MNT_WRITE_HOLD) + m->mnt.mnt_flags &= ~MNT_WRITE_HOLD; } unlock_mount_hash(); @@ -1206,7 +1226,7 @@ static void setup_mnt(struct mount *m, struct dentry *root) m->mnt_parent = m; lock_mount_hash(); - list_add_tail(&m->mnt_instance, &s->s_mounts); + mnt_add_instance(m, s); unlock_mount_hash(); } @@ -1424,7 +1444,7 @@ static void mntput_no_expire(struct mount *mnt) mnt->mnt.mnt_flags |= MNT_DOOMED; rcu_read_unlock(); - list_del(&mnt->mnt_instance); + mnt_del_instance(mnt); if (unlikely(!list_empty(&mnt->mnt_expire))) list_del(&mnt->mnt_expire); diff --git a/fs/super.c b/fs/super.c index 7f876f32343a..3b0f49e1b817 100644 --- a/fs/super.c +++ b/fs/super.c @@ -323,7 +323,6 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags, if (!s) return NULL; - INIT_LIST_HEAD(&s->s_mounts); s->s_user_ns = get_user_ns(user_ns); init_rwsem(&s->s_umount); lockdep_set_class(&s->s_umount, &type->s_umount_key); @@ -408,7 +407,7 @@ static void __put_super(struct super_block *s) list_del_init(&s->s_list); WARN_ON(s->s_dentry_lru.node); WARN_ON(s->s_inode_lru.node); - WARN_ON(!list_empty(&s->s_mounts)); + WARN_ON(s->s_mounts); call_rcu(&s->rcu, destroy_super_rcu); } } diff --git 
a/include/linux/fs.h b/include/linux/fs.h index d7ab4f96d705..0e9c7f1460dc 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1324,6 +1324,8 @@ struct sb_writers { struct percpu_rw_semaphore rw_sem[SB_FREEZE_LEVELS]; }; +struct mount; + struct super_block { struct list_head s_list; /* Keep this first */ dev_t s_dev; /* search index; _not_ kdev_t */ @@ -1358,7 +1360,7 @@ struct super_block { __u16 s_encoding_flags; #endif struct hlist_bl_head s_roots; /* alternate root dentries for NFS */ - struct list_head s_mounts; /* list of mounts; _not_ for fs use */ + struct mount *s_mounts; /* list of mounts; _not_ for fs use */ struct block_device *s_bdev; /* can go away once we use an accessor for @s_bdev_file */ struct file *s_bdev_file; struct backing_dev_info *s_bdi; -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
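The hlist-like open-coded structure described in the commit message above can be tried out in userspace. The following is a sketch only: struct node and struct owner are hypothetical stand-ins for struct mount and struct super_block, locking is omitted, and count_instances() exists purely to make the result observable.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct mount and struct super_block. */
struct node {
	struct node *next;	/* next element, or NULL at the end */
	struct node **pprev;	/* address of whatever points at us */
	int id;
};

struct owner {
	struct node *first;	/* NULL when the set is empty */
};

/* Insert at the head, mirroring the shape of mnt_add_instance(). */
static void add_instance(struct node *n, struct owner *o)
{
	struct node *first = o->first;

	if (first)
		first->pprev = &n->next;
	n->next = first;
	n->pprev = &o->first;
	o->first = n;
}

/* Unlink without needing to know the owner, as in mnt_del_instance(). */
static void del_instance(struct node *n)
{
	struct node **p = n->pprev;
	struct node *next = n->next;

	if (next)
		next->pprev = p;
	*p = next;
}

/* Forward traversal never looks at ->pprev. */
static int count_instances(const struct owner *o)
{
	int n = 0;

	for (const struct node *m = o->first; m; m = m->next)
		n++;
	return n;
}
```

Because ->pprev holds the address of the pointer that refers to the element (either the anchor or the previous element's ->next), removal needs no special case for the first element, and an empty set is just a NULL anchor with nothing to initialize.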
* [PATCH v2 61/63] struct mount: relocate MNT_WRITE_HOLD bit 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (58 preceding siblings ...) 2025-08-28 23:08 ` [PATCH v2 60/63] preparations to taking MNT_WRITE_HOLD out of ->mnt_flags Al Viro @ 2025-08-28 23:08 ` Al Viro 2025-08-28 23:31 ` Linus Torvalds 2025-08-28 23:08 ` [PATCH v2 62/63] simplify the callers of mnt_unhold_writers() Al Viro 2025-08-28 23:08 ` [PATCH v2 63/63] WRITE_HOLD machinery: no need for to bump mount_lock seqcount Al Viro 61 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-28 23:08 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds ... from ->mnt_flags to LSB of ->mnt_pprev_for_sb. This is safe - we always set and clear it within the same mount_lock scope, so we won't interfere with list operations - traversals are always forward, so they don't even look at ->mnt_pprev_for_sb and both insertions and removals are in mount_lock scopes of their own, so that bit will be clear in *all* mount instances during those. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/mount.h | 3 ++- fs/namespace.c | 50 +++++++++++++++++++++---------------------- include/linux/fs.h | 4 +--- include/linux/mount.h | 3 +-- 4 files changed, 29 insertions(+), 31 deletions(-) diff --git a/fs/mount.h b/fs/mount.h index 5c2ddcff810c..c13bbd93d837 100644 --- a/fs/mount.h +++ b/fs/mount.h @@ -65,7 +65,8 @@ struct mount { struct list_head mnt_mounts; /* list of children, anchored here */ struct list_head mnt_child; /* and going through their mnt_child */ struct mount *mnt_next_for_sb; /* the next two fields are hlist_node, */ - struct mount **mnt_pprev_for_sb;/* except that LSB of pprev will be stolen */ + unsigned long mnt_pprev_for_sb; /* except that LSB of pprev is stolen */ +#define WRITE_HOLD 1 /* ... for use by mnt_hold_writers() */ const char *mnt_devname; /* Name of device e.g. 
/dev/dsk/hda1 */ struct list_head mnt_list; struct list_head mnt_expire; /* link in fs-specific expiry list */ diff --git a/fs/namespace.c b/fs/namespace.c index 120854639dd2..f9c9c69a815b 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -509,20 +509,20 @@ int mnt_get_write_access(struct vfsmount *m) mnt_inc_writers(mnt); /* * The store to mnt_inc_writers must be visible before we pass - * MNT_WRITE_HOLD loop below, so that the slowpath can see our - * incremented count after it has set MNT_WRITE_HOLD. + * WRITE_HOLD loop below, so that the slowpath can see our + * incremented count after it has set WRITE_HOLD. */ smp_mb(); might_lock(&mount_lock.lock); - while (READ_ONCE(mnt->mnt.mnt_flags) & MNT_WRITE_HOLD) { + while (READ_ONCE(mnt->mnt_pprev_for_sb) & WRITE_HOLD) { if (!IS_ENABLED(CONFIG_PREEMPT_RT)) { cpu_relax(); } else { /* * This prevents priority inversion, if the task - * setting MNT_WRITE_HOLD got preempted on a remote + * setting WRITE_HOLD got preempted on a remote * CPU, and it prevents life lock if the task setting - * MNT_WRITE_HOLD has a lower priority and is bound to + * WRITE_HOLD has a lower priority and is bound to * the same CPU as the task that is spinning here. */ preempt_enable(); @@ -533,7 +533,7 @@ int mnt_get_write_access(struct vfsmount *m) } /* * The barrier pairs with the barrier sb_start_ro_state_change() making - * sure that if we see MNT_WRITE_HOLD cleared, we will also see + * sure that if we see WRITE_HOLD cleared, we will also see * s_readonly_remount set (or even SB_RDONLY / MNT_READONLY flags) in * mnt_is_readonly() and bail in case we are racing with remount * read-only. @@ -672,15 +672,15 @@ EXPORT_SYMBOL(mnt_drop_write_file); * @mnt. * * Context: This function expects lock_mount_hash() to be held serializing - * setting MNT_WRITE_HOLD. + * setting WRITE_HOLD. * Return: On success 0 is returned. * On error, -EBUSY is returned. 
*/ static inline int mnt_hold_writers(struct mount *mnt) { - mnt->mnt.mnt_flags |= MNT_WRITE_HOLD; + mnt->mnt_pprev_for_sb |= WRITE_HOLD; /* - * After storing MNT_WRITE_HOLD, we'll read the counters. This store + * After storing WRITE_HOLD, we'll read the counters. This store * should be visible before we do. */ smp_mb(); @@ -696,9 +696,9 @@ static inline int mnt_hold_writers(struct mount *mnt) * sum up each counter, if we read a counter before it is incremented, * but then read another CPU's count which it has been subsequently * decremented from -- we would see more decrements than we should. - * MNT_WRITE_HOLD protects against this scenario, because + * WRITE_HOLD protects against this scenario, because * mnt_want_write first increments count, then smp_mb, then spins on - * MNT_WRITE_HOLD, so it can't be decremented by another CPU while + * WRITE_HOLD, so it can't be decremented by another CPU while * we're counting up here. */ if (mnt_get_writers(mnt) > 0) @@ -722,20 +722,20 @@ static inline int mnt_hold_writers(struct mount *mnt) static inline void mnt_unhold_writers(struct mount *mnt) { /* - * MNT_READONLY must become visible before ~MNT_WRITE_HOLD, so writers + * MNT_READONLY must become visible before ~WRITE_HOLD, so writers * that become unheld will see MNT_READONLY. 
*/ smp_wmb(); - mnt->mnt.mnt_flags &= ~MNT_WRITE_HOLD; + mnt->mnt_pprev_for_sb &= ~WRITE_HOLD; } static inline void mnt_del_instance(struct mount *m) { - struct mount **p = m->mnt_pprev_for_sb; + struct mount **p = (void *)m->mnt_pprev_for_sb; struct mount *next = m->mnt_next_for_sb; if (next) - next->mnt_pprev_for_sb = p; + next->mnt_pprev_for_sb = (unsigned long)p; *p = next; } @@ -744,9 +744,9 @@ static inline void mnt_add_instance(struct mount *m, struct super_block *s) struct mount *first = s->s_mounts; if (first) - first->mnt_pprev_for_sb = &m->mnt_next_for_sb; + first->mnt_pprev_for_sb = (unsigned long)&m->mnt_next_for_sb; m->mnt_next_for_sb = first; - m->mnt_pprev_for_sb = &s->s_mounts; + m->mnt_pprev_for_sb = (unsigned long)&s->s_mounts; s->s_mounts = m; } @@ -765,7 +765,7 @@ int sb_prepare_remount_readonly(struct super_block *sb) { int err = 0; - /* Racy optimization. Recheck the counter under MNT_WRITE_HOLD */ + /* Racy optimization. Recheck the counter under WRITE_HOLD */ if (atomic_long_read(&sb->s_remove_count)) return -EBUSY; @@ -783,8 +783,8 @@ int sb_prepare_remount_readonly(struct super_block *sb) if (!err) sb_start_ro_state_change(sb); for (struct mount *m = sb->s_mounts; m; m = m->mnt_next_for_sb) { - if (m->mnt.mnt_flags & MNT_WRITE_HOLD) - m->mnt.mnt_flags &= ~MNT_WRITE_HOLD; + if (m->mnt_pprev_for_sb & WRITE_HOLD) + m->mnt_pprev_for_sb &= ~WRITE_HOLD; } unlock_mount_hash(); @@ -4805,18 +4805,18 @@ static int mount_setattr_prepare(struct mount_kattr *kattr, struct mount *mnt) struct mount *p; /* - * If we had to call mnt_hold_writers() MNT_WRITE_HOLD will - * be set in @mnt_flags. The loop unsets MNT_WRITE_HOLD for all + * If we had to call mnt_hold_writers() WRITE_HOLD will + * be set in @mnt_flags. The loop unsets WRITE_HOLD for all * mounts and needs to take care to include the first mount. */ for (p = mnt; p; p = next_mnt(p, mnt)) { /* If we had to hold writers unblock them. 
*/ - if (p->mnt.mnt_flags & MNT_WRITE_HOLD) + if (p->mnt_pprev_for_sb & WRITE_HOLD) mnt_unhold_writers(p); /* * We're done once the first mount we changed got - * MNT_WRITE_HOLD unset. + * WRITE_HOLD unset. */ if (p == m) break; @@ -4851,7 +4851,7 @@ static void mount_setattr_commit(struct mount_kattr *kattr, struct mount *mnt) WRITE_ONCE(m->mnt.mnt_flags, flags); /* If we had to hold writers unblock them. */ - if (m->mnt.mnt_flags & MNT_WRITE_HOLD) + if (m->mnt_pprev_for_sb & WRITE_HOLD) mnt_unhold_writers(m); if (kattr->propagation) diff --git a/include/linux/fs.h b/include/linux/fs.h index 0e9c7f1460dc..1d583f38fb81 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1324,8 +1324,6 @@ struct sb_writers { struct percpu_rw_semaphore rw_sem[SB_FREEZE_LEVELS]; }; -struct mount; - struct super_block { struct list_head s_list; /* Keep this first */ dev_t s_dev; /* search index; _not_ kdev_t */ @@ -1360,7 +1358,7 @@ struct super_block { __u16 s_encoding_flags; #endif struct hlist_bl_head s_roots; /* alternate root dentries for NFS */ - struct mount *s_mounts; /* list of mounts; _not_ for fs use */ + void *s_mounts; /* list of mounts; _not_ for fs use */ struct block_device *s_bdev; /* can go away once we use an accessor for @s_bdev_file */ struct file *s_bdev_file; struct backing_dev_info *s_bdi; diff --git a/include/linux/mount.h b/include/linux/mount.h index 18e4b97f8a98..85e97b9340ff 100644 --- a/include/linux/mount.h +++ b/include/linux/mount.h @@ -33,7 +33,6 @@ enum mount_flags { MNT_NOSYMFOLLOW = 0x80, MNT_SHRINKABLE = 0x100, - MNT_WRITE_HOLD = 0x200, MNT_INTERNAL = 0x4000, @@ -52,7 +51,7 @@ enum mount_flags { | MNT_READONLY | MNT_NOSYMFOLLOW, MNT_ATIME_MASK = MNT_NOATIME | MNT_NODIRATIME | MNT_RELATIME, - MNT_INTERNAL_FLAGS = MNT_WRITE_HOLD | MNT_INTERNAL | MNT_DOOMED | + MNT_INTERNAL_FLAGS = MNT_INTERNAL | MNT_DOOMED | MNT_SYNC_UMOUNT | MNT_LOCKED }; -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* Re: [PATCH v2 61/63] struct mount: relocate MNT_WRITE_HOLD bit 2025-08-28 23:08 ` [PATCH v2 61/63] struct mount: relocate MNT_WRITE_HOLD bit Al Viro @ 2025-08-28 23:31 ` Linus Torvalds 2025-08-29 0:11 ` Al Viro 0 siblings, 1 reply; 321+ messages in thread From: Linus Torvalds @ 2025-08-28 23:31 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, brauner, jack On Thu, 28 Aug 2025 at 16:08, Al Viro <viro@zeniv.linux.org.uk> wrote: > > ... from ->mnt_flags to LSB of ->mnt_pprev_for_sb. Ugh. This one I'm not happy with. The random new casts: > static inline void mnt_del_instance(struct mount *m) > { > - struct mount **p = m->mnt_pprev_for_sb; > + struct mount **p = (void *)m->mnt_pprev_for_sb; > struct mount *next = m->mnt_next_for_sb; > > if (next) > - next->mnt_pprev_for_sb = p; > + next->mnt_pprev_for_sb = (unsigned long)p; > *p = next; > } are just nasty. And it's there in multiple places (ie mnt_add_instance() has more of them). Making things even *worse*, the other case you changed (s_mounts) it's a "void *", which means that it does *not* have casts in other places, and you still do things like for (struct mount *m = sb->s_mounts; m; m = m->mnt_next_for_sb) { so that 's_mounts' thing is just silently cast from a untyped 'void *' to the 'struct mount *' that it used to be. So no - this is *not* acceptable. Same largely goes for that > - struct mount **mnt_pprev_for_sb;/* except that LSB of pprev will be stolen */ > + unsigned long mnt_pprev_for_sb; /* except that LSB of pprev is stolen */ change, but at least there it's now a 'unsigned long', so it will *always* complain if a cast is missing in either direction. That's better, but still horrendously ugly. If you want to use an opaque type, then please make it be truly opaque. Not 'unsigned long'. And certainly not 'void *'. 
Make it be something that is still type-safe - you can make up a pointer to struct name that is never actually declared, so that it's basically a unique type (or two separate types for mnt_pprev_for_sb and I'm not even clear on why you did this change, but if you want to have specific types for some reason, make them *really* specific. Don't make them 'void *', and 'unsigned long'. Linus ^ permalink raw reply [flat|nested] 321+ messages in thread
* Re: [PATCH v2 61/63] struct mount: relocate MNT_WRITE_HOLD bit 2025-08-28 23:31 ` Linus Torvalds @ 2025-08-29 0:11 ` Al Viro 2025-08-29 0:35 ` Linus Torvalds 0 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-29 0:11 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-fsdevel, brauner, jack On Thu, Aug 28, 2025 at 04:31:56PM -0700, Linus Torvalds wrote: > Same largely goes for that > > > - struct mount **mnt_pprev_for_sb;/* except that LSB of pprev will be stolen */ > > + unsigned long mnt_pprev_for_sb; /* except that LSB of pprev is stolen */ > > change, but at least there it's now a 'unsigned long', so it will > *always* complain if a cast is missing in either direction. That's > better, but still horrendously ugly. > > If you want to use an opaque type, then please make it be truly > opaque. Not 'unsigned long'. And certainly not 'void *'. Make it be > something that is still type-safe - you can make up a pointer to > struct name that is never actually declared, so that it's basically a > unique type (or two separate types for mnt_pprev_for_sb and > > I'm not even clear on why you did this change, but if you want to have > specific types for some reason, make them *really* specific. Don't > make them 'void *', and 'unsigned long'. What I want to avoid is compiler seeing something like (unsigned long)READ_ONCE(m->mnt_pprev_for_sb) & 1 and going "that thing is a pointer to struct mount *, either the address is even or it's an undefined behaviour and I can do whatever I want anyway; optimize it to 0". unsigned long is a brute-force way to avoid that - it avoids UB (OK, avoids it as long as no struct mount instance has an odd address), so compiler can't start playing silly buggers. If you have a prettier approach, I'd like to hear it - I obviously do not enjoy the way this one looks. ^ permalink raw reply [flat|nested] 321+ messages in thread
* Re: [PATCH v2 61/63] struct mount: relocate MNT_WRITE_HOLD bit 2025-08-29 0:11 ` Al Viro @ 2025-08-29 0:35 ` Linus Torvalds 2025-08-29 6:03 ` Al Viro 0 siblings, 1 reply; 321+ messages in thread From: Linus Torvalds @ 2025-08-29 0:35 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, brauner, jack On Thu, 28 Aug 2025 at 17:11, Al Viro <viro@zeniv.linux.org.uk> wrote: > > What I want to avoid is compiler seeing something like > (unsigned long)READ_ONCE(m->mnt_pprev_for_sb) & 1 > and going "that thing is a pointer to struct mount *, either the address > is even or it's an undefined behaviour and I can do whatever I want > anyway; optimize it to 0". Have you actually seen that? Because if some compiler does this, we have tons of other places that will hit this, and we'll need to try to figure out some generic solution, or - more likely - just disable said compiler "optimization". And if you really want to deal with this theoretical issue, please just use a union for it, having both the proper pointer type and the 'unsigned long', and using the appropriate field instead of any type casts. Linus ^ permalink raw reply [flat|nested] 321+ messages in thread
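One possible reading of the union suggestion above, as a userspace sketch. All names here (pprev_word, HOLD_BIT, set_hold() and friends) are made up for illustration, and the code assumes that a pointer and uintptr_t have the same size and representation, which is true on common platforms but not promised by ISO C. It is not code from the series; it only shows the shape of "two views of one word, each user picks the field it needs, no casts".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct node;

/* Hypothetical: one word viewed either as the back pointer or as raw bits. */
union pprev_word {
	struct node **ptr;	/* used by list insertion/removal */
	uintptr_t bits;		/* used by the flag helpers */
};

struct node {
	struct node *next;
	union pprev_word pprev;
};

#define HOLD_BIT ((uintptr_t)1)

static void set_hold(struct node *n)
{
	n->pprev.bits |= HOLD_BIT;
}

static void clear_hold(struct node *n)
{
	n->pprev.bits &= ~HOLD_BIT;
}

static int is_held(const struct node *n)
{
	return (n->pprev.bits & HOLD_BIT) != 0;
}

/* List code uses .ptr and relies on the bit being clear at that point. */
static void unlink_node(struct node *n)
{
	struct node **p;

	assert(!is_held(n));
	p = n->pprev.ptr;
	if (n->next)
		n->next->pprev.ptr = p;
	*p = n->next;
}
```

The flag helpers never touch .ptr and the list code never touches .bits, so a missing or wrong conversion shows up as a type error rather than being silently accepted the way a void * would be.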
* Re: [PATCH v2 61/63] struct mount: relocate MNT_WRITE_HOLD bit 2025-08-29 0:35 ` Linus Torvalds @ 2025-08-29 6:03 ` Al Viro 2025-08-29 6:04 ` [59/63] simplify the callers of mnt_unhold_writers() Al Viro ` (3 more replies) 0 siblings, 4 replies; 321+ messages in thread From: Al Viro @ 2025-08-29 6:03 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-fsdevel, brauner, jack On Thu, Aug 28, 2025 at 05:35:26PM -0700, Linus Torvalds wrote: > On Thu, 28 Aug 2025 at 17:11, Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > What I want to avoid is compiler seeing something like > > (unsigned long)READ_ONCE(m->mnt_pprev_for_sb) & 1 > > and going "that thing is a pointer to struct mount *, either the address > > is even or it's an undefined behaviour and I can do whatever I want > > anyway; optimize it to 0". > > Have you actually seen that? Because if some compiler does this, we > have tons of other places that will hit this, and we'll need to try to > figure out some generic solution, or - more likely - just disable said > compiler "optimization". list_bl.h being an obvious victim... OK, convinced. No, I hadn't seen that, and I agree that we'll get some very visible breakage if that ever happens. Anyway, I think I've come up with a trick that would be proof against that kind of idiocy: struct mount *__aligned(1) *mnt_pprev_for_sb; IOW, tell compiler that this member contains a pointer to a possibly unaligned object containing a pointer to struct mount. Since the member is declared as pointer to unaligned object, compiler is not allowed to make any assumptions about the LSB of its value. For any type T, we are fine with T __aligned(1) *p; ... T *q = p; and as long as the value of p is actually aligned, no nasal daemons should fly. Since we never dereference them directly ('add' doesn't dereference them at all, 'del' copies to local struct mount ** and dereferences that), all generated memory accesses will be aligned ones. 
Since the only values we'll ever assign to that member will be addresses of normally aligned objects, we should be fine. Sure, __attribute__((__aligned__(...))) is not standard, but AFAICS we should not step into any UB in a compiler implementing it... Replacements for the 59..62 in followups (I've reordered them - easier that way). ^ permalink raw reply [flat|nested] 321+ messages in thread
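A userspace approximation of the __aligned(1) trick described in the message above, for anyone who wants to poke at it. Caveats: it spells the type through a typedef rather than the in-declarator form quoted in the mail, it relies on the GCC/clang aligned attribute, and every name in it (unaligned_node_ptr, HOLD_BIT, the helpers) is hypothetical. It shows the member declared as a pointer to a possibly unaligned pointer, the flag kept in its LSB, and the value copied to an ordinary struct node ** before any dereference.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct node;

/* "Pointer to struct node" that the compiler may not assume is aligned. */
typedef struct node *unaligned_node_ptr __attribute__((__aligned__(1)));

struct node {
	struct node *next;
	unaligned_node_ptr *pprev;	/* LSB doubles as a flag */
};

#define HOLD_BIT ((uintptr_t)1)

static int is_held(const struct node *n)
{
	return ((uintptr_t)n->pprev & HOLD_BIT) != 0;
}

static void set_hold(struct node *n)
{
	n->pprev = (unaligned_node_ptr *)((uintptr_t)n->pprev | HOLD_BIT);
}

static void clear_hold(struct node *n)
{
	n->pprev = (unaligned_node_ptr *)((uintptr_t)n->pprev & ~HOLD_BIT);
}

/* Removal copies to a normally typed pointer and dereferences only that. */
static void unlink_node(struct node *n)
{
	struct node **p;

	assert(!is_held(n));	/* flag must be clear during list operations */
	p = n->pprev;
	if (n->next)
		n->next->pprev = p;
	*p = n->next;
}
```

The casts are confined to the three flag helpers; list manipulation assigns between the two pointer types directly, as in the "T __aligned(1) *p; T *q = p;" pattern from the mail.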
* [59/63] simplify the callers of mnt_unhold_writers() 2025-08-29 6:03 ` Al Viro @ 2025-08-29 6:04 ` Al Viro 2025-09-01 11:20 ` Christian Brauner 2025-08-29 6:05 ` [60/63] setup_mnt(): primitive for connecting a mount to filesystem Al Viro ` (2 subsequent siblings) 3 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-29 6:04 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-fsdevel, brauner, jack The logics in cleanup on failure in mount_setattr_prepare() is simplified by having the mnt_hold_writers() failure followed by advancing m to the next node in the tree before leaving the loop. And since all calls are preceded by the same check that flag has been set and the function is inlined, let's just shift the check into it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 34 ++++++++++------------------------ 1 file changed, 10 insertions(+), 24 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 9e16231d4561..d8df1046e2f9 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -714,13 +714,14 @@ static inline int mnt_hold_writers(struct mount *mnt) * Stop preventing write access to @mnt allowing callers to gain write access * to @mnt again. * - * This function can only be called after a successful call to - * mnt_hold_writers(). + * This function can only be called after a call to mnt_hold_writers(). * * Context: This function expects lock_mount_hash() to be held. */ static inline void mnt_unhold_writers(struct mount *mnt) { + if (!(mnt->mnt.mnt_flags & MNT_WRITE_HOLD)) + return; /* * MNT_READONLY must become visible before ~MNT_WRITE_HOLD, so writers * that become unheld will see MNT_READONLY. 
@@ -4773,8 +4774,10 @@ static int mount_setattr_prepare(struct mount_kattr *kattr, struct mount *mnt) if (!mnt_allow_writers(kattr, m)) { err = mnt_hold_writers(m); - if (err) + if (err) { + m = next_mnt(m, mnt); break; + } } if (!(kattr->kflags & MOUNT_KATTR_RECURSE)) @@ -4782,25 +4785,9 @@ static int mount_setattr_prepare(struct mount_kattr *kattr, struct mount *mnt) } if (err) { - struct mount *p; - - /* - * If we had to call mnt_hold_writers() MNT_WRITE_HOLD will - * be set in @mnt_flags. The loop unsets MNT_WRITE_HOLD for all - * mounts and needs to take care to include the first mount. - */ - for (p = mnt; p; p = next_mnt(p, mnt)) { - /* If we had to hold writers unblock them. */ - if (p->mnt.mnt_flags & MNT_WRITE_HOLD) - mnt_unhold_writers(p); - - /* - * We're done once the first mount we changed got - * MNT_WRITE_HOLD unset. - */ - if (p == m) - break; - } + /* undo all mnt_hold_writers() we'd done */ + for (struct mount *p = mnt; p != m; p = next_mnt(p, mnt)) + mnt_unhold_writers(p); } return err; } @@ -4831,8 +4818,7 @@ static void mount_setattr_commit(struct mount_kattr *kattr, struct mount *mnt) WRITE_ONCE(m->mnt.mnt_flags, flags); /* If we had to hold writers unblock them. */ - if (m->mnt.mnt_flags & MNT_WRITE_HOLD) - mnt_unhold_writers(m); + mnt_unhold_writers(m); if (kattr->propagation) change_mnt_propagation(m, kattr->propagation); -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
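The cleanup-on-failure pattern in this patch (step the cursor past the failed element, then undo everything before the cursor) can be sketched in userspace as follows. This is an illustration under simplifying assumptions: a flat array replaces the next_mnt() tree walk, and struct item, hold_item(), unhold_item() and hold_all() are hypothetical. As with mnt_hold_writers(), the hold flag is taken to stay set when the hold attempt fails, which is why the failed element has to be included in the undo range.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flat stand-in for a mount tree walked in next_mnt() order. */
struct item {
	int needs_hold;	/* only some items get held at all */
	int busy;	/* hold_item() fails on busy items */
	int held;	/* set by hold_item(), cleared by unhold_item() */
};

static int hold_item(struct item *it)
{
	it->held = 1;			/* flag is set even if we then fail */
	return it->busy ? -1 : 0;
}

static void unhold_item(struct item *it)
{
	if (!it->held)			/* check folded into the helper */
		return;
	it->held = 0;
}

static int hold_all(struct item *items, size_t n)
{
	size_t i;
	int err = 0;

	for (i = 0; i < n; i++) {
		if (!items[i].needs_hold)
			continue;
		err = hold_item(&items[i]);
		if (err) {
			i++;	/* step past the failed item so the undo covers it */
			break;
		}
	}
	if (err) {
		/* undo every hold we have done, including the failed one */
		for (size_t j = 0; j < i; j++)
			unhold_item(&items[j]);
	}
	return err;
}
```

With the check inside unhold_item(), the undo loop can call it on every element in range without caring which ones were actually held, and the half-open range [start, cursor) replaces the old "stop after we reach m" test.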
* Re: [59/63] simplify the callers of mnt_unhold_writers() 2025-08-29 6:04 ` [59/63] simplify the callers of mnt_unhold_writers() Al Viro @ 2025-09-01 11:20 ` Christian Brauner 0 siblings, 0 replies; 321+ messages in thread From: Christian Brauner @ 2025-09-01 11:20 UTC (permalink / raw) To: Al Viro; +Cc: Linus Torvalds, linux-fsdevel, jack On Fri, Aug 29, 2025 at 07:04:36AM +0100, Al Viro wrote: > The logics in cleanup on failure in mount_setattr_prepare() is simplified > by having the mnt_hold_writers() failure followed by advancing m to the > next node in the tree before leaving the loop. > > And since all calls are preceded by the same check that flag has been set > and the function is inlined, let's just shift the check into it. > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > --- Reviewed-by: Christian Brauner <brauner@kernel.org> ^ permalink raw reply [flat|nested] 321+ messages in thread
* [60/63] setup_mnt(): primitive for connecting a mount to filesystem 2025-08-29 6:03 ` Al Viro 2025-08-29 6:04 ` [59/63] simplify the callers of mnt_unhold_writers() Al Viro @ 2025-08-29 6:05 ` Al Viro 2025-08-29 9:59 ` Christian Brauner 2025-09-01 11:17 ` [60/63] setup_mnt(): primitive for connecting a mount to filesystem Christian Brauner 2025-08-29 6:06 ` [61/63] preparations to taking MNT_WRITE_HOLD out of ->mnt_flags Al Viro 2025-08-29 6:07 ` [62/63] struct mount: relocate MNT_WRITE_HOLD bit Al Viro 3 siblings, 2 replies; 321+ messages in thread From: Al Viro @ 2025-08-29 6:05 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-fsdevel, brauner, jack Take the identical logics in vfs_create_mount() and clone_mnt() into a new helper that takes an empty struct mount and attaches it to given dentry (sub)tree. Should be called once in the lifetime of every mount, prior to making it visible in any data structures. After that point ->mnt_root and ->mnt_sb never change; ->mnt_root is a counting reference to dentry and ->mnt_sb - an active reference to superblock. Mount remains associated with that dentry tree all the way until the call of cleanup_mnt(), when the refcount eventually drops to zero. 
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 34 +++++++++++++++++----------------- 1 file changed, 17 insertions(+), 17 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index d8df1046e2f9..c769fc4051e0 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -1196,6 +1196,21 @@ static void commit_tree(struct mount *mnt) touch_mnt_namespace(n); } +static void setup_mnt(struct mount *m, struct dentry *root) +{ + struct super_block *s = root->d_sb; + + atomic_inc(&s->s_active); + m->mnt.mnt_sb = s; + m->mnt.mnt_root = dget(root); + m->mnt_mountpoint = m->mnt.mnt_root; + m->mnt_parent = m; + + lock_mount_hash(); + list_add_tail(&m->mnt_instance, &s->s_mounts); + unlock_mount_hash(); +} + /** * vfs_create_mount - Create a mount for a configured superblock * @fc: The configuration context with the superblock attached @@ -1219,15 +1234,8 @@ struct vfsmount *vfs_create_mount(struct fs_context *fc) if (fc->sb_flags & SB_KERNMOUNT) mnt->mnt.mnt_flags = MNT_INTERNAL; - atomic_inc(&fc->root->d_sb->s_active); - mnt->mnt.mnt_sb = fc->root->d_sb; - mnt->mnt.mnt_root = dget(fc->root); - mnt->mnt_mountpoint = mnt->mnt.mnt_root; - mnt->mnt_parent = mnt; + setup_mnt(mnt, fc->root); - lock_mount_hash(); - list_add_tail(&mnt->mnt_instance, &mnt->mnt.mnt_sb->s_mounts); - unlock_mount_hash(); return &mnt->mnt; } EXPORT_SYMBOL(vfs_create_mount); @@ -1285,7 +1293,6 @@ EXPORT_SYMBOL_GPL(vfs_kern_mount); static struct mount *clone_mnt(struct mount *old, struct dentry *root, int flag) { - struct super_block *sb = old->mnt.mnt_sb; struct mount *mnt; int err; @@ -1310,16 +1317,9 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root, if (mnt->mnt_group_id) set_mnt_shared(mnt); - atomic_inc(&sb->s_active); mnt->mnt.mnt_idmap = mnt_idmap_get(mnt_idmap(&old->mnt)); - mnt->mnt.mnt_sb = sb; - mnt->mnt.mnt_root = dget(root); - mnt->mnt_mountpoint = mnt->mnt.mnt_root; - mnt->mnt_parent = mnt; - lock_mount_hash(); - list_add_tail(&mnt->mnt_instance, 
&sb->s_mounts); - unlock_mount_hash(); + setup_mnt(mnt, root); if (flag & CL_PRIVATE) // we are done with it return mnt; -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* Re: [60/63] setup_mnt(): primitive for connecting a mount to filesystem 2025-08-29 6:05 ` [60/63] setup_mnt(): primitive for connecting a mount to filesystem Al Viro @ 2025-08-29 9:59 ` Christian Brauner 2025-08-29 16:37 ` Al Viro 2025-09-01 11:17 ` [60/63] setup_mnt(): primitive for connecting a mount to filesystem Christian Brauner 1 sibling, 1 reply; 321+ messages in thread From: Christian Brauner @ 2025-08-29 9:59 UTC (permalink / raw) To: Al Viro; +Cc: Linus Torvalds, linux-fsdevel, jack On Fri, Aug 29, 2025 at 07:05:22AM +0100, Al Viro wrote: > Take the identical logics in vfs_create_mount() and clone_mnt() into > a new helper that takes an empty struct mount and attaches it to > given dentry (sub)tree. > > Should be called once in the lifetime of every mount, prior to making > it visible in any data structures. > > After that point ->mnt_root and ->mnt_sb never change; ->mnt_root > is a counting reference to dentry and ->mnt_sb - an active reference > to superblock. > > Mount remains associated with that dentry tree all the way until > the call of cleanup_mnt(), when the refcount eventually drops > to zero. > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > --- Is this supposed to be the v3? I'm confused what I need to be looking at since it's a reply to v2 and some earlier review comments... ^ permalink raw reply [flat|nested] 321+ messages in thread
* Re: [60/63] setup_mnt(): primitive for connecting a mount to filesystem
  2025-08-29 9:59 ` Christian Brauner
@ 2025-08-29 16:37 ` Al Viro
  2025-08-30 4:36 ` Al Viro
  0 siblings, 1 reply; 321+ messages in thread
From: Al Viro @ 2025-08-29 16:37 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Linus Torvalds, linux-fsdevel, jack

On Fri, Aug 29, 2025 at 11:59:55AM +0200, Christian Brauner wrote:
> On Fri, Aug 29, 2025 at 07:05:22AM +0100, Al Viro wrote:
> > Take the identical logics in vfs_create_mount() and clone_mnt() into
> > a new helper that takes an empty struct mount and attaches it to
> > given dentry (sub)tree.
> >
> > Should be called once in the lifetime of every mount, prior to making
> > it visible in any data structures.
> >
> > After that point ->mnt_root and ->mnt_sb never change; ->mnt_root
> > is a counting reference to dentry and ->mnt_sb - an active reference
> > to superblock.
> >
> > Mount remains associated with that dentry tree all the way until
> > the call of cleanup_mnt(), when the refcount eventually drops
> > to zero.
> >
> > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> > ---
>
> Is this supposed to be the v3? I'm confused what I need to be looking
> at since it's a reply to v2 and some earlier review comments...

It would be in v3, but I didn't feel like sending another 63-patch
mailbomb for the sake of these 4 changed commits (well, and a cosmetic
change in #33, with matching modification in #35, ending with both
being cleaner - with the same resulting tree after #35).

These 4 do replace #59..#62 in v3.

^ permalink raw reply [flat|nested] 321+ messages in thread
* Re: [60/63] setup_mnt(): primitive for connecting a mount to filesystem
  2025-08-29 16:37 ` Al Viro
@ 2025-08-30 4:36 ` Al Viro
  2025-08-30 7:33 ` [RFC] does # really need to be escaped in devnames? Al Viro
  0 siblings, 1 reply; 321+ messages in thread
From: Al Viro @ 2025-08-30 4:36 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Linus Torvalds, linux-fsdevel, jack

On Fri, Aug 29, 2025 at 05:37:17PM +0100, Al Viro wrote:

> It would be in v3, but I didn't feel like sending another 63-patch
> mailbomb for the sake of these 4 changed commits (well, and a cosmetic
> change in #33, with matching modification in #35, ending with both
> being cleaner - with the same resulting tree after #35).
>
> These 4 do replace #59..#62 in v3.

Speaking of v3 - does anybody have objections to the following?

1) allow ->show_path() to return -EOPNOTSUPP, interpreted as
"fall back to default seq_path(...)"? E.g. kernfs_sop_show_path()
could return that if there's no ->scops->show_path().

2) pass the sodding escape set as explicit argument, made an argument
of fs/namespace.c:show_path() as well.

3) similar for ->show_devname().

4) ... and to hell with those string_unescape_inplace() calls in there.

^ permalink raw reply [flat|nested] 321+ messages in thread
* [RFC] does # really need to be escaped in devnames?
  2025-08-30 4:36 ` Al Viro
@ 2025-08-30 7:33 ` Al Viro
  2025-08-30 19:40 ` Linus Torvalds
  0 siblings, 1 reply; 321+ messages in thread
From: Al Viro @ 2025-08-30 7:33 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, linux-fsdevel, jack, Siddhesh Poyarekar, Ian Kent, David Howells, Christian Brauner

On one hand, we have commit ed5fce76b5ea "vfs: escape hash as well"
which added # to the escape set for devname in /proc/*/mount*; on another
there's nfs_show_devname() doing

	seq_escape(m, devname, " \t\n\\");

and similar for btrfs.

And then there is afs_show_devname() that outright includes # in that
thing on regular basis:

	char pref = '%';
	...
	switch (volume->type) {
	case AFSVL_RWVOL:
		break;
	case AFSVL_ROVOL:
		pref = '#';
		if (volume->type_force)
			suf = ".readonly";
		break;
	case AFSVL_BACKVOL:
		pref = '#';
		suf = ".backup";
		break;
	}

	seq_printf(m, "%c%s:%s%s", pref, cell->name, volume->name, suf);

For NFS and btrfs ones I might be convinced to add # to escape set;
for AFS, though, I strongly suspect that userland would be very unhappy,
and that's userland predating whatever code that "aims to parse fstab as
well as /proc/mounts with the same logic" ed5fce76b5ea is referring to.

So... Siddhesh, could you clarify the claim about breaking getmntent(3)?
Does it or does it not happen on every system that has readonly AFS
volumes mounted?

^ permalink raw reply [flat|nested] 321+ messages in thread
* Re: [RFC] does # really need to be escaped in devnames? 2025-08-30 7:33 ` [RFC] does # really need to be escaped in devnames? Al Viro @ 2025-08-30 19:40 ` Linus Torvalds 2025-08-30 20:42 ` Al Viro 2025-09-02 15:03 ` Siddhesh Poyarekar 0 siblings, 2 replies; 321+ messages in thread From: Linus Torvalds @ 2025-08-30 19:40 UTC (permalink / raw) To: Al Viro Cc: linux-fsdevel, jack, Siddhesh Poyarekar, Ian Kent, David Howells, Christian Brauner On Sat, 30 Aug 2025 at 00:33, Al Viro <viro@zeniv.linux.org.uk> wrote: > > So... Siddhesh, could you clarify the claim about breaking getmntent(3)? > Does it or does it not happen on every system that has readonly AFS > volumes mounted? Hmm. Looking at various source trees using Debian code search, at least dietlibc doesn't treat '#' specially at all. And glibc seems to treat only a line that *starts* with a '#' (possibly preceded by space/tab combinations) as an empty line. klibc checks for '#' at the beginning of the file (without any potential space skipping before) Busybox seems to do the same "skip whitespace, then skip lines starting with '#'" that glibc does. So I think the '#'-escaping logic is wrong. We should only escape '#' marks at the beginning of a line (since we already escape spaces and tabs, the "preceded by whitespace" doesn't matter). And that means that we shouldn't do it in 'mangle()' at all - because it's irrelevant for any field but the first. And the first field in /proc/mounts is that 'r->mnt_devname' (or show_devname), and again, that should only trigger on the first character, not every character. Now, could there be other libraries that get this even worse wrong? Of course. But Linus ^ permalink raw reply [flat|nested] 321+ messages in thread
* Re: [RFC] does # really need to be escaped in devnames?
  2025-08-30 19:40 ` Linus Torvalds
@ 2025-08-30 20:42 ` Al Viro
  2025-09-02 15:03 ` Siddhesh Poyarekar
  1 sibling, 0 replies; 321+ messages in thread
From: Al Viro @ 2025-08-30 20:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-fsdevel, jack, Siddhesh Poyarekar, Ian Kent, David Howells, Christian Brauner

On Sat, Aug 30, 2025 at 12:40:32PM -0700, Linus Torvalds wrote:
> On Sat, 30 Aug 2025 at 00:33, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > So... Siddhesh, could you clarify the claim about breaking getmntent(3)?
> > Does it or does it not happen on every system that has readonly AFS
> > volumes mounted?
>
> Hmm. Looking at various source trees using Debian code search, at
> least dietlibc doesn't treat '#' specially at all.
>
> And glibc seems to treat only a line that *starts* with a '#'
> (possibly preceded by space/tab combinations) as an empty line.
>
> klibc checks for '#' at the beginning of the file (without any
> potential space skipping before)
>
> Busybox seems to do the same "skip whitespace, then skip lines
> starting with '#'" that glibc does.
>
> So I think the '#'-escaping logic is wrong. We should only escape '#'
> marks at the beginning of a line (since we already escape spaces and
> tabs, the "preceded by whitespace" doesn't matter).
>
> And that means that we shouldn't do it in 'mangle()' at all - because
> it's irrelevant for any field but the first.
>
> And the first field in /proc/mounts is that 'r->mnt_devname' (or
> show_devname), and again, that should only trigger on the first
> character, not every character.

*nod*

Amusingly enough, glibc addmntent(3) does *not* consider # for an octal escape.

BTW, another amusing bogosity:

	seq_escape(m, "blah", "X") => "blah"
	seq_escape(m, "blah", "b") => "\142lah"
	seq_escape(m, "blah", "") => "\142\154\141\150"

IOW, about 10 years ago an empty string switched meaning from "escape nothing"
to "escape everything"...
^ permalink raw reply [flat|nested] 321+ messages in thread
* Re: [RFC] does # really need to be escaped in devnames? 2025-08-30 19:40 ` Linus Torvalds 2025-08-30 20:42 ` Al Viro @ 2025-09-02 15:03 ` Siddhesh Poyarekar 2025-09-02 16:30 ` Linus Torvalds 2025-09-02 17:48 ` David Howells 1 sibling, 2 replies; 321+ messages in thread From: Siddhesh Poyarekar @ 2025-09-02 15:03 UTC (permalink / raw) To: Linus Torvalds, Al Viro Cc: linux-fsdevel, jack, Ian Kent, David Howells, Christian Brauner On 2025-08-30 15:40, Linus Torvalds wrote: > On Sat, 30 Aug 2025 at 00:33, Al Viro <viro@zeniv.linux.org.uk> wrote: >> >> So... Siddhesh, could you clarify the claim about breaking getmntent(3)? >> Does it or does it not happen on every system that has readonly AFS >> volumes mounted? > > Hmm. Looking at various source trees using Debian code search, at > least dietlibc doesn't treat '#' specially at all. > > And glibc seems to treat only a line that *starts* with a '#' > (possibly preceded by space/tab combinations) as an empty line. > > klibc checks for '#' at the beginning of the file (without any > potential space skipping before) > > Busybox seems to do the same "skip whitespace, then skip lines > starting with '#'" that glibc does. > > So I think the '#'-escaping logic is wrong. We should only escape '#' > marks at the beginning of a line (since we already escape spaces and > tabs, the "preceded by whitespace" doesn't matter). This was actually the original issue I had tried to address, escaping '#' in the beginning of the devname because it ends up in the beginning of the line, thus masking out the entire line in mounts. I don't remember at what point I concluded that escaping '#' always was the answer (maybe to protect against any future instances where userspace ends up ignoring the rest of the line following the '#'), but it appears to be wrong. Sid ^ permalink raw reply [flat|nested] 321+ messages in thread
* Re: [RFC] does # really need to be escaped in devnames? 2025-09-02 15:03 ` Siddhesh Poyarekar @ 2025-09-02 16:30 ` Linus Torvalds 2025-09-02 16:39 ` Siddhesh Poyarekar 2025-09-02 17:48 ` David Howells 1 sibling, 1 reply; 321+ messages in thread From: Linus Torvalds @ 2025-09-02 16:30 UTC (permalink / raw) To: Siddhesh Poyarekar Cc: Al Viro, linux-fsdevel, jack, Ian Kent, David Howells, Christian Brauner On Tue, 2 Sept 2025 at 08:03, Siddhesh Poyarekar <siddhesh@gotplt.org> wrote: > > This was actually the original issue I had tried to address, escaping > '#' in the beginning of the devname because it ends up in the beginning > of the line, thus masking out the entire line in mounts. I don't > remember at what point I concluded that escaping '#' always was the > answer (maybe to protect against any future instances where userspace > ends up ignoring the rest of the line following the '#'), but it appears > to be wrong. I wonder if instead of escaping hash-marks we could just disallow them as the first character in devname. How did this issue with hash-marks get found? Is there some real use - in which case we obviously can't disallow them - or was this from some fuzzing test that happened to hit it? Linus ^ permalink raw reply [flat|nested] 321+ messages in thread
* Re: [RFC] does # really need to be escaped in devnames? 2025-09-02 16:30 ` Linus Torvalds @ 2025-09-02 16:39 ` Siddhesh Poyarekar 0 siblings, 0 replies; 321+ messages in thread From: Siddhesh Poyarekar @ 2025-09-02 16:39 UTC (permalink / raw) To: Linus Torvalds Cc: Al Viro, linux-fsdevel, jack, Ian Kent, David Howells, Christian Brauner On 2025-09-02 12:30, Linus Torvalds wrote: > On Tue, 2 Sept 2025 at 08:03, Siddhesh Poyarekar <siddhesh@gotplt.org> wrote: >> >> This was actually the original issue I had tried to address, escaping >> '#' in the beginning of the devname because it ends up in the beginning >> of the line, thus masking out the entire line in mounts. I don't >> remember at what point I concluded that escaping '#' always was the >> answer (maybe to protect against any future instances where userspace >> ends up ignoring the rest of the line following the '#'), but it appears >> to be wrong. > > I wonder if instead of escaping hash-marks we could just disallow them > as the first character in devname. > > How did this issue with hash-marks get found? Is there some real use - > in which case we obviously can't disallow them - or was this from some > fuzzing test that happened to hit it? The original issue was that devname being blank broke parsing of mounts, which was fixed with Ian's patch[1]. While debugging that issue I stumbled onto the fact that if the devname started with #, it would make the mount invisible to getmntent in glibc, since it ignores lines starting with #. Sid [1] https://lkml.org/lkml/2022/6/17/27 ^ permalink raw reply [flat|nested] 321+ messages in thread
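The failure mode described above (a devname starting with '#' hiding the whole line from getmntent) can be illustrated with a small userspace sketch. It assumes the parser behaviour reported earlier in this thread for glibc and busybox: skip leading spaces and tabs, then treat a line starting with '#' as a comment. is_comment_line() is a hypothetical helper written for illustration; it is not glibc's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper (not glibc's actual code): report whether a
 * mounts-style line would be skipped by a parser that ignores leading
 * blanks and then treats '#' as the start of a comment line.
 */
static bool is_comment_line(const char *line)
{
	assert(line != NULL);

	while (*line == ' ' || *line == '\t')
		line++;
	return *line == '#';
}
```

Under that assumption only the first non-blank character of the line matters, so a '#' anywhere later in the devname (or in any other field) would not hide the entry.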
* Re: [RFC] does # really need to be escaped in devnames? 2025-09-02 15:03 ` Siddhesh Poyarekar 2025-09-02 16:30 ` Linus Torvalds @ 2025-09-02 17:48 ` David Howells 2025-09-02 20:04 ` Linus Torvalds 1 sibling, 1 reply; 321+ messages in thread From: David Howells @ 2025-09-02 17:48 UTC (permalink / raw) To: Linus Torvalds Cc: dhowells, Siddhesh Poyarekar, Al Viro, linux-fsdevel, jack, Ian Kent, Christian Brauner, Jeffrey Altman, linux-afs Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Tue, 2 Sept 2025 at 08:03, Siddhesh Poyarekar <siddhesh@gotplt.org> wrote: > > > > This was actually the original issue I had tried to address, escaping > > '#' in the beginning of the devname because it ends up in the beginning > > of the line, thus masking out the entire line in mounts. I don't > > remember at what point I concluded that escaping '#' always was the > > answer (maybe to protect against any future instances where userspace > > ends up ignoring the rest of the line following the '#'), but it appears > > to be wrong. > > I wonder if instead of escaping hash-marks we could just disallow them > as the first character in devname. The problem with that is that it appears that people are making use of this. Mount /afs with "-o dynroot" isn't a problem as that shouldn't be given a device name - and that's the main way people access AFS. With OpenAFS I don't think you can do this at all since it has a single superblock that it crams everything under. For AuriStor, I think you can mount individual volumes, but I'm not sure how it works. For Linux's AFS, I made every volume have its own superblock. The standard format of AFS volume names is [%#][<cell>:]<volume-name-or-id> but I could make it an option to stick something on the front and use that internally and display that in /proc/mounts, e.g.: mount afs:#openafs.org:afs.root /mnt which would at least mean that sh and bash wouldn't need the "#" escaping. 
The problem is that the # and the % have specific documented meanings, so if I was to get rid of the '#' entirely, I would need some other marker. Maybe it would be sufficient to just go on the presence or not of a '%'. Maybe I could go with something like: openafs.org:root.cell:ro openafs.org:root.cell:rw openafs.org:root.cell:bak rather than use #/%. I don't think there should be a problem with still accepting lines beginning with '#' in mount() if I display them with an appropriate prefix. That would at least permit backward compatibility. David ^ permalink raw reply [flat|nested] 321+ messages in thread
* Re: [RFC] does # really need to be escaped in devnames? 2025-09-02 17:48 ` David Howells @ 2025-09-02 20:04 ` Linus Torvalds 0 siblings, 0 replies; 321+ messages in thread From: Linus Torvalds @ 2025-09-02 20:04 UTC (permalink / raw) To: David Howells Cc: Siddhesh Poyarekar, Al Viro, linux-fsdevel, jack, Ian Kent, Christian Brauner, Jeffrey Altman, linux-afs [-- Attachment #1: Type: text/plain, Size: 1275 bytes --] On Tue, 2 Sept 2025 at 10:48, David Howells <dhowells@redhat.com> wrote: > > The problem with that is that it appears that people are making use of this. Ok. So disallowing it isn't in the cards, but let's try to minimize the impact. > The standard format of AFS volume names is [%#][<cell>:]<volume-name-or-id> > but I could make it an option to stick something on the front and use that > internally and display that in /proc/mounts, e.g.: > > mount afs:#openafs.org:afs.root /mnt Yeah, let's aim for trying to avoid the '#' at the beginning when all possible, by trying to make at least the default formats not start with a hash. And then make the escaping logic only escape the hashmark if it's the first character. > I don't think there should be a problem with still accepting lines beginning > with '#' in mount() if I display them with an appropriate prefix. That would > at least permit backward compatibility. Well, right now we obviously escape it everywhere, but how about we make it the rule that 'show_devname()' at least doesn't use it as the first character, and then if somebody uses '#' for the mount name from user space, we would just do the octal-escape then. Something ENTIRELY UNTESTED like this, in other words? 
Linus [-- Attachment #2: patch.diff --] [-- Type: text/x-patch, Size: 1344 bytes --] fs/afs/super.c | 2 +- fs/proc_namespace.c | 9 +++++++-- 2 files changed, 8 insertions(+), 3 deletions(-) diff --git a/fs/afs/super.c b/fs/afs/super.c index da407f2d6f0d..31f9cc30ae23 100644 --- a/fs/afs/super.c +++ b/fs/afs/super.c @@ -180,7 +180,7 @@ static int afs_show_devname(struct seq_file *m, struct dentry *root) break; } - seq_printf(m, "%c%s:%s%s", pref, cell->name, volume->name, suf); + seq_printf(m, "afs-%c%s:%s%s", pref, cell->name, volume->name, suf); return 0; } diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c index 5c555db68aa2..ca5773bfb98e 100644 --- a/fs/proc_namespace.c +++ b/fs/proc_namespace.c @@ -86,7 +86,7 @@ static void show_vfsmnt_opts(struct seq_file *m, struct vfsmount *mnt) static inline void mangle(struct seq_file *m, const char *s) { - seq_escape(m, s, " \t\n\\#"); + seq_escape(m, s, " \t\n\\"); } static void show_type(struct seq_file *m, struct super_block *sb) @@ -111,7 +111,12 @@ static int show_vfsmnt(struct seq_file *m, struct vfsmount *mnt) if (err) goto out; } else { - mangle(m, r->mnt_devname); + const char *mnt_devname = r->mnt_devname; + if (*mnt_devname == '#') { + seq_printf(m, "\\%o", '#'); + mnt_devname++; + } + mangle(m, mnt_devname); } seq_putc(m, ' '); /* mountpoints outside of chroot jail will give SEQ_SKIP on this */ ^ permalink raw reply related [flat|nested] 321+ messages in thread
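The rule in the patch above (octal-escape '#' only when it is the first character of the devname, while space, tab, newline and backslash stay escaped everywhere) can be sketched in userspace roughly as follows. This is a sketch under assumptions, not kernel code: escape_devname() is a hypothetical stand-in for the mangle()/seq_escape() path, and it prints three-digit octal (\043) for every escaped byte, whereas the "\\%o" in the untested patch would print \43.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical userspace stand-in for the escaping applied to the first
 * field of /proc/mounts.  Writes the escaped form of @s into @out (at most
 * @size bytes including the NUL) and returns the length that would have
 * been needed, snprintf()-style.  Output is cut at a whole-character
 * boundary if it does not fit.
 */
static size_t escape_devname(char *out, size_t size, const char *s)
{
	static const char always[] = " \t\n\\";	/* escaped in any position */
	size_t len = 0;		/* length needed so far */
	size_t written = 0;	/* bytes actually stored in @out */

	assert(size > 0);
	for (size_t i = 0; s[i] != '\0'; i++) {
		unsigned char c = (unsigned char)s[i];
		/* '#' matters only as the very first character of the line */
		int esc = strchr(always, c) != NULL || (i == 0 && c == '#');
		char buf[5];
		int n = esc ? snprintf(buf, sizeof(buf), "\\%03o", (unsigned int)c)
			    : snprintf(buf, sizeof(buf), "%c", c);

		/* store only while nothing has been dropped and it still fits */
		if (len == written && written + (size_t)n < size) {
			memcpy(out + written, buf, (size_t)n);
			written += (size_t)n;
		}
		len += (size_t)n;
	}
	out[written] = '\0';
	return len;
}
```

With this rule "#root.cell" would be shown as "\043root.cell", while a prefixed name such as "afs-#openafs.org:root.cell" would pass through unchanged.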
* Re: [60/63] setup_mnt(): primitive for connecting a mount to filesystem 2025-08-29 6:05 ` [60/63] setup_mnt(): primitive for connecting a mount to filesystem Al Viro 2025-08-29 9:59 ` Christian Brauner @ 2025-09-01 11:17 ` Christian Brauner 1 sibling, 0 replies; 321+ messages in thread From: Christian Brauner @ 2025-09-01 11:17 UTC (permalink / raw) To: Al Viro; +Cc: Linus Torvalds, linux-fsdevel, jack On Fri, Aug 29, 2025 at 07:05:22AM +0100, Al Viro wrote: > Take the identical logics in vfs_create_mount() and clone_mnt() into > a new helper that takes an empty struct mount and attaches it to > given dentry (sub)tree. > > Should be called once in the lifetime of every mount, prior to making > it visible in any data structures. > > After that point ->mnt_root and ->mnt_sb never change; ->mnt_root > is a counting reference to dentry and ->mnt_sb - an active reference > to superblock. > > Mount remains associated with that dentry tree all the way until > the call of cleanup_mnt(), when the refcount eventually drops > to zero. > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > --- Reviewed-by: Christian Brauner <brauner@kernel.org> ^ permalink raw reply [flat|nested] 321+ messages in thread
* [61/63] preparations to taking MNT_WRITE_HOLD out of ->mnt_flags 2025-08-29 6:03 ` Al Viro 2025-08-29 6:04 ` [59/63] simplify the callers of mnt_unhold_writers() Al Viro 2025-08-29 6:05 ` [60/63] setup_mnt(): primitive for connecting a mount to filesystem Al Viro @ 2025-08-29 6:06 ` Al Viro 2025-09-01 11:27 ` Christian Brauner 2025-08-29 6:07 ` [62/63] struct mount: relocate MNT_WRITE_HOLD bit Al Viro 3 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-29 6:06 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-fsdevel, brauner, jack We have an unpleasant wart in accessibility rules for struct mount. There are per-superblock lists of mounts, used by sb_prepare_remount_readonly() to check if any of those is currently claimed for write access and to block further attempts to get write access on those until we are done. As soon as it is attached to a filesystem, mount becomes reachable via that list. Only sb_prepare_remount_readonly() traverses it and it only accesses a few members of struct mount. Unfortunately, ->mnt_flags is one of those and it is modified - MNT_WRITE_HOLD set and then cleared. It is done under mount_lock, so from the locking rules POV everything's fine. However, it has easily overlooked implications - once mount has been attached to a filesystem, it has to be treated as globally visible. In particular, initializing ->mnt_flags *must* be done either prior to that point or under mount_lock. All other members are still private at that point. Life gets simpler if we move that bit (and that's *all* that can get touched by access via this list) out of ->mnt_flags. It's not even hard to do - currently the list is implemented as list_head one, anchored in super_block->s_mounts and linked via mount->mnt_instance. 
As the first step, switch it to hlist-like open-coded structure - address of the first mount in the set is stored in ->s_mounts and ->mnt_instance replaced with ->mnt_next_for_sb and ->mnt_pprev_for_sb - the former either NULL or pointing to the next mount in set, the latter - address of either ->s_mounts or ->mnt_next_for_sb in the previous element of the set. In the next commit we'll steal the LSB of ->mnt_pprev_for_sb as replacement for MNT_WRITE_HOLD. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/mount.h | 4 +++- fs/namespace.c | 38 +++++++++++++++++++++++++++++--------- fs/super.c | 3 +-- include/linux/fs.h | 4 +++- 4 files changed, 36 insertions(+), 13 deletions(-) diff --git a/fs/mount.h b/fs/mount.h index 04d0eadc4c10..b208f69f69d7 100644 --- a/fs/mount.h +++ b/fs/mount.h @@ -64,7 +64,9 @@ struct mount { #endif struct list_head mnt_mounts; /* list of children, anchored here */ struct list_head mnt_child; /* and going through their mnt_child */ - struct list_head mnt_instance; /* mount instance on sb->s_mounts */ + struct mount *mnt_next_for_sb; /* the next two fields are hlist_node, */ + struct mount * __aligned(1) *mnt_pprev_for_sb; + /* except that LSB of pprev will be stolen */ const char *mnt_devname; /* Name of device e.g. 
/dev/dsk/hda1 */ struct list_head mnt_list; struct list_head mnt_expire; /* link in fs-specific expiry list */ diff --git a/fs/namespace.c b/fs/namespace.c index c769fc4051e0..06be5b65b559 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -730,6 +730,27 @@ static inline void mnt_unhold_writers(struct mount *mnt) mnt->mnt.mnt_flags &= ~MNT_WRITE_HOLD; } +static inline void mnt_del_instance(struct mount *m) +{ + struct mount **p = m->mnt_pprev_for_sb; + struct mount *next = m->mnt_next_for_sb; + + if (next) + next->mnt_pprev_for_sb = p; + *p = next; +} + +static inline void mnt_add_instance(struct mount *m, struct super_block *s) +{ + struct mount *first = s->s_mounts; + + if (first) + first->mnt_pprev_for_sb = &m->mnt_next_for_sb; + m->mnt_next_for_sb = first; + m->mnt_pprev_for_sb = &s->s_mounts; + s->s_mounts = m; +} + static int mnt_make_readonly(struct mount *mnt) { int ret; @@ -743,7 +764,6 @@ static int mnt_make_readonly(struct mount *mnt) int sb_prepare_remount_readonly(struct super_block *sb) { - struct mount *mnt; int err = 0; /* Racy optimization. 
Recheck the counter under MNT_WRITE_HOLD */ @@ -751,9 +771,9 @@ int sb_prepare_remount_readonly(struct super_block *sb) return -EBUSY; lock_mount_hash(); - list_for_each_entry(mnt, &sb->s_mounts, mnt_instance) { - if (!(mnt->mnt.mnt_flags & MNT_READONLY)) { - err = mnt_hold_writers(mnt); + for (struct mount *m = sb->s_mounts; m; m = m->mnt_next_for_sb) { + if (!(m->mnt.mnt_flags & MNT_READONLY)) { + err = mnt_hold_writers(m); if (err) break; } @@ -763,9 +783,9 @@ int sb_prepare_remount_readonly(struct super_block *sb) if (!err) sb_start_ro_state_change(sb); - list_for_each_entry(mnt, &sb->s_mounts, mnt_instance) { - if (mnt->mnt.mnt_flags & MNT_WRITE_HOLD) - mnt->mnt.mnt_flags &= ~MNT_WRITE_HOLD; + for (struct mount *m = sb->s_mounts; m; m = m->mnt_next_for_sb) { + if (m->mnt.mnt_flags & MNT_WRITE_HOLD) + m->mnt.mnt_flags &= ~MNT_WRITE_HOLD; } unlock_mount_hash(); @@ -1207,7 +1227,7 @@ static void setup_mnt(struct mount *m, struct dentry *root) m->mnt_parent = m; lock_mount_hash(); - list_add_tail(&m->mnt_instance, &s->s_mounts); + mnt_add_instance(m, s); unlock_mount_hash(); } @@ -1425,7 +1445,7 @@ static void mntput_no_expire(struct mount *mnt) mnt->mnt.mnt_flags |= MNT_DOOMED; rcu_read_unlock(); - list_del(&mnt->mnt_instance); + mnt_del_instance(mnt); if (unlikely(!list_empty(&mnt->mnt_expire))) list_del(&mnt->mnt_expire); diff --git a/fs/super.c b/fs/super.c index 7f876f32343a..3b0f49e1b817 100644 --- a/fs/super.c +++ b/fs/super.c @@ -323,7 +323,6 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags, if (!s) return NULL; - INIT_LIST_HEAD(&s->s_mounts); s->s_user_ns = get_user_ns(user_ns); init_rwsem(&s->s_umount); lockdep_set_class(&s->s_umount, &type->s_umount_key); @@ -408,7 +407,7 @@ static void __put_super(struct super_block *s) list_del_init(&s->s_list); WARN_ON(s->s_dentry_lru.node); WARN_ON(s->s_inode_lru.node); - WARN_ON(!list_empty(&s->s_mounts)); + WARN_ON(s->s_mounts); call_rcu(&s->rcu, destroy_super_rcu); } } diff --git 
a/include/linux/fs.h b/include/linux/fs.h index d7ab4f96d705..0e9c7f1460dc 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1324,6 +1324,8 @@ struct sb_writers { struct percpu_rw_semaphore rw_sem[SB_FREEZE_LEVELS]; }; +struct mount; + struct super_block { struct list_head s_list; /* Keep this first */ dev_t s_dev; /* search index; _not_ kdev_t */ @@ -1358,7 +1360,7 @@ struct super_block { __u16 s_encoding_flags; #endif struct hlist_bl_head s_roots; /* alternate root dentries for NFS */ - struct list_head s_mounts; /* list of mounts; _not_ for fs use */ + struct mount *s_mounts; /* list of mounts; _not_ for fs use */ struct block_device *s_bdev; /* can go away once we use an accessor for @s_bdev_file */ struct file *s_bdev_file; struct backing_dev_info *s_bdi; -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
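The open-coded hlist-like structure introduced by this patch can be tried out in userspace with a minimal sketch along these lines. struct node and struct owner are hypothetical stand-ins for struct mount and struct super_block; node_add() and node_del() mirror mnt_add_instance() and mnt_del_instance() from the diff above, with no locking and with an extra sanity assert that is not in the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct mount and struct super_block. */
struct node {
	struct node *next;	/* plays the role of ->mnt_next_for_sb */
	struct node **pprev;	/* plays the role of ->mnt_pprev_for_sb */
	int id;
};

struct owner {
	struct node *first;	/* plays the role of ->s_mounts */
};

/* Insert at the head of the set, mirroring mnt_add_instance(). */
static void node_add(struct node *n, struct owner *o)
{
	struct node *first = o->first;

	if (first)
		first->pprev = &n->next;
	n->next = first;
	n->pprev = &o->first;
	o->first = n;
}

/* Unlink without needing the owner, mirroring mnt_del_instance(). */
static void node_del(struct node *n)
{
	struct node **p = n->pprev;
	struct node *next = n->next;

	assert(*p == n);	/* whatever points at us must point at us */
	if (next)
		next->pprev = p;
	*p = next;
}
```

The point of the pprev pointer is that removal never has to distinguish "first element" from "middle element": it just rewrites whatever pointer currently leads to the node, be that the head or the previous node's next field.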
* Re: [61/63] preparations to taking MNT_WRITE_HOLD out of ->mnt_flags 2025-08-29 6:06 ` [61/63] preparations to taking MNT_WRITE_HOLD out of ->mnt_flags Al Viro @ 2025-09-01 11:27 ` Christian Brauner 0 siblings, 0 replies; 321+ messages in thread From: Christian Brauner @ 2025-09-01 11:27 UTC (permalink / raw) To: Al Viro; +Cc: Linus Torvalds, linux-fsdevel, jack On Fri, Aug 29, 2025 at 07:06:15AM +0100, Al Viro wrote: > We have an unpleasant wart in accessibility rules for struct mount. There > are per-superblock lists of mounts, used by sb_prepare_remount_readonly() > to check if any of those is currently claimed for write access and to > block further attempts to get write access on those until we are done. > > As soon as it is attached to a filesystem, mount becomes reachable > via that list. Only sb_prepare_remount_readonly() traverses it and > it only accesses a few members of struct mount. Unfortunately, > ->mnt_flags is one of those and it is modified - MNT_WRITE_HOLD set > and then cleared. It is done under mount_lock, so from the locking > rules POV everything's fine. > > However, it has easily overlooked implications - once mount has been > attached to a filesystem, it has to be treated as globally visible. > In particular, initializing ->mnt_flags *must* be done either prior > to that point or under mount_lock. All other members are still > private at that point. > > Life gets simpler if we move that bit (and that's *all* that can get > touched by access via this list) out of ->mnt_flags. It's not even > hard to do - currently the list is implemented as list_head one, > anchored in super_block->s_mounts and linked via mount->mnt_instance. 
> > As the first step, switch it to hlist-like open-coded structure - > address of the first mount in the set is stored in ->s_mounts > and ->mnt_instance replaced with ->mnt_next_for_sb and ->mnt_pprev_for_sb - > the former either NULL or pointing to the next mount in set, the > latter - address of either ->s_mounts or ->mnt_next_for_sb in the > previous element of the set. > > In the next commit we'll steal the LSB of ->mnt_pprev_for_sb as > replacement for MNT_WRITE_HOLD. > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > --- Reviewed-by: Christian Brauner <brauner@kernel.org> ^ permalink raw reply [flat|nested] 321+ messages in thread
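For readers who want the list layout described in the commit message above spelled out, here is a minimal userspace sketch of an open-coded hlist-like set. It is not the kernel code: struct owner and struct node are hypothetical stand-ins for struct super_block and struct mount, and their first/next/pprev members play the roles of ->s_mounts, ->mnt_next_for_sb and ->mnt_pprev_for_sb. No locking is shown; the real thing does all of this under mount_lock.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct super_block and struct mount. */
struct node;

struct owner {
	struct node *first;	/* plays the role of ->s_mounts */
};

struct node {
	struct node *next;	/* plays the role of ->mnt_next_for_sb */
	struct node **pprev;	/* plays the role of ->mnt_pprev_for_sb */
	int id;
};

/* Insert n at the head of o's set. */
static void node_add(struct owner *o, struct node *n)
{
	n->next = o->first;
	if (n->next)
		n->next->pprev = &n->next;
	n->pprev = &o->first;	/* address of the pointer that points at n */
	o->first = n;
}

/* Unlink n; note that the owner does not need to be known. */
static void node_del(struct node *n)
{
	assert(n->pprev);	/* must currently be on a list */
	*n->pprev = n->next;
	if (n->next)
		n->next->pprev = n->pprev;
	n->next = NULL;
	n->pprev = NULL;
}

/* Forward-only traversal; it never looks at pprev. */
static int node_count(const struct owner *o)
{
	int count = 0;

	for (const struct node *p = o->first; p; p = p->next)
		count++;
	return count;
}
```

The property that matters for the next patch is visible in node_count(): a forward traversal only follows next, so whatever is stored in the low bit of pprev cannot disturb it.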
* [62/63] struct mount: relocate MNT_WRITE_HOLD bit 2025-08-29 6:03 ` Al Viro ` (2 preceding siblings ...) 2025-08-29 6:06 ` [61/63] preparations to taking MNT_WRITE_HOLD out of ->mnt_flags Al Viro @ 2025-08-29 6:07 ` Al Viro 2025-09-01 11:26 ` Christian Brauner 3 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-29 6:07 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-fsdevel, brauner, jack ... from ->mnt_flags to LSB of ->mnt_pprev_for_sb. This is safe - we always set and clear it within the same mount_lock scope, so we won't interfere with list operations - traversals are always forward, so they don't even look at ->mnt_prev_for_sb and both insertions and removals are in mount_lock scopes of their own, so that bit will be clear in *all* mount instances during those. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/mount.h | 25 ++++++++++++++++++++++++- fs/namespace.c | 34 +++++++++++++++++----------------- include/linux/mount.h | 3 +-- 3 files changed, 42 insertions(+), 20 deletions(-) diff --git a/fs/mount.h b/fs/mount.h index b208f69f69d7..40cf16544317 100644 --- a/fs/mount.h +++ b/fs/mount.h @@ -66,7 +66,8 @@ struct mount { struct list_head mnt_child; /* and going through their mnt_child */ struct mount *mnt_next_for_sb; /* the next two fields are hlist_node, */ struct mount * __aligned(1) *mnt_pprev_for_sb; - /* except that LSB of pprev will be stolen */ + /* except that LSB of pprev is stolen */ +#define WRITE_HOLD 1 /* ... for use by mnt_hold_writers() */ const char *mnt_devname; /* Name of device e.g. 
/dev/dsk/hda1 */ struct list_head mnt_list; struct list_head mnt_expire; /* link in fs-specific expiry list */ @@ -244,4 +245,26 @@ static inline struct mount *topmost_overmount(struct mount *m) return m; } +static inline bool __test_write_hold(struct mount * __aligned(1) *val) +{ + return (unsigned long)val & WRITE_HOLD; +} + +static inline bool test_write_hold(const struct mount *m) +{ + return __test_write_hold(m->mnt_pprev_for_sb); +} + +static inline void set_write_hold(struct mount *m) +{ + m->mnt_pprev_for_sb = (void *)((unsigned long)m->mnt_pprev_for_sb + | WRITE_HOLD); +} + +static inline void clear_write_hold(struct mount *m) +{ + m->mnt_pprev_for_sb = (void *)((unsigned long)m->mnt_pprev_for_sb + & ~WRITE_HOLD); +} + struct mnt_namespace *mnt_ns_from_dentry(struct dentry *dentry); diff --git a/fs/namespace.c b/fs/namespace.c index 06be5b65b559..8e6b6523d3e8 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -509,20 +509,20 @@ int mnt_get_write_access(struct vfsmount *m) mnt_inc_writers(mnt); /* * The store to mnt_inc_writers must be visible before we pass - * MNT_WRITE_HOLD loop below, so that the slowpath can see our - * incremented count after it has set MNT_WRITE_HOLD. + * WRITE_HOLD loop below, so that the slowpath can see our + * incremented count after it has set WRITE_HOLD. */ smp_mb(); might_lock(&mount_lock.lock); - while (READ_ONCE(mnt->mnt.mnt_flags) & MNT_WRITE_HOLD) { + while (__test_write_hold(READ_ONCE(mnt->mnt_pprev_for_sb))) { if (!IS_ENABLED(CONFIG_PREEMPT_RT)) { cpu_relax(); } else { /* * This prevents priority inversion, if the task - * setting MNT_WRITE_HOLD got preempted on a remote + * setting WRITE_HOLD got preempted on a remote * CPU, and it prevents life lock if the task setting - * MNT_WRITE_HOLD has a lower priority and is bound to + * WRITE_HOLD has a lower priority and is bound to * the same CPU as the task that is spinning here. 
*/ preempt_enable(); @@ -533,7 +533,7 @@ int mnt_get_write_access(struct vfsmount *m) } /* * The barrier pairs with the barrier sb_start_ro_state_change() making - * sure that if we see MNT_WRITE_HOLD cleared, we will also see + * sure that if we see WRITE_HOLD cleared, we will also see * s_readonly_remount set (or even SB_RDONLY / MNT_READONLY flags) in * mnt_is_readonly() and bail in case we are racing with remount * read-only. @@ -672,15 +672,15 @@ EXPORT_SYMBOL(mnt_drop_write_file); * @mnt. * * Context: This function expects lock_mount_hash() to be held serializing - * setting MNT_WRITE_HOLD. + * setting WRITE_HOLD. * Return: On success 0 is returned. * On error, -EBUSY is returned. */ static inline int mnt_hold_writers(struct mount *mnt) { - mnt->mnt.mnt_flags |= MNT_WRITE_HOLD; + set_write_hold(mnt); /* - * After storing MNT_WRITE_HOLD, we'll read the counters. This store + * After storing WRITE_HOLD, we'll read the counters. This store * should be visible before we do. */ smp_mb(); @@ -696,9 +696,9 @@ static inline int mnt_hold_writers(struct mount *mnt) * sum up each counter, if we read a counter before it is incremented, * but then read another CPU's count which it has been subsequently * decremented from -- we would see more decrements than we should. - * MNT_WRITE_HOLD protects against this scenario, because + * WRITE_HOLD protects against this scenario, because * mnt_want_write first increments count, then smp_mb, then spins on - * MNT_WRITE_HOLD, so it can't be decremented by another CPU while + * WRITE_HOLD, so it can't be decremented by another CPU while * we're counting up here. 
*/ if (mnt_get_writers(mnt) > 0) @@ -720,14 +720,14 @@ static inline int mnt_hold_writers(struct mount *mnt) */ static inline void mnt_unhold_writers(struct mount *mnt) { - if (!(mnt->mnt_flags & MNT_WRITE_HOLD)) + if (!test_write_hold(mnt)) return; /* - * MNT_READONLY must become visible before ~MNT_WRITE_HOLD, so writers + * MNT_READONLY must become visible before ~WRITE_HOLD, so writers * that become unheld will see MNT_READONLY. */ smp_wmb(); - mnt->mnt.mnt_flags &= ~MNT_WRITE_HOLD; + clear_write_hold(mnt); } static inline void mnt_del_instance(struct mount *m) @@ -766,7 +766,7 @@ int sb_prepare_remount_readonly(struct super_block *sb) { int err = 0; - /* Racy optimization. Recheck the counter under MNT_WRITE_HOLD */ + /* Racy optimization. Recheck the counter under WRITE_HOLD */ if (atomic_long_read(&sb->s_remove_count)) return -EBUSY; @@ -784,8 +784,8 @@ int sb_prepare_remount_readonly(struct super_block *sb) if (!err) sb_start_ro_state_change(sb); for (struct mount *m = sb->s_mounts; m; m = m->mnt_next_for_sb) { - if (m->mnt.mnt_flags & MNT_WRITE_HOLD) - m->mnt.mnt_flags &= ~MNT_WRITE_HOLD; + if (test_write_hold(m)) + clear_write_hold(m); } unlock_mount_hash(); diff --git a/include/linux/mount.h b/include/linux/mount.h index 18e4b97f8a98..85e97b9340ff 100644 --- a/include/linux/mount.h +++ b/include/linux/mount.h @@ -33,7 +33,6 @@ enum mount_flags { MNT_NOSYMFOLLOW = 0x80, MNT_SHRINKABLE = 0x100, - MNT_WRITE_HOLD = 0x200, MNT_INTERNAL = 0x4000, @@ -52,7 +51,7 @@ enum mount_flags { | MNT_READONLY | MNT_NOSYMFOLLOW, MNT_ATIME_MASK = MNT_NOATIME | MNT_NODIRATIME | MNT_RELATIME, - MNT_INTERNAL_FLAGS = MNT_WRITE_HOLD | MNT_INTERNAL | MNT_DOOMED | + MNT_INTERNAL_FLAGS = MNT_INTERNAL | MNT_DOOMED | MNT_SYNC_UMOUNT | MNT_LOCKED }; -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
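The bit-stealing trick in the patch above can be tried out in userspace with a small sketch like the following. Assumptions, all of them mine rather than the patch's: the names HOLD_BIT, struct item, store_pprev() and real_pprev() are made up for illustration, and the back-pointer is kept in a uintptr_t instead of a pointer marked __aligned(1), which sidesteps the question of what the compiler may assume about pointer alignment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define HOLD_BIT ((uintptr_t)1)	/* hypothetical; the patch calls it WRITE_HOLD */

struct item {
	struct item *next;
	uintptr_t pprev_bits;	/* back-pointer with a flag in the LSB */
};

/* Store a back-pointer; it must be at least 2-byte aligned. */
static void store_pprev(struct item *it, struct item **pprev)
{
	assert(((uintptr_t)pprev & HOLD_BIT) == 0);
	it->pprev_bits = (uintptr_t)pprev;
}

static bool test_hold(const struct item *it)
{
	return it->pprev_bits & HOLD_BIT;
}

static void set_hold(struct item *it)
{
	it->pprev_bits |= HOLD_BIT;
}

static void clear_hold(struct item *it)
{
	it->pprev_bits &= ~HOLD_BIT;
}

/* The back-pointer with the flag masked off. */
static struct item **real_pprev(const struct item *it)
{
	return (struct item **)(it->pprev_bits & ~HOLD_BIT);
}
```

As far as I can tell from the commit message, the patch itself does not need an equivalent of real_pprev(): insertions and removals happen in mount_lock scopes where the bit is guaranteed to be clear, so list code can use the pointer as is. The masked accessor is only here to show that the pointer survives setting and clearing the flag.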
* Re: [62/63] struct mount: relocate MNT_WRITE_HOLD bit 2025-08-29 6:07 ` [62/63] struct mount: relocate MNT_WRITE_HOLD bit Al Viro @ 2025-09-01 11:26 ` Christian Brauner 0 siblings, 0 replies; 321+ messages in thread From: Christian Brauner @ 2025-09-01 11:26 UTC (permalink / raw) To: Al Viro; +Cc: Linus Torvalds, linux-fsdevel, jack On Fri, Aug 29, 2025 at 07:07:05AM +0100, Al Viro wrote: > ... from ->mnt_flags to LSB of ->mnt_pprev_for_sb. > > This is safe - we always set and clear it within the same mount_lock > scope, so we won't interfere with list operations - traversals are > always forward, so they don't even look at ->mnt_prev_for_sb and > both insertions and removals are in mount_lock scopes of their own, > so that bit will be clear in *all* mount instances during those. > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > --- > fs/mount.h | 25 ++++++++++++++++++++++++- > fs/namespace.c | 34 +++++++++++++++++----------------- > include/linux/mount.h | 3 +-- > 3 files changed, 42 insertions(+), 20 deletions(-) > > diff --git a/fs/mount.h b/fs/mount.h > index b208f69f69d7..40cf16544317 100644 > --- a/fs/mount.h > +++ b/fs/mount.h > @@ -66,7 +66,8 @@ struct mount { > struct list_head mnt_child; /* and going through their mnt_child */ > struct mount *mnt_next_for_sb; /* the next two fields are hlist_node, */ > struct mount * __aligned(1) *mnt_pprev_for_sb; > - /* except that LSB of pprev will be stolen */ > + /* except that LSB of pprev is stolen */ > +#define WRITE_HOLD 1 /* ... for use by mnt_hold_writers() */ > const char *mnt_devname; /* Name of device e.g. 
/dev/dsk/hda1 */ > struct list_head mnt_list; > struct list_head mnt_expire; /* link in fs-specific expiry list */ > @@ -244,4 +245,26 @@ static inline struct mount *topmost_overmount(struct mount *m) > return m; > } > > +static inline bool __test_write_hold(struct mount * __aligned(1) *val) > +{ > + return (unsigned long)val & WRITE_HOLD; > +} > + > +static inline bool test_write_hold(const struct mount *m) > +{ > + return __test_write_hold(m->mnt_pprev_for_sb); > +} > + > +static inline void set_write_hold(struct mount *m) > +{ > + m->mnt_pprev_for_sb = (void *)((unsigned long)m->mnt_pprev_for_sb > + | WRITE_HOLD); > +} > + > +static inline void clear_write_hold(struct mount *m) > +{ > + m->mnt_pprev_for_sb = (void *)((unsigned long)m->mnt_pprev_for_sb > + & ~WRITE_HOLD); > +} I have to say that I find this really unpleasant but... I've seen issues with the current MNT_WRITE_HOLD handling before when it interacted with MNT_ONRB (I killed that a while ago), Reviewed-by: Christian Brauner <brauner@kernel.org> ^ permalink raw reply [flat|nested] 321+ messages in thread
* [PATCH v2 62/63] simplify the callers of mnt_unhold_writers() 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (59 preceding siblings ...) 2025-08-28 23:08 ` [PATCH v2 61/63] struct mount: relocate MNT_WRITE_HOLD bit Al Viro @ 2025-08-28 23:08 ` Al Viro 2025-08-28 23:08 ` [PATCH v2 63/63] WRITE_HOLD machinery: no need for to bump mount_lock seqcount Al Viro 61 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-08-28 23:08 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds The cleanup-on-failure logic in mount_setattr_prepare() is simplified by having the mnt_hold_writers() failure followed by advancing m to the next node in the tree before leaving the loop. And since all calls of mnt_unhold_writers() are preceded by the same check that the flag has been set and the function is inlined, let's just shift the check into it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 34 ++++++++++------------------------ 1 file changed, 10 insertions(+), 24 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index f9c9c69a815b..6b439e5e5a27 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -714,13 +714,14 @@ static inline int mnt_hold_writers(struct mount *mnt) * Stop preventing write access to @mnt allowing callers to gain write access * to @mnt again. * - * This function can only be called after a successful call to - * mnt_hold_writers(). + * This function can only be called after a call to mnt_hold_writers(). * * Context: This function expects lock_mount_hash() to be held. */ static inline void mnt_unhold_writers(struct mount *mnt) { + if (!(mnt->mnt_pprev_for_sb & WRITE_HOLD)) + return; /* * MNT_READONLY must become visible before ~WRITE_HOLD, so writers * that become unheld will see MNT_READONLY.
@@ -4793,8 +4794,10 @@ static int mount_setattr_prepare(struct mount_kattr *kattr, struct mount *mnt) if (!mnt_allow_writers(kattr, m)) { err = mnt_hold_writers(m); - if (err) + if (err) { + m = next_mnt(m, mnt); break; + } } if (!(kattr->kflags & MOUNT_KATTR_RECURSE)) @@ -4802,25 +4805,9 @@ static int mount_setattr_prepare(struct mount_kattr *kattr, struct mount *mnt) } if (err) { - struct mount *p; - - /* - * If we had to call mnt_hold_writers() WRITE_HOLD will - * be set in @mnt_flags. The loop unsets WRITE_HOLD for all - * mounts and needs to take care to include the first mount. - */ - for (p = mnt; p; p = next_mnt(p, mnt)) { - /* If we had to hold writers unblock them. */ - if (p->mnt_pprev_for_sb & WRITE_HOLD) - mnt_unhold_writers(p); - - /* - * We're done once the first mount we changed got - * WRITE_HOLD unset. - */ - if (p == m) - break; - } + /* undo all mnt_hold_writers() we'd done */ + for (struct mount *p = mnt; p != m; p = next_mnt(p, mnt)) + mnt_unhold_writers(p); } return err; } @@ -4851,8 +4838,7 @@ static void mount_setattr_commit(struct mount_kattr *kattr, struct mount *mnt) WRITE_ONCE(m->mnt.mnt_flags, flags); /* If we had to hold writers unblock them. */ - if (mnt->mnt_pprev_for_sb & WRITE_HOLD) - mnt_unhold_writers(m); + mnt_unhold_writers(m); if (kattr->propagation) change_mnt_propagation(m, kattr->propagation); -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
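The shape of the simplified rollback in the patch above, reduced to a userspace toy. Everything here is hypothetical (struct res, hold(), unhold(), hold_all()); an array walk stands in for next_mnt(), and hold() mimics the one property of mnt_hold_writers() the patch relies on, namely that the flag stays set when the call fails.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct res {
	bool held;
	bool busy;	/* makes hold() fail, like a mount with active writers */
};

/* Sets the flag first and checks afterwards, so a failure leaves it set. */
static int hold(struct res *r)
{
	r->held = true;
	return r->busy ? -1 : 0;
}

/* The check lives in the callee, so it is safe on never-held elements. */
static void unhold(struct res *r)
{
	if (!r->held)
		return;
	r->held = false;
}

static int hold_all(struct res *v, size_t n)
{
	struct res *m, *end = v + n;
	int err = 0;

	assert(v || n == 0);
	for (m = v; m != end; m++) {
		err = hold(m);
		if (err) {
			m++;	/* step past the element that failed ... */
			break;
		}
	}
	if (err) {
		/* ... so that [v, m) is exactly what has to be undone */
		for (struct res *p = v; p != m; p++)
			unhold(p);
	}
	return err;
}
```

Advancing the cursor before leaving the loop turns the undo range into a plain half-open interval, which is what lets the old "include the first mount, stop once p == m" dance collapse into a two-line loop.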
* [PATCH v2 63/63] WRITE_HOLD machinery: no need for to bump mount_lock seqcount 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (60 preceding siblings ...) 2025-08-28 23:08 ` [PATCH v2 62/63] simplify the callers of mnt_unhold_writers() Al Viro @ 2025-08-28 23:08 ` Al Viro 2025-09-01 11:28 ` Christian Brauner 61 siblings, 1 reply; 321+ messages in thread From: Al Viro @ 2025-08-28 23:08 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds ... neither for insertion into the list of instances, nor for mnt_{un,}hold_writers(), nor for mnt_get_write_access() deciding to be nice to RT during a busy-wait loop - all of that only needs the spinlock side of mount_lock. IOW, it's mount_locked_reader, not mount_writer. Clarify the comment re locking rules for mnt_unhold_writers() - it's not just that mount_lock needs to be held when calling that, it must have been held all along since the matching mnt_hold_writers(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 6b439e5e5a27..545fef0682b1 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -526,8 +526,8 @@ int mnt_get_write_access(struct vfsmount *m) * the same CPU as the task that is spinning here. */ preempt_enable(); - lock_mount_hash(); - unlock_mount_hash(); + read_seqlock_excl(&mount_lock); + read_sequnlock_excl(&mount_lock); preempt_disable(); } } @@ -671,7 +671,7 @@ EXPORT_SYMBOL(mnt_drop_write_file); * a call to mnt_unhold_writers() in order to stop preventing write access to * @mnt. * - * Context: This function expects lock_mount_hash() to be held serializing + * Context: This function expects to be in mount_locked_reader scope serializing * setting WRITE_HOLD. * Return: On success 0 is returned. * On error, -EBUSY is returned. 
@@ -716,7 +716,8 @@ static inline int mnt_hold_writers(struct mount *mnt) * * This function can only be called after a call to mnt_hold_writers(). * - * Context: This function expects lock_mount_hash() to be held. + * Context: This function expects to be in the same mount_locked_reader scope + * as the matching mnt_hold_writers(). */ static inline void mnt_unhold_writers(struct mount *mnt) { @@ -770,7 +771,8 @@ int sb_prepare_remount_readonly(struct super_block *sb) if (atomic_long_read(&sb->s_remove_count)) return -EBUSY; - lock_mount_hash(); + guard(mount_locked_reader)(); + for (struct mount *m = sb->s_mounts; m; m = m->mnt_next_for_sb) { if (!(m->mnt.mnt_flags & MNT_READONLY)) { err = mnt_hold_writers(m); @@ -787,7 +789,6 @@ int sb_prepare_remount_readonly(struct super_block *sb) if (m->mnt_pprev_for_sb & WRITE_HOLD) m->mnt_pprev_for_sb &= ~WRITE_HOLD; } - unlock_mount_hash(); return err; } @@ -1226,9 +1227,8 @@ static void setup_mnt(struct mount *m, struct dentry *root) m->mnt_mountpoint = m->mnt.mnt_root; m->mnt_parent = m; - lock_mount_hash(); + guard(mount_locked_reader)(); mnt_add_instance(m, s); - unlock_mount_hash(); } /** -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* Re: [PATCH v2 63/63] WRITE_HOLD machinery: no need for to bump mount_lock seqcount 2025-08-28 23:08 ` [PATCH v2 63/63] WRITE_HOLD machinery: no need for to bump mount_lock seqcount Al Viro @ 2025-09-01 11:28 ` Christian Brauner 0 siblings, 0 replies; 321+ messages in thread From: Christian Brauner @ 2025-09-01 11:28 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, jack, torvalds On Fri, Aug 29, 2025 at 12:08:06AM +0100, Al Viro wrote: > ... neither for insertion into the list of instances, nor for > mnt_{un,}hold_writers(), nor for mnt_get_write_access() deciding > to be nice to RT during a busy-wait loop - all of that only needs > the spinlock side of mount_lock. > > IOW, it's mount_locked_reader, not mount_writer. > > Clarify the comment re locking rules for mnt_unhold_writers() - it's > not just that mount_lock needs to be held when calling that, it must > have been held all along since the matching mnt_hold_writers(). > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > --- Reviewed-by: Christian Brauner <brauner@kernel.org> ^ permalink raw reply [flat|nested] 321+ messages in thread
* [PATCHES v3][RFC][CFT] mount-related stuff 2025-08-28 23:07 ` [PATCHES v2][RFC][CFT] " Al Viro 2025-08-28 23:07 ` [PATCH v2 01/63] fs/namespace.c: fix the namespace_sem guard mess Al Viro @ 2025-09-03 4:54 ` Al Viro 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (2 more replies) 1 sibling, 3 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:54 UTC (permalink / raw) To: linux-fsdevel; +Cc: Linus Torvalds, Christian Brauner, Jan Kara

Branch force-pushed into
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git #work.mount
(also visible as #v3.mount, #v[12].mount being the previous versions)

Individual patches in followups. If nobody objects, this goes into #for-next.

Changes since v2 (other than applied r-b):
#26: typo fix in description (do_new_mount_rc -> do_new_mount_fc)
#33, #35: massage suggested by Linus
#35: where_to_mount() massage, more or less along the lines of Christian's suggestion.
between #52 and #53: document locking in path_check_mount() and use guard() in its caller (path_has_submounts()).
#56 (now #57): fixed editing braino in commit message
#58 (now #59): restored lost mnt_ns_tree_add()
#59..63 (now #60..64): rewritten (as posted last week)
added at the end of the series: constify {__,}mnt_is_readonly()

Diffstat:
 fs/dcache.c                   |   4 +-
 fs/ecryptfs/dentry.c          |  14 +-
 fs/ecryptfs/ecryptfs_kernel.h |  27 +-
 fs/ecryptfs/file.c            |  15 +-
 fs/ecryptfs/inode.c           |  19 +-
 fs/ecryptfs/main.c            |  24 +-
 fs/internal.h                 |   4 +-
 fs/mount.h                    |  39 +-
 fs/namespace.c                | 992 ++++++++++++++++++++----------------------
 fs/pnode.c                    |  75 +++-
 fs/pnode.h                    |   1 +
 fs/super.c                    |   3 +-
 include/linux/fs.h            |   4 +-
 include/linux/mount.h         |   9 +-
 kernel/audit_tree.c           |  12 +-
 15 files changed, 603 insertions(+), 639 deletions(-)

^ permalink raw reply [flat|nested] 321+ messages in thread
* [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess 2025-09-03 4:54 ` [PATCHES v3][RFC][CFT] mount-related stuff Al Viro @ 2025-09-03 4:54 ` Al Viro 2025-09-03 4:54 ` [PATCH v3 02/65] introduced guards for mount_lock Al Viro ` (74 more replies) 2025-09-03 5:08 ` [PATCHES v3][RFC][CFT] mount-related stuff Al Viro 2025-09-03 14:47 ` Linus Torvalds 2 siblings, 75 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:54 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds If anything, namespace_lock should be DEFINE_LOCK_GUARD_0, not DEFINE_GUARD. That way we * do not need to feed it a bogus argument * do not get gcc trying to compare an address of static in file variable with -4097 - and, if we are unlucky, trying to keep it in a register, with spills and all such. The same problems apply to grabbing namespace_sem shared. Rename it to namespace_excl, add namespace_shared, convert the existing users: guard(namespace_lock, &namespace_sem) => guard(namespace_excl)() guard(rwsem_read, &namespace_sem) => guard(namespace_shared)() scoped_guard(namespace_lock, &namespace_sem) => scoped_guard(namespace_excl) scoped_guard(rwsem_read, &namespace_sem) => scoped_guard(namespace_shared) Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 18 +++++++++++------- 1 file changed, 11 insertions(+), 7 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index ae6d1312b184..fcea65587ff9 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -82,6 +82,12 @@ static LIST_HEAD(ex_mountpoints); /* protected by namespace_sem */ static struct mnt_namespace *emptied_ns; /* protected by namespace_sem */ static DEFINE_SEQLOCK(mnt_ns_tree_lock); +static inline void namespace_lock(void); +static void namespace_unlock(void); +DEFINE_LOCK_GUARD_0(namespace_excl, namespace_lock(), namespace_unlock()) +DEFINE_LOCK_GUARD_0(namespace_shared, down_read(&namespace_sem), + up_read(&namespace_sem)) 
+ #ifdef CONFIG_FSNOTIFY LIST_HEAD(notify_list); /* protected by namespace_sem */ #endif @@ -1776,8 +1782,6 @@ static inline void namespace_lock(void) down_write(&namespace_sem); } -DEFINE_GUARD(namespace_lock, struct rw_semaphore *, namespace_lock(), namespace_unlock()) - enum umount_tree_flags { UMOUNT_SYNC = 1, UMOUNT_PROPAGATE = 2, @@ -2306,7 +2310,7 @@ struct path *collect_paths(const struct path *path, struct path *res = prealloc, *to_free = NULL; unsigned n = 0; - guard(rwsem_read)(&namespace_sem); + guard(namespace_shared)(); if (!check_mnt(root)) return ERR_PTR(-EINVAL); @@ -2361,7 +2365,7 @@ void dissolve_on_fput(struct vfsmount *mnt) return; } - scoped_guard(namespace_lock, &namespace_sem) { + scoped_guard(namespace_excl) { if (!anon_ns_root(m)) return; @@ -2435,7 +2439,7 @@ struct vfsmount *clone_private_mount(const struct path *path) struct mount *old_mnt = real_mount(path->mnt); struct mount *new_mnt; - guard(rwsem_read)(&namespace_sem); + guard(namespace_shared)(); if (IS_MNT_UNBINDABLE(old_mnt)) return ERR_PTR(-EINVAL); @@ -5957,7 +5961,7 @@ SYSCALL_DEFINE4(statmount, const struct mnt_id_req __user *, req, if (ret) return ret; - scoped_guard(rwsem_read, &namespace_sem) + scoped_guard(namespace_shared) ret = do_statmount(ks, kreq.mnt_id, kreq.mnt_ns_id, ns); if (!ret) @@ -6079,7 +6083,7 @@ SYSCALL_DEFINE4(listmount, const struct mnt_id_req __user *, req, * We only need to guard against mount topology changes as * listmount() doesn't care about any mount properties. */ - scoped_guard(rwsem_read, &namespace_sem) + scoped_guard(namespace_shared) ret = do_listmount(ns, kreq.mnt_id, last_mnt_id, kmnt_ids, nr_mnt_ids, (flags & LISTMOUNT_REVERSE)); if (ret <= 0) -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
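For anyone unfamiliar with what a zero-argument lock guard boils down to, here is a rough userspace approximation built directly on the GCC/Clang cleanup attribute. It is a sketch under stated assumptions, not the cleanup.h implementation: demo_lock()/demo_unlock() and the depth/releases counters are hypothetical stand-ins for namespace_lock()/namespace_unlock(), and DEMO_GUARD() is a made-up macro that only allows one guard per scope.

```c
#include <assert.h>

static int depth;	/* hypothetical stand-in for the lock state */
static int releases;	/* how many times the unlock side has run */

static void demo_lock(void)   { depth++; }
static void demo_unlock(void) { depth--; releases++; }

/* Zero-argument guard: the guard object carries no lock pointer at all. */
struct demo_guard { char unused; };

static inline struct demo_guard demo_guard_enter(void)
{
	demo_lock();
	return (struct demo_guard){ 0 };
}

static inline void demo_guard_exit(struct demo_guard *g)
{
	(void)g;
	demo_unlock();
}

/* Declares a scope-bound variable whose cleanup handler drops the lock. */
#define DEMO_GUARD() \
	struct demo_guard demo_guard_var \
		__attribute__((cleanup(demo_guard_exit), unused)) = \
		demo_guard_enter()

static int find(const int *v, int n, int key)
{
	DEMO_GUARD();

	assert(depth == 1);		/* lock is held for the whole body */
	for (int i = 0; i < n; i++)
		if (v[i] == key)
			return i;	/* unlock runs on this path ... */
	return -1;			/* ... and on this one */
}
```

Since the lock is a file-scope static, nothing has to be passed in or stored in the guard, which is the point the commit message makes about not feeding the guard a bogus argument.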
* [PATCH v3 02/65] introduced guards for mount_lock 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro @ 2025-09-03 4:54 ` Al Viro 2025-09-03 4:54 ` [PATCH v3 03/65] fs/namespace.c: allow to drop vfsmount references via __free(mntput) Al Viro ` (73 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:54 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds mount_writer: write_seqlock; that's an equivalent of {un,}lock_mount_hash() mount_locked_reader: read_seqlock_excl; these tend to be open-coded. No bulk conversions, please - if nothing else, quite a few places use the mount_writer form when mount_locked_reader is sufficient. It needs to be dealt with carefully. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/mount.h | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/fs/mount.h b/fs/mount.h index 97737051a8b9..ed8c83ba836a 100644 --- a/fs/mount.h +++ b/fs/mount.h @@ -154,6 +154,11 @@ static inline void get_mnt_ns(struct mnt_namespace *ns) extern seqlock_t mount_lock; +DEFINE_LOCK_GUARD_0(mount_writer, write_seqlock(&mount_lock), + write_sequnlock(&mount_lock)) +DEFINE_LOCK_GUARD_0(mount_locked_reader, read_seqlock_excl(&mount_lock), + read_sequnlock_excl(&mount_lock)) + struct proc_mounts { struct mnt_namespace *ns; struct path root; -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v3 03/65] fs/namespace.c: allow to drop vfsmount references via __free(mntput) 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro 2025-09-03 4:54 ` [PATCH v3 02/65] introduced guards for mount_lock Al Viro @ 2025-09-03 4:54 ` Al Viro 2025-09-03 4:54 ` [PATCH v3 04/65] __detach_mounts(): use guards Al Viro ` (72 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:54 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Note that just as path_put, it should never be done in scope of namespace_sem, be it shared or exclusive. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/fs/namespace.c b/fs/namespace.c index fcea65587ff9..767ab751ee2a 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -88,6 +88,8 @@ DEFINE_LOCK_GUARD_0(namespace_excl, namespace_lock(), namespace_unlock()) DEFINE_LOCK_GUARD_0(namespace_shared, down_read(&namespace_sem), up_read(&namespace_sem)) +DEFINE_FREE(mntput, struct vfsmount *, if (!IS_ERR(_T)) mntput(_T)) + #ifdef CONFIG_FSNOTIFY LIST_HEAD(notify_list); /* protected by namespace_sem */ #endif -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
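A userspace approximation of what the DEFINE_FREE(mntput, ...) line above arranges, for those who have not looked inside cleanup.h. All names are hypothetical: struct obj with obj_get()/obj_put() stands in for a refcounted vfsmount, and DEMO_ERR_PTR()/DEMO_IS_ERR() are a miniature of the kernel's error-pointer convention, written from memory of how IS_ERR() works (top 4095 addresses are errors).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct obj { int refs; };

static int puts_done;	/* counts how many references were dropped */

/* Hypothetical miniature of the kernel's ERR_PTR convention. */
#define DEMO_ERR_PTR(e)	((void *)(intptr_t)(e))
#define DEMO_IS_ERR(p)	((uintptr_t)(p) >= (uintptr_t)-4095)

static void obj_put(struct obj *o)
{
	if (!o)
		return;
	assert(o->refs > 0);
	puts_done++;
	if (--o->refs == 0)
		free(o);
}

static struct obj *obj_get(int fail)
{
	struct obj *o;

	if (fail)
		return DEMO_ERR_PTR(-22);	/* pretend -EINVAL */
	o = malloc(sizeof(*o));
	if (!o)
		return DEMO_ERR_PTR(-12);	/* pretend -ENOMEM */
	o->refs = 1;
	return o;
}

/* Cleanup handler: skip error pointers, drop everything else. */
static void obj_put_cleanup(struct obj **p)
{
	if (!DEMO_IS_ERR(*p))
		obj_put(*p);
}

#define demo_free_obj __attribute__((cleanup(obj_put_cleanup)))

static int use_obj(int fail)
{
	struct obj *o demo_free_obj = obj_get(fail);

	if (DEMO_IS_ERR(o))
		return (int)(intptr_t)o;	/* nothing to drop */
	return o->refs;		/* reference dropped after this is evaluated */
}
```

Cleanup handlers run in reverse order of declaration, so a variable like o declared before a lock guard in the same scope is released after the lock has been dropped. Presumably that is how the series keeps the final mntput() out of namespace_sem and mount_lock scopes, but that is my reading of the commit message rather than something the sketch demonstrates.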
* [PATCH v3 04/65] __detach_mounts(): use guards 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro 2025-09-03 4:54 ` [PATCH v3 02/65] introduced guards for mount_lock Al Viro 2025-09-03 4:54 ` [PATCH v3 03/65] fs/namespace.c: allow to drop vfsmount references via __free(mntput) Al Viro @ 2025-09-03 4:54 ` Al Viro 2025-09-03 4:54 ` [PATCH v3 05/65] __is_local_mountpoint(): " Al Viro ` (71 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:54 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Clean fit for guards use; guards can't be weaker due to umount_tree() calls. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 767ab751ee2a..1ae1ab8815c9 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2032,10 +2032,11 @@ void __detach_mounts(struct dentry *dentry) struct pinned_mountpoint mp = {}; struct mount *mnt; - namespace_lock(); - lock_mount_hash(); + guard(namespace_excl)(); + guard(mount_writer)(); + if (!lookup_mountpoint(dentry, &mp)) - goto out_unlock; + return; event++; while (mp.node.next) { @@ -2047,9 +2048,6 @@ void __detach_mounts(struct dentry *dentry) else umount_tree(mnt, UMOUNT_CONNECTED); } unpin_mountpoint(&mp); -out_unlock: - unlock_mount_hash(); - namespace_unlock(); } /* -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v3 05/65] __is_local_mountpoint(): use guards 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (2 preceding siblings ...) 2025-09-03 4:54 ` [PATCH v3 04/65] __detach_mounts(): use guards Al Viro @ 2025-09-03 4:54 ` Al Viro 2025-09-03 4:54 ` [PATCH v3 06/65] do_change_type(): " Al Viro ` (70 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:54 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds clean fit; namespace_shared due to iterating through ns->mounts. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 15 ++++++--------- 1 file changed, 6 insertions(+), 9 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 1ae1ab8815c9..f1460ddd1486 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -906,17 +906,14 @@ bool __is_local_mountpoint(const struct dentry *dentry) { struct mnt_namespace *ns = current->nsproxy->mnt_ns; struct mount *mnt, *n; - bool is_covered = false; - down_read(&namespace_sem); - rbtree_postorder_for_each_entry_safe(mnt, n, &ns->mounts, mnt_node) { - is_covered = (mnt->mnt_mountpoint == dentry); - if (is_covered) - break; - } - up_read(&namespace_sem); + guard(namespace_shared)(); + + rbtree_postorder_for_each_entry_safe(mnt, n, &ns->mounts, mnt_node) + if (mnt->mnt_mountpoint == dentry) + return true; - return is_covered; + return false; } struct pinned_mountpoint { -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v3 06/65] do_change_type(): use guards 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (3 preceding siblings ...) 2025-09-03 4:54 ` [PATCH v3 05/65] __is_local_mountpoint(): " Al Viro @ 2025-09-03 4:54 ` Al Viro 2025-09-03 4:54 ` [PATCH v3 07/65] do_set_group(): " Al Viro ` (69 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:54 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds clean fit; namespace_excl to modify propagation graph Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index f1460ddd1486..a6a7b068770a 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2899,7 +2899,7 @@ static int do_change_type(struct path *path, int ms_flags) struct mount *mnt = real_mount(path->mnt); int recurse = ms_flags & MS_REC; int type; - int err = 0; + int err; if (!path_mounted(path)) return -EINVAL; @@ -2908,23 +2908,22 @@ static int do_change_type(struct path *path, int ms_flags) if (!type) return -EINVAL; - namespace_lock(); + guard(namespace_excl)(); + err = may_change_propagation(mnt); if (err) - goto out_unlock; + return err; if (type == MS_SHARED) { err = invent_group_ids(mnt, recurse); if (err) - goto out_unlock; + return err; } for (m = mnt; m; m = (recurse ? next_mnt(m, mnt) : NULL)) change_mnt_propagation(m, type); - out_unlock: - namespace_unlock(); - return err; + return 0; } /* may_copy_tree() - check if a mount tree can be copied -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v3 07/65] do_set_group(): use guards 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (4 preceding siblings ...) 2025-09-03 4:54 ` [PATCH v3 06/65] do_change_type(): " Al Viro @ 2025-09-03 4:54 ` Al Viro 2025-09-03 4:54 ` [PATCH v3 08/65] mark_mounts_for_expiry(): " Al Viro ` (68 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:54 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds clean fit; namespace_excl to modify propagation graph Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 33 +++++++++++++-------------------- 1 file changed, 13 insertions(+), 20 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index a6a7b068770a..13e2f3837a26 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3349,47 +3349,44 @@ static inline int tree_contains_unbindable(struct mount *mnt) static int do_set_group(struct path *from_path, struct path *to_path) { - struct mount *from, *to; + struct mount *from = real_mount(from_path->mnt); + struct mount *to = real_mount(to_path->mnt); int err; - from = real_mount(from_path->mnt); - to = real_mount(to_path->mnt); - - namespace_lock(); + guard(namespace_excl)(); err = may_change_propagation(from); if (err) - goto out; + return err; err = may_change_propagation(to); if (err) - goto out; + return err; - err = -EINVAL; /* To and From paths should be mount roots */ if (!path_mounted(from_path)) - goto out; + return -EINVAL; if (!path_mounted(to_path)) - goto out; + return -EINVAL; /* Setting sharing groups is only allowed across same superblock */ if (from->mnt.mnt_sb != to->mnt.mnt_sb) - goto out; + return -EINVAL; /* From mount root should be wider than To mount root */ if (!is_subdir(to->mnt.mnt_root, from->mnt.mnt_root)) - goto out; + return -EINVAL; /* From mount should not have locked children in place of To's root */ if 
(__has_locked_children(from, to->mnt.mnt_root)) - goto out; + return -EINVAL; /* Setting sharing groups is only allowed on private mounts */ if (IS_MNT_SHARED(to) || IS_MNT_SLAVE(to)) - goto out; + return -EINVAL; /* From should not be private */ if (!IS_MNT_SHARED(from) && !IS_MNT_SLAVE(from)) - goto out; + return -EINVAL; if (IS_MNT_SLAVE(from)) { hlist_add_behind(&to->mnt_slave, &from->mnt_slave); @@ -3401,11 +3398,7 @@ static int do_set_group(struct path *from_path, struct path *to_path) list_add(&to->mnt_share, &from->mnt_share); set_mnt_shared(to); } - - err = 0; -out: - namespace_unlock(); - return err; + return 0; } /** -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v3 08/65] mark_mounts_for_expiry(): use guards 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (5 preceding siblings ...) 2025-09-03 4:54 ` [PATCH v3 07/65] do_set_group(): " Al Viro @ 2025-09-03 4:54 ` Al Viro 2025-09-03 4:54 ` [PATCH v3 09/65] put_mnt_ns(): " Al Viro ` (67 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:54 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Clean fit; guards can't be weaker due to umount_tree() calls. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 13e2f3837a26..898a6b7307e4 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3886,8 +3886,8 @@ void mark_mounts_for_expiry(struct list_head *mounts) if (list_empty(mounts)) return; - namespace_lock(); - lock_mount_hash(); + guard(namespace_excl)(); + guard(mount_writer)(); /* extract from the expiration list every vfsmount that matches the * following criteria: @@ -3909,8 +3909,6 @@ void mark_mounts_for_expiry(struct list_head *mounts) touch_mnt_namespace(mnt->mnt_ns); umount_tree(mnt, UMOUNT_PROPAGATE|UMOUNT_SYNC); } - unlock_mount_hash(); - namespace_unlock(); } EXPORT_SYMBOL_GPL(mark_mounts_for_expiry); -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v3 09/65] put_mnt_ns(): use guards 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (6 preceding siblings ...) 2025-09-03 4:54 ` [PATCH v3 08/65] mark_mounts_for_expiry(): " Al Viro @ 2025-09-03 4:54 ` Al Viro 2025-09-03 4:54 ` [PATCH v3 10/65] mnt_already_visible(): " Al Viro ` (66 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:54 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds clean fit; guards can't be weaker due to umount_tree() call. Setting emptied_ns requires namespace_excl, but not anything mount_lock-related. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 898a6b7307e4..86a86be2b0ef 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -6153,12 +6153,10 @@ void put_mnt_ns(struct mnt_namespace *ns) { if (!refcount_dec_and_test(&ns->ns.count)) return; - namespace_lock(); + guard(namespace_excl)(); emptied_ns = ns; - lock_mount_hash(); + guard(mount_writer)(); umount_tree(ns->root, 0); - unlock_mount_hash(); - namespace_unlock(); } struct vfsmount *kern_mount(struct file_system_type *type) -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v3 10/65] mnt_already_visible(): use guards 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (7 preceding siblings ...) 2025-09-03 4:54 ` [PATCH v3 09/65] put_mnt_ns(): " Al Viro @ 2025-09-03 4:54 ` Al Viro 2025-09-03 4:54 ` [PATCH v3 11/65] check_for_nsfs_mounts(): no need to take locks Al Viro ` (65 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:54 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds clean fit; namespace_shared due to iterating through ns->mounts. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 10 +++------- 1 file changed, 3 insertions(+), 7 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 86a86be2b0ef..a5d37b97088f 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -6232,9 +6232,8 @@ static bool mnt_already_visible(struct mnt_namespace *ns, { int new_flags = *new_mnt_flags; struct mount *mnt, *n; - bool visible = false; - down_read(&namespace_sem); + guard(namespace_shared)(); rbtree_postorder_for_each_entry_safe(mnt, n, &ns->mounts, mnt_node) { struct mount *child; int mnt_flags; @@ -6281,13 +6280,10 @@ static bool mnt_already_visible(struct mnt_namespace *ns, /* Preserve the locked attributes */ *new_mnt_flags |= mnt_flags & (MNT_LOCK_READONLY | \ MNT_LOCK_ATIME); - visible = true; - goto found; + return true; next: ; } -found: - up_read(&namespace_sem); - return visible; + return false; } static bool mount_too_revealing(const struct super_block *sb, int *new_mnt_flags) -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v3 11/65] check_for_nsfs_mounts(): no need to take locks 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (8 preceding siblings ...) 2025-09-03 4:54 ` [PATCH v3 10/65] mnt_already_visible(): " Al Viro @ 2025-09-03 4:54 ` Al Viro 2025-09-03 4:54 ` [PATCH v3 12/65] propagate_mnt(): use scoped_guard(mount_locked_reader) for mnt_set_mountpoint() Al Viro ` (64 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:54 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Currently we are taking mount_writer; what that function needs is either mount_locked_reader (we are not changing anything, we just want to iterate through the subtree) or namespace_shared and a reference held by caller on the root of subtree - that's also enough to stabilize the topology. The thing is, all callers are already holding at least namespace_shared as well as a reference to the root of subtree. Let's make the callers provide locking warranties - don't mess with mount_lock in check_for_nsfs_mounts() itself and document the locking requirements. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 16 +++++----------- 1 file changed, 5 insertions(+), 11 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index a5d37b97088f..59948cbf9c47 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2402,21 +2402,15 @@ bool has_locked_children(struct mount *mnt, struct dentry *dentry) * specified subtree. Such references can act as pins for mount namespaces * that aren't checked by the mount-cycle checking code, thereby allowing * cycles to be made. 
+ * + * locks: mount_locked_reader || namespace_shared && pinned(subtree) */ static bool check_for_nsfs_mounts(struct mount *subtree) { - struct mount *p; - bool ret = false; - - lock_mount_hash(); - for (p = subtree; p; p = next_mnt(p, subtree)) + for (struct mount *p = subtree; p; p = next_mnt(p, subtree)) if (mnt_ns_loop(p->mnt.mnt_root)) - goto out; - - ret = true; -out: - unlock_mount_hash(); - return ret; + return false; + return true; } /** -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v3 12/65] propagate_mnt(): use scoped_guard(mount_locked_reader) for mnt_set_mountpoint() 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (9 preceding siblings ...) 2025-09-03 4:54 ` [PATCH v3 11/65] check_for_nsfs_mounts(): no need to take locks Al Viro @ 2025-09-03 4:54 ` Al Viro 2025-09-03 4:54 ` [PATCH v3 13/65] has_locked_children(): use guards Al Viro ` (63 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:54 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/pnode.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/fs/pnode.c b/fs/pnode.c index 6f7d02f3fa98..0702d45d856d 100644 --- a/fs/pnode.c +++ b/fs/pnode.c @@ -304,9 +304,8 @@ int propagate_mnt(struct mount *dest_mnt, struct mountpoint *dest_mp, err = PTR_ERR(this); break; } - read_seqlock_excl(&mount_lock); - mnt_set_mountpoint(n, dest_mp, this); - read_sequnlock_excl(&mount_lock); + scoped_guard(mount_locked_reader) + mnt_set_mountpoint(n, dest_mp, this); if (n->mnt_master) SET_MNT_MARK(n->mnt_master); copy = this; -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
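The patch above limits the lock to a single statement with scoped_guard(). A rough userspace sketch of how such a statement-scoped guard can be built from a one-iteration for loop is below, again assuming the GCC/Clang cleanup attribute. The names (SCOPED_MOUNT_LOCK, toy_mount, attach_and_mark) are made up for illustration and the toy lock is not mount_lock.

```c
#include <assert.h>

/* Toy stand-in for mount_lock; hypothetical, for illustration only. */
static int mount_locked;

static void mount_lock_acquire(void)
{
	assert(!mount_locked);
	mount_locked = 1;
}

static void mount_lock_exit(int *once)
{
	(void)once;
	assert(mount_locked);
	mount_locked = 0;
}

/*
 * One-iteration loop: the lock is taken in the init clause and released
 * by the cleanup handler when the loop variable goes out of scope.
 */
#define SCOPED_MOUNT_LOCK() \
	for (int once __attribute__((cleanup(mount_lock_exit))) = \
		(mount_lock_acquire(), 1); once; once = 0)

struct toy_mount {
	int attached_under_lock;
	int marked_after_unlock;
};

void attach_and_mark(struct toy_mount *m)
{
	SCOPED_MOUNT_LOCK()
		m->attached_under_lock = mount_locked;	/* lock held here */

	m->marked_after_unlock = !mount_locked;		/* lock already dropped */
}
```

Only the statement directly under SCOPED_MOUNT_LOCK() runs with the toy lock held, which matches the intent of covering just the mnt_set_mountpoint() call.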
* [PATCH v3 13/65] has_locked_children(): use guards 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (10 preceding siblings ...) 2025-09-03 4:54 ` [PATCH v3 12/65] propagate_mnt(): use scoped_guard(mount_locked_reader) for mnt_set_mountpoint() Al Viro @ 2025-09-03 4:54 ` Al Viro 2025-09-03 4:54 ` [PATCH v3 14/65] mnt_set_expiry(): " Al Viro ` (62 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:54 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds ... and document the locking requirements of __has_locked_children() Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 59948cbf9c47..2cb3cb8307ca 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2373,6 +2373,7 @@ void dissolve_on_fput(struct vfsmount *mnt) } } +/* locks: namespace_shared && pinned(mnt) || mount_locked_reader */ static bool __has_locked_children(struct mount *mnt, struct dentry *dentry) { struct mount *child; @@ -2389,12 +2390,8 @@ static bool __has_locked_children(struct mount *mnt, struct dentry *dentry) bool has_locked_children(struct mount *mnt, struct dentry *dentry) { - bool res; - - read_seqlock_excl(&mount_lock); - res = __has_locked_children(mnt, dentry); - read_sequnlock_excl(&mount_lock); - return res; + guard(mount_locked_reader)(); + return __has_locked_children(mnt, dentry); } /* -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v3 14/65] mnt_set_expiry(): use guards 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (11 preceding siblings ...) 2025-09-03 4:54 ` [PATCH v3 13/65] has_locked_children(): use guards Al Viro @ 2025-09-03 4:54 ` Al Viro 2025-09-03 4:54 ` [PATCH v3 15/65] path_is_under(): " Al Viro ` (61 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:54 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds The reason why it needs only mount_locked_reader is that there are no lockless accesses of expiry lists. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 2cb3cb8307ca..db25c81d7f68 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3858,9 +3858,8 @@ int finish_automount(struct vfsmount *m, const struct path *path) */ void mnt_set_expiry(struct vfsmount *mnt, struct list_head *expiry_list) { - read_seqlock_excl(&mount_lock); + guard(mount_locked_reader)(); list_add_tail(&real_mount(mnt)->mnt_expire, expiry_list); - read_sequnlock_excl(&mount_lock); } EXPORT_SYMBOL(mnt_set_expiry); -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v3 15/65] path_is_under(): use guards 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (12 preceding siblings ...) 2025-09-03 4:54 ` [PATCH v3 14/65] mnt_set_expiry(): " Al Viro @ 2025-09-03 4:54 ` Al Viro 2025-09-03 4:54 ` [PATCH v3 16/65] current_chrooted(): don't bother with follow_down_one() Al Viro ` (60 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:54 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds ... and document the locking requirements for is_path_reachable(). There is one questionable caller in do_listmount() where we are not holding mount_lock *and* might not have the first argument mounted. However, in that case it will immediately return true without having to look at the ancestors. Might be cleaner to move the check into non-LSMT_ROOT case which it really belongs in - there the check is not always true and is_mounted() is guaranteed. Document the locking environments for is_path_reachable() callers: get_peer_under_root() get_dominating_id() do_statmount() do_listmount() Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 11 +++++------ fs/pnode.c | 3 ++- 2 files changed, 7 insertions(+), 7 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index db25c81d7f68..6aabf0045389 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -4592,7 +4592,7 @@ SYSCALL_DEFINE5(move_mount, /* * Return true if path is reachable from root * - * namespace_sem or mount_lock is held + * locks: mount_locked_reader || namespace_shared && is_mounted(mnt) */ bool is_path_reachable(struct mount *mnt, struct dentry *dentry, const struct path *root) @@ -4606,11 +4606,8 @@ bool is_path_reachable(struct mount *mnt, struct dentry *dentry, bool path_is_under(const struct path *path1, const struct path *path2) { - bool res; - read_seqlock_excl(&mount_lock); - res = 
is_path_reachable(real_mount(path1->mnt), path1->dentry, path2); - read_sequnlock_excl(&mount_lock); - return res; + guard(mount_locked_reader)(); + return is_path_reachable(real_mount(path1->mnt), path1->dentry, path2); } EXPORT_SYMBOL(path_is_under); @@ -5689,6 +5686,7 @@ static int grab_requested_root(struct mnt_namespace *ns, struct path *root) STATMOUNT_MNT_UIDMAP | \ STATMOUNT_MNT_GIDMAP) +/* locks: namespace_shared */ static int do_statmount(struct kstatmount *s, u64 mnt_id, u64 mnt_ns_id, struct mnt_namespace *ns) { @@ -5949,6 +5947,7 @@ SYSCALL_DEFINE4(statmount, const struct mnt_id_req __user *, req, return ret; } +/* locks: namespace_shared */ static ssize_t do_listmount(struct mnt_namespace *ns, u64 mnt_parent_id, u64 last_mnt_id, u64 *mnt_ids, size_t nr_mnt_ids, bool reverse) diff --git a/fs/pnode.c b/fs/pnode.c index 0702d45d856d..edaf9d9d0eaf 100644 --- a/fs/pnode.c +++ b/fs/pnode.c @@ -29,6 +29,7 @@ static inline struct mount *next_slave(struct mount *p) return hlist_entry(p->mnt_slave.next, struct mount, mnt_slave); } +/* locks: namespace_shared && is_mounted(mnt) */ static struct mount *get_peer_under_root(struct mount *mnt, struct mnt_namespace *ns, const struct path *root) @@ -50,7 +51,7 @@ static struct mount *get_peer_under_root(struct mount *mnt, * Get ID of closest dominating peer group having a representative * under the given root. * - * Caller must hold namespace_sem + * locks: namespace_shared */ int get_dominating_id(struct mount *mnt, const struct path *root) { -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v3 16/65] current_chrooted(): don't bother with follow_down_one() 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (13 preceding siblings ...) 2025-09-03 4:54 ` [PATCH v3 15/65] path_is_under(): " Al Viro @ 2025-09-03 4:54 ` Al Viro 2025-09-03 4:54 ` [PATCH v3 17/65] current_chrooted(): use guards Al Viro ` (59 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:54 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds All we need here is to follow ->overmount on root mount of namespace... Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 18 ++++++++---------- 1 file changed, 8 insertions(+), 10 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 6aabf0045389..cf680fbf015e 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -6194,24 +6194,22 @@ bool our_mnt(struct vfsmount *mnt) bool current_chrooted(void) { /* Does the current process have a non-standard root */ - struct path ns_root; + struct mount *root = current->nsproxy->mnt_ns->root; struct path fs_root; bool chrooted; + get_fs_root(current->fs, &fs_root); + /* Find the namespace root */ - ns_root.mnt = &current->nsproxy->mnt_ns->root->mnt; - ns_root.dentry = ns_root.mnt->mnt_root; - path_get(&ns_root); - while (d_mountpoint(ns_root.dentry) && follow_down_one(&ns_root)) - ; + read_seqlock_excl(&mount_lock); - get_fs_root(current->fs, &fs_root); + while (unlikely(root->overmount)) + root = root->overmount; - chrooted = !path_equal(&fs_root, &ns_root); + chrooted = fs_root.mnt != &root->mnt || !path_mounted(&fs_root); + read_sequnlock_excl(&mount_lock); path_put(&fs_root); - path_put(&ns_root); - return chrooted; } -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v3 17/65] current_chrooted(): use guards 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (14 preceding siblings ...) 2025-09-03 4:54 ` [PATCH v3 16/65] current_chrooted(): don't bother with follow_down_one() Al Viro @ 2025-09-03 4:54 ` Al Viro 2025-09-03 4:54 ` [PATCH v3 18/65] switch do_new_mount_fc() to fc_mount() Al Viro ` (58 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:54 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds here a use of __free(path_put) for dropping fs_root is enough to make guard(mount_locked_reader) fit... Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 15 ++++++--------- 1 file changed, 6 insertions(+), 9 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index cf680fbf015e..0474b3a93dbf 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -6194,23 +6194,20 @@ bool our_mnt(struct vfsmount *mnt) bool current_chrooted(void) { /* Does the current process have a non-standard root */ - struct mount *root = current->nsproxy->mnt_ns->root; - struct path fs_root; - bool chrooted; + struct path fs_root __free(path_put) = {}; + struct mount *root; get_fs_root(current->fs, &fs_root); /* Find the namespace root */ - read_seqlock_excl(&mount_lock); + guard(mount_locked_reader)(); + + root = current->nsproxy->mnt_ns->root; while (unlikely(root->overmount)) root = root->overmount; - chrooted = fs_root.mnt != &root->mnt || !path_mounted(&fs_root); - - read_sequnlock_excl(&mount_lock); - path_put(&fs_root); - return chrooted; + return fs_root.mnt != &root->mnt || !path_mounted(&fs_root); } static bool mnt_already_visible(struct mnt_namespace *ns, -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
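In the patch above, fs_root is declared with __free(path_put) before guard(mount_locked_reader) is taken, so the reference is dropped only after the lock has been released. The sketch below illustrates the ordering rule this relies on: with the GCC/Clang cleanup attribute, handlers run in reverse order of declaration, and the return expression is evaluated before any of them. All names (events, put_ref, drop_lock, chrooted_sketch) are hypothetical and the "lock" and "reference" are plain integers.

```c
#include <assert.h>
#include <string.h>

/* Event log so the order of operations can be inspected afterwards. */
static char events[64];

static void log_event(const char *s)
{
	strcat(events, s);
}

/* Stand-in for path_put(): only drops a reference that was actually taken. */
static void put_ref(int *ref)
{
	if (*ref)
		log_event("put;");
}

/* Stand-in for releasing mount_lock. */
static void drop_lock(int *lock)
{
	(void)lock;
	log_event("unlock;");
}

int chrooted_sketch(int fs_root, int ns_root)
{
	events[0] = '\0';

	/* Declared first, so it is cleaned up last. */
	int ref __attribute__((cleanup(put_ref))) = 0;

	ref = 1;
	log_event("get;");

	/* Declared second, so it is cleaned up first. */
	int lock __attribute__((cleanup(drop_lock), unused)) =
		(log_event("lock;"), 1);

	/* The comparison is evaluated while the lock is still held. */
	return fs_root != ns_root;
}
```

After a call, events reads "get;lock;unlock;put;": the unlock precedes the put, which is the property needed to keep a potentially heavy release out of the locked region.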
* [PATCH v3 18/65] switch do_new_mount_fc() to fc_mount() 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (15 preceding siblings ...) 2025-09-03 4:54 ` [PATCH v3 17/65] current_chrooted(): use guards Al Viro @ 2025-09-03 4:54 ` Al Viro 2025-09-03 4:54 ` [PATCH v3 19/65] do_move_mount(): trim local variables Al Viro ` (57 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:54 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Prior to the call of do_new_mount_fc() the caller has just done successful vfs_get_tree(). Then do_new_mount_fc() does several checks on resulting superblock, and either does fc_drop_locked() and returns an error or proceeds to unlock the superblock and call vfs_create_mount(). The thing is, there's no reason to delay that unlock + vfs_create_mount() - the tests do not rely upon the state of ->s_umount and fc_drop_locked() put_fs_context() is equivalent to unlock ->s_umount put_fs_context() Doing vfs_create_mount() before the checks allows us to move vfs_get_tree() from caller to do_new_mount_fc() and collapse it with vfs_create_mount() into an fc_mount() call. 
Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 29 ++++++++++++----------------- 1 file changed, 12 insertions(+), 17 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 0474b3a93dbf..9b575c9eee0b 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3705,25 +3705,20 @@ static bool mount_too_revealing(const struct super_block *sb, int *new_mnt_flags static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint, unsigned int mnt_flags) { - struct vfsmount *mnt; struct pinned_mountpoint mp = {}; - struct super_block *sb = fc->root->d_sb; + struct super_block *sb; + struct vfsmount *mnt = fc_mount(fc); int error; + if (IS_ERR(mnt)) + return PTR_ERR(mnt); + + sb = fc->root->d_sb; error = security_sb_kern_mount(sb); if (!error && mount_too_revealing(sb, &mnt_flags)) error = -EPERM; - - if (unlikely(error)) { - fc_drop_locked(fc); - return error; - } - - up_write(&sb->s_umount); - - mnt = vfs_create_mount(fc); - if (IS_ERR(mnt)) - return PTR_ERR(mnt); + if (unlikely(error)) + goto out; mnt_warn_timestamp_expiry(mountpoint, mnt); @@ -3731,10 +3726,12 @@ static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint, if (!error) { error = do_add_mount(real_mount(mnt), mp.mp, mountpoint, mnt_flags); + if (!error) + mnt = NULL; // consumed on success unlock_mount(&mp); } - if (error < 0) - mntput(mnt); +out: + mntput(mnt); return error; } @@ -3788,8 +3785,6 @@ static int do_new_mount(struct path *path, const char *fstype, int sb_flags, err = parse_monolithic_mount_data(fc, data); if (!err && !mount_capable(fc)) err = -EPERM; - if (!err) - err = vfs_get_tree(fc); if (!err) err = do_new_mount_fc(fc, path, mnt_flags); -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
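The do_new_mount_fc() change above ends with an unconditional mntput(mnt) and sets mnt to NULL once do_add_mount() has taken over the reference. A small userspace sketch of that "clear the pointer on success, release unconditionally" shape follows. The types and helpers (toy_mnt, toy_mntput, toy_add_mount, new_mount_sketch) are invented for illustration; the only property borrowed from the kernel is that the release helper tolerates NULL, as mntput() does.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct toy_mnt {
	int refs;
};

/* NULL-safe release, mirroring the way mntput() tolerates a NULL argument. */
static void toy_mntput(struct toy_mnt *m)
{
	if (m)
		m->refs--;
}

/* On success the reference moves to *slot; on failure the caller keeps it. */
static int toy_add_mount(struct toy_mnt *m, struct toy_mnt **slot, int fail)
{
	if (fail)
		return -EBUSY;
	*slot = m;
	return 0;
}

/* Takes ownership of one reference to mnt, whatever the outcome. */
int new_mount_sketch(struct toy_mnt *mnt, struct toy_mnt **slot,
		     int check_err, int add_fails)
{
	int error = check_err;

	if (error)
		goto out;

	error = toy_add_mount(mnt, slot, add_fails);
	if (!error)
		mnt = NULL;	/* consumed on success */
out:
	toy_mntput(mnt);	/* no-op when the reference was handed over */
	return error;
}
```

With a single exit path, every failure drops the reference exactly once and success drops nothing, without a separate "if (error < 0)" test before the release.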
* [PATCH v3 19/65] do_move_mount(): trim local variables 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (16 preceding siblings ...) 2025-09-03 4:54 ` [PATCH v3 18/65] switch do_new_mount_fc() to fc_mount() Al Viro @ 2025-09-03 4:54 ` Al Viro 2025-09-03 4:54 ` [PATCH v3 20/65] do_move_mount(): deal with the checks on old_path early Al Viro ` (56 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:54 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Both 'parent' and 'ns' are used at most once, no point precalculating those... Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 12 ++++-------- 1 file changed, 4 insertions(+), 8 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 9b575c9eee0b..ad9b5687ff15 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3564,10 +3564,8 @@ static inline bool may_use_mount(struct mount *mnt) static int do_move_mount(struct path *old_path, struct path *new_path, enum mnt_tree_flags_t flags) { - struct mnt_namespace *ns; struct mount *p; struct mount *old; - struct mount *parent; struct pinned_mountpoint mp; int err; bool beneath = flags & MNT_TREE_BENEATH; @@ -3578,8 +3576,6 @@ static int do_move_mount(struct path *old_path, old = real_mount(old_path->mnt); p = real_mount(new_path->mnt); - parent = old->mnt_parent; - ns = old->mnt_ns; err = -EINVAL; @@ -3588,12 +3584,12 @@ static int do_move_mount(struct path *old_path, /* ... it should be detachable from parent */ if (!mnt_has_parent(old) || IS_MNT_LOCKED(old)) goto out; + /* ... which should not be shared */ + if (IS_MNT_SHARED(old->mnt_parent)) + goto out; /* ... 
and the target should be in our namespace */ if (!check_mnt(p)) goto out; - /* parent of the source should not be shared */ - if (IS_MNT_SHARED(parent)) - goto out; } else { /* * otherwise the source must be the root of some anon namespace. @@ -3605,7 +3601,7 @@ static int do_move_mount(struct path *old_path, * subsequent checks would've rejected that, but they lose * some corner cases if we check it early. */ - if (ns == p->mnt_ns) + if (old->mnt_ns == p->mnt_ns) goto out; /* * Target should be either in our namespace or in an acceptable -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v3 20/65] do_move_mount(): deal with the checks on old_path early 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (17 preceding siblings ...) 2025-09-03 4:54 ` [PATCH v3 19/65] do_move_mount(): trim local variables Al Viro @ 2025-09-03 4:54 ` Al Viro 2025-09-03 4:54 ` [PATCH v3 21/65] move_mount(2): take sanity checks in 'beneath' case into do_lock_mount() Al Viro ` (55 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:54 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds 1) checking that location we want to move does point to root of some mount can be done before anything else; that property is not going to change and having it already verified simplifies the analysis. 2) checking the type agreement between what we are trying to move and what we are trying to move it onto also belongs in the very beginning - do_lock_mount() might end up switching new_path to something that overmounts the original location, but... the same type agreement applies to overmounts, so we could just as well check against the original location. 3) since we know that old_path->dentry is the root of old_path->mnt, there's no point bothering with path_is_overmounted() in can_move_mount_beneath(); it's simply a check for the mount we are trying to move having non-NULL ->overmount. And with that, we can switch can_move_mount_beneath() to taking old instead of old_path, leaving no uses of old_path past the original checks. 
Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 29 +++++++++++++---------------- 1 file changed, 13 insertions(+), 16 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index ad9b5687ff15..74c67ea1b5a8 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3433,7 +3433,7 @@ static bool mount_is_ancestor(const struct mount *p1, const struct mount *p2) /** * can_move_mount_beneath - check that we can mount beneath the top mount - * @from: mount to mount beneath + * @mnt_from: mount we are trying to move * @to: mount under which to mount * @mp: mountpoint of @to * @@ -3443,7 +3443,7 @@ static bool mount_is_ancestor(const struct mount *p1, const struct mount *p2) * root or the rootfs of the namespace. * - Make sure that the caller can unmount the topmost mount ensuring * that the caller could reveal the underlying mountpoint. - * - Ensure that nothing has been mounted on top of @from before we + * - Ensure that nothing has been mounted on top of @mnt_from before we * grabbed @namespace_sem to avoid creating pointless shadow mounts. * - Prevent mounting beneath a mount if the propagation relationship * between the source mount, parent mount, and top mount would lead to @@ -3452,12 +3452,11 @@ static bool mount_is_ancestor(const struct mount *p1, const struct mount *p2) * Context: This function expects namespace_lock() to be held. * Return: On success 0, and on error a negative error code is returned. 
*/ -static int can_move_mount_beneath(const struct path *from, +static int can_move_mount_beneath(struct mount *mnt_from, const struct path *to, const struct mountpoint *mp) { - struct mount *mnt_from = real_mount(from->mnt), - *mnt_to = real_mount(to->mnt), + struct mount *mnt_to = real_mount(to->mnt), *parent_mnt_to = mnt_to->mnt_parent; if (!mnt_has_parent(mnt_to)) @@ -3470,7 +3469,7 @@ static int can_move_mount_beneath(const struct path *from, return -EINVAL; /* Avoid creating shadow mounts during mount propagation. */ - if (path_overmounted(from)) + if (mnt_from->overmount) return -EINVAL; /* @@ -3565,16 +3564,21 @@ static int do_move_mount(struct path *old_path, struct path *new_path, enum mnt_tree_flags_t flags) { struct mount *p; - struct mount *old; + struct mount *old = real_mount(old_path->mnt); struct pinned_mountpoint mp; int err; bool beneath = flags & MNT_TREE_BENEATH; + if (!path_mounted(old_path)) + return -EINVAL; + + if (d_is_dir(new_path->dentry) != d_is_dir(old_path->dentry)) + return -EINVAL; + err = do_lock_mount(new_path, &mp, beneath); if (err) return err; - old = real_mount(old_path->mnt); p = real_mount(new_path->mnt); err = -EINVAL; @@ -3611,15 +3615,8 @@ static int do_move_mount(struct path *old_path, goto out; } - if (!path_mounted(old_path)) - goto out; - - if (d_is_dir(new_path->dentry) != - d_is_dir(old_path->dentry)) - goto out; - if (beneath) { - err = can_move_mount_beneath(old_path, new_path, mp.mp); + err = can_move_mount_beneath(old, new_path, mp.mp); if (err) goto out; -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v3 21/65] move_mount(2): take sanity checks in 'beneath' case into do_lock_mount() 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (18 preceding siblings ...) 2025-09-03 4:54 ` [PATCH v3 20/65] do_move_mount(): deal with the checks on old_path early Al Viro @ 2025-09-03 4:54 ` Al Viro 2025-09-03 4:54 ` [PATCH v3 22/65] finish_automount(): simplify the ELOOP check Al Viro ` (54 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:54 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds We want to mount beneath the given location. For that operation to make sense, location must be the root of some mount that has something under it. Currently we let it proceed if those requirements are not met, with rather meaningless results, and have that bogosity caught further down the road; let's fail early instead - do_lock_mount() doesn't make sense unless those conditions hold, and checking them there makes things simpler. 
Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 15 +++++++-------- 1 file changed, 7 insertions(+), 8 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 74c67ea1b5a8..86c6dd432b13 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2768,12 +2768,19 @@ static int do_lock_mount(struct path *path, struct pinned_mountpoint *pinned, bo struct path under = {}; int err = -ENOENT; + if (unlikely(beneath) && !path_mounted(path)) + return -EINVAL; + for (;;) { struct mount *m = real_mount(mnt); if (beneath) { path_put(&under); read_seqlock_excl(&mount_lock); + if (unlikely(!mnt_has_parent(m))) { + read_sequnlock_excl(&mount_lock); + return -EINVAL; + } under.mnt = mntget(&m->mnt_parent->mnt); under.dentry = dget(m->mnt_mountpoint); read_sequnlock_excl(&mount_lock); @@ -3437,8 +3444,6 @@ static bool mount_is_ancestor(const struct mount *p1, const struct mount *p2) * @to: mount under which to mount * @mp: mountpoint of @to * - * - Make sure that @to->dentry is actually the root of a mount under - * which we can mount another mount. * - Make sure that nothing can be mounted beneath the caller's current * root or the rootfs of the namespace. * - Make sure that the caller can unmount the topmost mount ensuring @@ -3459,12 +3464,6 @@ static int can_move_mount_beneath(struct mount *mnt_from, struct mount *mnt_to = real_mount(to->mnt), *parent_mnt_to = mnt_to->mnt_parent; - if (!mnt_has_parent(mnt_to)) - return -EINVAL; - - if (!path_mounted(to)) - return -EINVAL; - if (IS_MNT_LOCKED(mnt_to)) return -EINVAL; -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v3 22/65] finish_automount(): simplify the ELOOP check 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (19 preceding siblings ...) 2025-09-03 4:54 ` [PATCH v3 21/65] move_mount(2): take sanity checks in 'beneath' case into do_lock_mount() Al Viro @ 2025-09-03 4:54 ` Al Viro 2025-09-03 4:54 ` [PATCH v3 23/65] do_loopback(): use __free(path_put) to deal with old_path Al Viro ` (53 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:54 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds It's enough to check that dentries match; if path->dentry is equal to m->mnt_root, superblocks will match as well. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 86c6dd432b13..bdb33270ac6e 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3798,8 +3798,7 @@ int finish_automount(struct vfsmount *m, const struct path *path) mnt = real_mount(m); - if (m->mnt_sb == path->mnt->mnt_sb && - m->mnt_root == dentry) { + if (m->mnt_root == path->dentry) { err = -ELOOP; goto discard; } -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v3 23/65] do_loopback(): use __free(path_put) to deal with old_path
  2025-09-03  4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro
                     ` (20 preceding siblings ...)
  2025-09-03  4:54 ` [PATCH v3 22/65] finish_automount(): simplify the ELOOP check Al Viro
@ 2025-09-03  4:54 ` Al Viro
  2025-09-03  4:54 ` [PATCH v3 24/65] pivot_root(2): use __free() to deal with struct path in it Al Viro
                     ` (52 subsequent siblings)
  74 siblings, 0 replies; 321+ messages in thread
From: Al Viro @ 2025-09-03  4:54 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: brauner, jack, torvalds

preparations for making unlock_mount() a __cleanup();
can't have path_put() inside mount_lock scope.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/namespace.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index bdb33270ac6e..245cf2d19a6b 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3014,7 +3014,7 @@ static struct mount *__do_loopback(struct path *old_path, int recurse)
 static int do_loopback(struct path *path, const char *old_name,
 				int recurse)
 {
-	struct path old_path;
+	struct path old_path __free(path_put) = {};
 	struct mount *mnt = NULL, *parent;
 	struct pinned_mountpoint mp = {};
 	int err;
@@ -3024,13 +3024,12 @@ static int do_loopback(struct path *path, const char *old_name,
 	if (err)
 		return err;
 
-	err = -EINVAL;
 	if (mnt_ns_loop(old_path.dentry))
-		goto out;
+		return -EINVAL;
 
 	err = lock_mount(path, &mp);
 	if (err)
-		goto out;
+		return err;
 
 	parent = real_mount(path->mnt);
 	if (!check_mnt(parent))
@@ -3050,8 +3049,6 @@ static int do_loopback(struct path *path, const char *old_name,
 	}
 out2:
 	unlock_mount(&mp);
-out:
-	path_put(&old_path);
 	return err;
 }
 
--
2.47.2

^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v3 24/65] pivot_root(2): use __free() to deal with struct path in it 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (21 preceding siblings ...) 2025-09-03 4:54 ` [PATCH v3 23/65] do_loopback(): use __free(path_put) to deal with old_path Al Viro @ 2025-09-03 4:54 ` Al Viro 2025-09-03 4:54 ` [PATCH v3 25/65] finish_automount(): take the lock_mount() analogue into a helper Al Viro ` (51 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:54 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds preparations for making unlock_mount() a __cleanup(); can't have path_put() inside mount_lock scope. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 19 +++++++------------ 1 file changed, 7 insertions(+), 12 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 245cf2d19a6b..90b62ee882da 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -4622,7 +4622,9 @@ EXPORT_SYMBOL(path_is_under); SYSCALL_DEFINE2(pivot_root, const char __user *, new_root, const char __user *, put_old) { - struct path new, old, root; + struct path new __free(path_put) = {}; + struct path old __free(path_put) = {}; + struct path root __free(path_put) = {}; struct mount *new_mnt, *root_mnt, *old_mnt, *root_parent, *ex_parent; struct pinned_mountpoint old_mp = {}; int error; @@ -4633,21 +4635,21 @@ SYSCALL_DEFINE2(pivot_root, const char __user *, new_root, error = user_path_at(AT_FDCWD, new_root, LOOKUP_FOLLOW | LOOKUP_DIRECTORY, &new); if (error) - goto out0; + return error; error = user_path_at(AT_FDCWD, put_old, LOOKUP_FOLLOW | LOOKUP_DIRECTORY, &old); if (error) - goto out1; + return error; error = security_sb_pivotroot(&old, &new); if (error) - goto out2; + return error; get_fs_root(current->fs, &root); error = lock_mount(&old, &old_mp); if (error) - goto out3; + return error; error = -EINVAL; new_mnt = 
real_mount(new.mnt); @@ -4705,13 +4707,6 @@ SYSCALL_DEFINE2(pivot_root, const char __user *, new_root, error = 0; out4: unlock_mount(&old_mp); -out3: - path_put(&root); -out2: - path_put(&old); -out1: - path_put(&new); -out0: return error; } -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
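The goto-label removal in the two patches above can be sketched in user space. This is only an illustration, assuming GCC or Clang (it relies on the `cleanup` variable attribute); `struct ref`, `ref_get()`, `ref_put_cleanup()`, `AUTO_REF` and `use_two()` are hypothetical names, not the kernel's `struct path` or its `__free(path_put)` definition.

```c
#include <stddef.h>

/* Hypothetical stand-in for a refcounted object. */
struct ref {
	int count;
};

static void ref_get(struct ref *r)
{
	r->count++;
}

/* Cleanup handler: the compiler passes a pointer to the annotated variable. */
static void ref_put_cleanup(struct ref **rp)
{
	if (*rp)
		(*rp)->count--;
}

/* Hypothetical shorthand, loosely analogous in spirit to __free(...). */
#define AUTO_REF __attribute__((cleanup(ref_put_cleanup)))

/*
 * Takes a reference on a, then on b; bails out in between if 'fail' is set.
 * Every return path drops whatever has been taken so far, so there is no
 * chain of out0/out1/out2 labels to maintain.
 */
static int use_two(struct ref *a, struct ref *b, int fail)
{
	struct ref *x AUTO_REF = NULL;
	struct ref *y AUTO_REF = NULL;

	ref_get(a);
	x = a;
	if (fail)
		return -1;	/* only x is dropped; y is still NULL */

	ref_get(b);
	y = b;
	return 0;		/* y is dropped first, then x */
}
```

Cleanups run in reverse order of declaration, which is why declaration order matters once locks enter the picture.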
* [PATCH v3 25/65] finish_automount(): take the lock_mount() analogue into a helper
  2025-09-03  4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro
                     ` (22 preceding siblings ...)
  2025-09-03  4:54 ` [PATCH v3 24/65] pivot_root(2): use __free() to deal with struct path in it Al Viro
@ 2025-09-03  4:54 ` Al Viro
  2025-09-03  4:54 ` [PATCH v3 26/65] do_new_mount_fc(): use __free() to deal with dropping mnt on failure Al Viro
                     ` (50 subsequent siblings)
  74 siblings, 0 replies; 321+ messages in thread
From: Al Viro @ 2025-09-03  4:54 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: brauner, jack, torvalds

finish_automount() can't use lock_mount() - it treats finding something
already mounted as "quietly drop our mount and return 0", not as
"mount on top of whatever is mounted there". It's been open-coded; let's
take it into a helper similar to lock_mount(). "something's already
mounted" => -EBUSY; finish_automount() needs to distinguish that from
the normal case, and -EBUSY can't come from the other failure cases.

Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 42 +++++++++++++++++++++++++----------------- 1 file changed, 25 insertions(+), 17 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 90b62ee882da..6251ee15f5f6 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3781,9 +3781,29 @@ static int do_new_mount(struct path *path, const char *fstype, int sb_flags, return err; } -int finish_automount(struct vfsmount *m, const struct path *path) +static int lock_mount_exact(const struct path *path, + struct pinned_mountpoint *mp) { struct dentry *dentry = path->dentry; + int err; + + inode_lock(dentry->d_inode); + namespace_lock(); + if (unlikely(cant_mount(dentry))) + err = -ENOENT; + else if (path_overmounted(path)) + err = -EBUSY; + else + err = get_mountpoint(dentry, mp); + if (unlikely(err)) { + namespace_unlock(); + inode_unlock(dentry->d_inode); + } + return err; +} + +int finish_automount(struct vfsmount *m, const struct path *path) +{ struct pinned_mountpoint mp = {}; struct mount *mnt; int err; @@ -3805,20 +3825,11 @@ int finish_automount(struct vfsmount *m, const struct path *path) * that overmounts our mountpoint to be means "quitely drop what we've * got", not "try to mount it on top". */ - inode_lock(dentry->d_inode); - namespace_lock(); - if (unlikely(cant_mount(dentry))) { - err = -ENOENT; - goto discard_locked; - } - if (path_overmounted(path)) { - err = 0; - goto discard_locked; + err = lock_mount_exact(path, &mp); + if (unlikely(err)) { + mntput(m); + return err == -EBUSY ? 
0 : err; } - err = get_mountpoint(dentry, &mp); - if (err) - goto discard_locked; - err = do_add_mount(mnt, mp.mp, path, path->mnt->mnt_flags | MNT_SHRINKABLE); unlock_mount(&mp); @@ -3826,9 +3837,6 @@ int finish_automount(struct vfsmount *m, const struct path *path) goto discard; return 0; -discard_locked: - namespace_unlock(); - inode_unlock(dentry->d_inode); discard: mntput(m); return err; -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
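The calling convention of lock_mount_exact() above (one errno reserved for "somebody got there first", translated to success by the one caller that wants that) can be sketched as follows. This is a user-space illustration only; `struct slot`, `lock_slot_exact()` and `fill_slot_quietly()` are hypothetical names and the "lock" is just a counter.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical slot that may be unusable or already occupied. */
struct slot {
	bool dead;	/* cannot be used at all */
	bool occupied;	/* somebody got there first */
	int locked;	/* lock depth, for illustration only */
};

/*
 * Lock the slot only if it is usable and empty.
 * Returns 0 with the lock held, or a negative errno with the lock dropped.
 * -EBUSY is produced only by the "already occupied" case.
 */
static int lock_slot_exact(struct slot *s)
{
	int err = 0;

	s->locked++;
	if (s->dead)
		err = -ENOENT;
	else if (s->occupied)
		err = -EBUSY;
	if (err)
		s->locked--;
	return err;
}

/* A caller for which "already occupied" means "nothing left to do". */
static int fill_slot_quietly(struct slot *s)
{
	int err = lock_slot_exact(s);

	if (err)
		return err == -EBUSY ? 0 : err;
	s->occupied = true;
	s->locked--;
	return 0;
}
```

The translation stays in the caller, so the helper remains usable by code that wants to see -EBUSY as an error.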
* [PATCH v3 26/65] do_new_mount_fc(): use __free() to deal with dropping mnt on failure 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (23 preceding siblings ...) 2025-09-03 4:54 ` [PATCH v3 25/65] finish_automount(): take the lock_mount() analogue into a helper Al Viro @ 2025-09-03 4:54 ` Al Viro 2025-09-03 4:54 ` [PATCH v3 26/63] do_new_mount_rc(): " Al Viro ` (49 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:54 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds do_add_mount() consumes vfsmount on success; just follow it with conditional retain_and_null_ptr() on success and we can switch to __free() for mnt and be done with that - unlock_mount() is in the very end. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 6251ee15f5f6..3551e51461a2 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3696,7 +3696,7 @@ static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint, { struct pinned_mountpoint mp = {}; struct super_block *sb; - struct vfsmount *mnt = fc_mount(fc); + struct vfsmount *mnt __free(mntput) = fc_mount(fc); int error; if (IS_ERR(mnt)) @@ -3704,10 +3704,11 @@ static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint, sb = fc->root->d_sb; error = security_sb_kern_mount(sb); - if (!error && mount_too_revealing(sb, &mnt_flags)) - error = -EPERM; if (unlikely(error)) - goto out; + return error; + + if (unlikely(mount_too_revealing(sb, &mnt_flags))) + return -EPERM; mnt_warn_timestamp_expiry(mountpoint, mnt); @@ -3716,11 +3717,9 @@ static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint, error = do_add_mount(real_mount(mnt), mp.mp, mountpoint, mnt_flags); if (!error) - mnt = NULL; // consumed on success + 
retain_and_null_ptr(mnt); // consumed on success unlock_mount(&mp); } -out: - mntput(mnt); return error; } -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v3 27/65] finish_automount(): use __free() to deal with dropping mnt on failure 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (25 preceding siblings ...) 2025-09-03 4:54 ` [PATCH v3 26/63] do_new_mount_rc(): " Al Viro @ 2025-09-03 4:54 ` Al Viro 2025-09-03 4:54 ` [PATCH v3 28/65] change calling conventions for lock_mount() et.al Al Viro ` (47 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:54 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds same story as with do_new_mount_fc(). Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 22 ++++++++-------------- 1 file changed, 8 insertions(+), 14 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 3551e51461a2..779cfed04291 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3801,8 +3801,9 @@ static int lock_mount_exact(const struct path *path, return err; } -int finish_automount(struct vfsmount *m, const struct path *path) +int finish_automount(struct vfsmount *__m, const struct path *path) { + struct vfsmount *m __free(mntput) = __m; struct pinned_mountpoint mp = {}; struct mount *mnt; int err; @@ -3814,10 +3815,8 @@ int finish_automount(struct vfsmount *m, const struct path *path) mnt = real_mount(m); - if (m->mnt_root == path->dentry) { - err = -ELOOP; - goto discard; - } + if (m->mnt_root == path->dentry) + return -ELOOP; /* * we don't want to use lock_mount() - in this case finding something @@ -3825,19 +3824,14 @@ int finish_automount(struct vfsmount *m, const struct path *path) * got", not "try to mount it on top". */ err = lock_mount_exact(path, &mp); - if (unlikely(err)) { - mntput(m); + if (unlikely(err)) return err == -EBUSY ? 
0 : err; - } + err = do_add_mount(mnt, mp.mp, path, path->mnt->mnt_flags | MNT_SHRINKABLE); + if (likely(!err)) + retain_and_null_ptr(m); unlock_mount(&mp); - if (unlikely(err)) - goto discard; - return 0; - -discard: - mntput(m); return err; } -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
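The "consumed on success" idiom used in do_new_mount_fc() and finish_automount() can be sketched in user space like this. It assumes GCC or Clang for the `cleanup` attribute; `struct obj`, `attach()` and `create_and_attach()` are hypothetical, and a plain NULL assignment stands in for the kernel's retain_and_null_ptr().

```c
#include <stddef.h>

/* Hypothetical refcounted object. */
struct obj {
	int refs;
};

static void obj_put_cleanup(struct obj **op)
{
	if (*op)
		(*op)->refs--;
}

/* Hypothetical consumer: takes over the caller's reference on success. */
static int attach(struct obj *o, int fail)
{
	(void)o;
	return fail ? -1 : 0;
}

static int create_and_attach(struct obj *o, int fail)
{
	struct obj *held __attribute__((cleanup(obj_put_cleanup))) = o;
	int err;

	o->refs++;		/* the reference this function owns */
	err = attach(held, fail);
	if (!err)
		held = NULL;	/* consumed on success: disarm the cleanup */
	return err;
}
```

On failure the reference is dropped automatically at return; on success it stays with whoever attach() handed it to.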
* [PATCH v3 28/65] change calling conventions for lock_mount() et.al.
  2025-09-03  4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro
                     ` (26 preceding siblings ...)
  2025-09-03  4:54 ` [PATCH v3 27/65] finish_automount(): " Al Viro
@ 2025-09-03  4:54 ` Al Viro
  2025-09-03  4:54 ` [PATCH v3 29/65] do_move_mount(): use the parent mount returned by do_lock_mount() Al Viro
                     ` (46 subsequent siblings)
  74 siblings, 0 replies; 321+ messages in thread
From: Al Viro @ 2025-09-03  4:54 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: brauner, jack, torvalds

1) pinned_mountpoint gets a new member - struct mount *parent.
Set only if we locked the sucker; ERR_PTR() - on failed attempt.

2) do_lock_mount() et al. return void and set ->parent to
	* on success with !beneath - mount corresponding to path->mnt
	* on success with beneath - the parent of mount corresponding
	  to path->mnt
	* in case of error - ERR_PTR(-E...).
IOW, we get the mount we will be actually mounting upon or ERR_PTR().

3) we can't use CLASS, since the pinned_mountpoint is placed on
hlist during initialization, so we define local macros:
	LOCK_MOUNT(mp, path)
	LOCK_MOUNT_MAYBE_BENEATH(mp, path, beneath)
	LOCK_MOUNT_EXACT(mp, path)
All of them declare and initialize struct pinned_mountpoint mp,
with unlock_mount done via __cleanup().

Users converted.

Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 219 ++++++++++++++++++++++++------------------------- 1 file changed, 108 insertions(+), 111 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 779cfed04291..952e66bdb9bb 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -919,6 +919,7 @@ bool __is_local_mountpoint(const struct dentry *dentry) struct pinned_mountpoint { struct hlist_node node; struct mountpoint *mp; + struct mount *parent; }; static bool lookup_mountpoint(struct dentry *dentry, struct pinned_mountpoint *m) @@ -2728,48 +2729,47 @@ static int attach_recursive_mnt(struct mount *source_mnt, } /** - * do_lock_mount - lock mount and mountpoint - * @path: target path - * @beneath: whether the intention is to mount beneath @path + * do_lock_mount - acquire environment for mounting + * @path: target path + * @res: context to set up + * @beneath: whether the intention is to mount beneath @path * - * Follow the mount stack on @path until the top mount @mnt is found. If - * the initial @path->{mnt,dentry} is a mountpoint lookup the first - * mount stacked on top of it. Then simply follow @{mnt,mnt->mnt_root} - * until nothing is stacked on top of it anymore. + * To mount something at given location, we need + * namespace_sem locked exclusive + * inode of dentry we are mounting on locked exclusive + * struct mountpoint for that dentry + * struct mount we are mounting on * - * Acquire the inode_lock() on the top mount's ->mnt_root to protect - * against concurrent removal of the new mountpoint from another mount - * namespace. + * Results are stored in caller-supplied context (pinned_mountpoint); + * on success we have res->parent and res->mp pointing to parent and + * mountpoint respectively and res->node inserted into the ->m_list + * of the mountpoint, making sure the mountpoint won't disappear. 
+ * On failure we have res->parent set to ERR_PTR(-E...), res->mp + * left NULL, res->node - empty. + * In case of success do_lock_mount returns with locks acquired (in + * proper order - inode lock nests outside of namespace_sem). * - * If @beneath is requested, acquire inode_lock() on @mnt's mountpoint - * @mp on @mnt->mnt_parent must be acquired. This protects against a - * concurrent unlink of @mp->mnt_dentry from another mount namespace - * where @mnt doesn't have a child mount mounted @mp. A concurrent - * removal of @mnt->mnt_root doesn't matter as nothing will be mounted - * on top of it for @beneath. + * Request to mount on overmounted location is treated as "mount on + * top of whatever's overmounting it"; request to mount beneath + * a location - "mount immediately beneath the topmost mount at that + * place". * - * In addition, @beneath needs to make sure that @mnt hasn't been - * unmounted or moved from its current mountpoint in between dropping - * @mount_lock and acquiring @namespace_sem. For the !@beneath case @mnt - * being unmounted would be detected later by e.g., calling - * check_mnt(mnt) in the function it's called from. For the @beneath - * case however, it's useful to detect it directly in do_lock_mount(). - * If @mnt hasn't been unmounted then @mnt->mnt_mountpoint still points - * to @mnt->mnt_mp->m_dentry. But if @mnt has been unmounted it will - * point to @mnt->mnt_root and @mnt->mnt_mp will be NULL. - * - * Return: Either the target mountpoint on the top mount or the top - * mount's mountpoint. + * In all cases the location must not have been unmounted and the + * chosen mountpoint must be allowed to be mounted on. For "beneath" + * case we also require the location to be at the root of a mount + * that has a parent (i.e. is not a root of some namespace). 
*/ -static int do_lock_mount(struct path *path, struct pinned_mountpoint *pinned, bool beneath) +static void do_lock_mount(struct path *path, struct pinned_mountpoint *res, bool beneath) { struct vfsmount *mnt = path->mnt; struct dentry *dentry; struct path under = {}; int err = -ENOENT; - if (unlikely(beneath) && !path_mounted(path)) - return -EINVAL; + if (unlikely(beneath) && !path_mounted(path)) { + res->parent = ERR_PTR(-EINVAL); + return; + } for (;;) { struct mount *m = real_mount(mnt); @@ -2779,7 +2779,8 @@ static int do_lock_mount(struct path *path, struct pinned_mountpoint *pinned, bo read_seqlock_excl(&mount_lock); if (unlikely(!mnt_has_parent(m))) { read_sequnlock_excl(&mount_lock); - return -EINVAL; + res->parent = ERR_PTR(-EINVAL); + return; } under.mnt = mntget(&m->mnt_parent->mnt); under.dentry = dget(m->mnt_mountpoint); @@ -2811,7 +2812,7 @@ static int do_lock_mount(struct path *path, struct pinned_mountpoint *pinned, bo path->dentry = dget(mnt->mnt_root); continue; // got overmounted } - err = get_mountpoint(dentry, pinned); + err = get_mountpoint(dentry, res); if (err) break; if (beneath) { @@ -2822,22 +2823,25 @@ static int do_lock_mount(struct path *path, struct pinned_mountpoint *pinned, bo * we are not dropping the final references here). 
*/ path_put(&under); + res->parent = real_mount(path->mnt)->mnt_parent; + return; } - return 0; + res->parent = real_mount(path->mnt); + return; } namespace_unlock(); inode_unlock(dentry->d_inode); if (beneath) path_put(&under); - return err; + res->parent = ERR_PTR(err); } -static inline int lock_mount(struct path *path, struct pinned_mountpoint *m) +static inline void lock_mount(struct path *path, struct pinned_mountpoint *m) { - return do_lock_mount(path, m, false); + do_lock_mount(path, m, false); } -static void unlock_mount(struct pinned_mountpoint *m) +static void __unlock_mount(struct pinned_mountpoint *m) { inode_unlock(m->mp->m_dentry->d_inode); read_seqlock_excl(&mount_lock); @@ -2846,6 +2850,20 @@ static void unlock_mount(struct pinned_mountpoint *m) namespace_unlock(); } +static inline void unlock_mount(struct pinned_mountpoint *m) +{ + if (!IS_ERR(m->parent)) + __unlock_mount(m); +} + +#define LOCK_MOUNT_MAYBE_BENEATH(mp, path, beneath) \ + struct pinned_mountpoint mp __cleanup(unlock_mount) = {}; \ + do_lock_mount((path), &mp, (beneath)) +#define LOCK_MOUNT(mp, path) LOCK_MOUNT_MAYBE_BENEATH(mp, (path), false) +#define LOCK_MOUNT_EXACT(mp, path) \ + struct pinned_mountpoint mp __cleanup(unlock_mount) = {}; \ + lock_mount_exact((path), &mp) + static int graft_tree(struct mount *mnt, struct mount *p, struct mountpoint *mp) { if (mnt->mnt.mnt_sb->s_flags & SB_NOUSER) @@ -3015,8 +3033,7 @@ static int do_loopback(struct path *path, const char *old_name, int recurse) { struct path old_path __free(path_put) = {}; - struct mount *mnt = NULL, *parent; - struct pinned_mountpoint mp = {}; + struct mount *mnt = NULL; int err; if (!old_name || !*old_name) return -EINVAL; @@ -3027,28 +3044,23 @@ static int do_loopback(struct path *path, const char *old_name, if (mnt_ns_loop(old_path.dentry)) return -EINVAL; - err = lock_mount(path, &mp); - if (err) - return err; + LOCK_MOUNT(mp, path); + if (IS_ERR(mp.parent)) + return PTR_ERR(mp.parent); - parent = 
real_mount(path->mnt); - if (!check_mnt(parent)) - goto out2; + if (!check_mnt(mp.parent)) + return -EINVAL; mnt = __do_loopback(&old_path, recurse); - if (IS_ERR(mnt)) { - err = PTR_ERR(mnt); - goto out2; - } + if (IS_ERR(mnt)) + return PTR_ERR(mnt); - err = graft_tree(mnt, parent, mp.mp); + err = graft_tree(mnt, mp.parent, mp.mp); if (err) { lock_mount_hash(); umount_tree(mnt, UMOUNT_SYNC); unlock_mount_hash(); } -out2: - unlock_mount(&mp); return err; } @@ -3561,7 +3573,6 @@ static int do_move_mount(struct path *old_path, { struct mount *p; struct mount *old = real_mount(old_path->mnt); - struct pinned_mountpoint mp; int err; bool beneath = flags & MNT_TREE_BENEATH; @@ -3571,52 +3582,49 @@ static int do_move_mount(struct path *old_path, if (d_is_dir(new_path->dentry) != d_is_dir(old_path->dentry)) return -EINVAL; - err = do_lock_mount(new_path, &mp, beneath); - if (err) - return err; + LOCK_MOUNT_MAYBE_BENEATH(mp, new_path, beneath); + if (IS_ERR(mp.parent)) + return PTR_ERR(mp.parent); p = real_mount(new_path->mnt); - err = -EINVAL; - if (check_mnt(old)) { /* if the source is in our namespace... */ /* ... it should be detachable from parent */ if (!mnt_has_parent(old) || IS_MNT_LOCKED(old)) - goto out; + return -EINVAL; /* ... which should not be shared */ if (IS_MNT_SHARED(old->mnt_parent)) - goto out; + return -EINVAL; /* ... and the target should be in our namespace */ if (!check_mnt(p)) - goto out; + return -EINVAL; } else { /* * otherwise the source must be the root of some anon namespace. */ if (!anon_ns_root(old)) - goto out; + return -EINVAL; /* * Bail out early if the target is within the same namespace - * subsequent checks would've rejected that, but they lose * some corner cases if we check it early. */ if (old->mnt_ns == p->mnt_ns) - goto out; + return -EINVAL; /* * Target should be either in our namespace or in an acceptable * anon namespace, sensu check_anonymous_mnt(). 
*/ if (!may_use_mount(p)) - goto out; + return -EINVAL; } if (beneath) { err = can_move_mount_beneath(old, new_path, mp.mp); if (err) - goto out; + return err; - err = -EINVAL; p = p->mnt_parent; } @@ -3625,17 +3633,13 @@ static int do_move_mount(struct path *old_path, * mount which is shared. */ if (IS_MNT_SHARED(p) && tree_contains_unbindable(old)) - goto out; - err = -ELOOP; + return -EINVAL; if (!check_for_nsfs_mounts(old)) - goto out; + return -ELOOP; if (mount_is_ancestor(old, p)) - goto out; + return -ELOOP; - err = attach_recursive_mnt(old, p, mp.mp); -out: - unlock_mount(&mp); - return err; + return attach_recursive_mnt(old, p, mp.mp); } static int do_move_mount_old(struct path *path, const char *old_name) @@ -3694,7 +3698,6 @@ static bool mount_too_revealing(const struct super_block *sb, int *new_mnt_flags static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint, unsigned int mnt_flags) { - struct pinned_mountpoint mp = {}; struct super_block *sb; struct vfsmount *mnt __free(mntput) = fc_mount(fc); int error; @@ -3712,13 +3715,14 @@ static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint, mnt_warn_timestamp_expiry(mountpoint, mnt); - error = lock_mount(mountpoint, &mp); - if (!error) { + LOCK_MOUNT(mp, mountpoint); + if (IS_ERR(mp.parent)) { + return PTR_ERR(mp.parent); + } else { error = do_add_mount(real_mount(mnt), mp.mp, mountpoint, mnt_flags); if (!error) retain_and_null_ptr(mnt); // consumed on success - unlock_mount(&mp); } return error; } @@ -3780,8 +3784,8 @@ static int do_new_mount(struct path *path, const char *fstype, int sb_flags, return err; } -static int lock_mount_exact(const struct path *path, - struct pinned_mountpoint *mp) +static void lock_mount_exact(const struct path *path, + struct pinned_mountpoint *mp) { struct dentry *dentry = path->dentry; int err; @@ -3797,14 +3801,15 @@ static int lock_mount_exact(const struct path *path, if (unlikely(err)) { namespace_unlock(); inode_unlock(dentry->d_inode); 
+ mp->parent = ERR_PTR(err); + } else { + mp->parent = real_mount(path->mnt); } - return err; } int finish_automount(struct vfsmount *__m, const struct path *path) { struct vfsmount *m __free(mntput) = __m; - struct pinned_mountpoint mp = {}; struct mount *mnt; int err; @@ -3823,15 +3828,14 @@ int finish_automount(struct vfsmount *__m, const struct path *path) * that overmounts our mountpoint to be means "quitely drop what we've * got", not "try to mount it on top". */ - err = lock_mount_exact(path, &mp); - if (unlikely(err)) - return err == -EBUSY ? 0 : err; + LOCK_MOUNT_EXACT(mp, path); + if (IS_ERR(mp.parent)) + return mp.parent == ERR_PTR(-EBUSY) ? 0 : PTR_ERR(mp.parent); err = do_add_mount(mnt, mp.mp, path, path->mnt->mnt_flags | MNT_SHRINKABLE); if (likely(!err)) retain_and_null_ptr(m); - unlock_mount(&mp); return err; } @@ -4627,7 +4631,6 @@ SYSCALL_DEFINE2(pivot_root, const char __user *, new_root, struct path old __free(path_put) = {}; struct path root __free(path_put) = {}; struct mount *new_mnt, *root_mnt, *old_mnt, *root_parent, *ex_parent; - struct pinned_mountpoint old_mp = {}; int error; if (!may_mount()) @@ -4648,45 +4651,42 @@ SYSCALL_DEFINE2(pivot_root, const char __user *, new_root, return error; get_fs_root(current->fs, &root); - error = lock_mount(&old, &old_mp); - if (error) - return error; - error = -EINVAL; + LOCK_MOUNT(old_mp, &old); + old_mnt = old_mp.parent; + if (IS_ERR(old_mnt)) + return PTR_ERR(old_mnt); + new_mnt = real_mount(new.mnt); root_mnt = real_mount(root.mnt); - old_mnt = real_mount(old.mnt); ex_parent = new_mnt->mnt_parent; root_parent = root_mnt->mnt_parent; if (IS_MNT_SHARED(old_mnt) || IS_MNT_SHARED(ex_parent) || IS_MNT_SHARED(root_parent)) - goto out4; + return -EINVAL; if (!check_mnt(root_mnt) || !check_mnt(new_mnt)) - goto out4; + return -EINVAL; if (new_mnt->mnt.mnt_flags & MNT_LOCKED) - goto out4; - error = -ENOENT; + return -EINVAL; if (d_unlinked(new.dentry)) - goto out4; - error = -EBUSY; + return -ENOENT; if 
(new_mnt == root_mnt || old_mnt == root_mnt) - goto out4; /* loop, on the same file system */ - error = -EINVAL; + return -EBUSY; /* loop, on the same file system */ if (!path_mounted(&root)) - goto out4; /* not a mountpoint */ + return -EINVAL; /* not a mountpoint */ if (!mnt_has_parent(root_mnt)) - goto out4; /* absolute root */ + return -EINVAL; /* absolute root */ if (!path_mounted(&new)) - goto out4; /* not a mountpoint */ + return -EINVAL; /* not a mountpoint */ if (!mnt_has_parent(new_mnt)) - goto out4; /* absolute root */ + return -EINVAL; /* absolute root */ /* make sure we can reach put_old from new_root */ if (!is_path_reachable(old_mnt, old.dentry, &new)) - goto out4; + return -EINVAL; /* make certain new is below the root */ if (!is_path_reachable(new_mnt, new.dentry, &root)) - goto out4; + return -EINVAL; lock_mount_hash(); umount_mnt(new_mnt); if (root_mnt->mnt.mnt_flags & MNT_LOCKED) { @@ -4705,10 +4705,7 @@ SYSCALL_DEFINE2(pivot_root, const char __user *, new_root, mnt_notify_add(root_mnt); mnt_notify_add(new_mnt); chroot_fs_refs(&root, &new); - error = 0; -out4: - unlock_mount(&old_mp); - return error; + return 0; } static unsigned int recalc_flags(struct mount_kattr *kattr, struct mount *mnt) -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
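The shape of the LOCK_MOUNT() macros above (declare a context with a cleanup handler, report failure through an error pointer stored in the context, and have the handler skip unlocking when the lock was never taken) can be sketched in user space. This assumes GCC or Clang; `struct target`, `struct pinned`, `target_lock()`, `target_unlock()`, `LOCK_TARGET()` and `operate()` are hypothetical, and the ERR_PTR helpers are minimal stand-ins written for this example rather than the kernel's own.

```c
#include <errno.h>
#include <stdint.h>

/* Minimal user-space stand-ins for the ERR_PTR family. */
#define MAX_ERRNO 4095
static inline void *ERR_PTR(long err) { return (void *)(intptr_t)err; }
static inline long PTR_ERR(const void *p) { return (long)(intptr_t)p; }
static inline int IS_ERR(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical lockable thing; 'locked' is just a depth counter. */
struct target {
	int locked;
	int dead;
};

/* Context filled in by target_lock(): the locked target or an ERR_PTR. */
struct pinned {
	struct target *parent;
};

static void target_lock(struct target *t, struct pinned *res)
{
	if (t->dead) {
		res->parent = ERR_PTR(-ENOENT);	/* nothing was locked */
		return;
	}
	t->locked++;
	res->parent = t;
}

/* Cleanup handler: unlock only if the lock was actually taken. */
static void target_unlock(struct pinned *p)
{
	if (!IS_ERR(p->parent))
		p->parent->locked--;
}

/* Declares the context and locks in one go; unlock happens at scope exit. */
#define LOCK_TARGET(name, t) \
	struct pinned name __attribute__((cleanup(target_unlock))) = { 0 }; \
	target_lock((t), &name)

static int operate(struct target *t)
{
	LOCK_TARGET(p, t);
	if (IS_ERR(p.parent))
		return (int)PTR_ERR(p.parent);
	return p.parent->locked;	/* evaluated before the cleanup runs */
}
```

Because the macro expands to a declaration followed by a statement, it can only be used where a declaration is allowed, and everything after it in the enclosing block runs with the lock held.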
* [PATCH v3 29/65] do_move_mount(): use the parent mount returned by do_lock_mount() 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (27 preceding siblings ...) 2025-09-03 4:54 ` [PATCH v3 28/65] change calling conventions for lock_mount() et.al Al Viro @ 2025-09-03 4:54 ` Al Viro 2025-09-03 4:54 ` [PATCH v3 30/65] do_add_mount(): switch to passing pinned_mountpoint instead of mountpoint + path Al Viro ` (45 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:54 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds After successful do_lock_mount() call, mp.parent is set to either real_mount(path->mnt) (for !beneath case) or to ->mnt_parent of that (for beneath). p is set to real_mount(path->mnt) and after several uses it's made equal to mp.parent. All uses prior to that care only about p->mnt_ns and since p->mnt_ns == parent->mnt_ns, we might as well use mp.parent all along. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 17 ++++++----------- 1 file changed, 6 insertions(+), 11 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 952e66bdb9bb..d57e727962da 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3571,7 +3571,6 @@ static inline bool may_use_mount(struct mount *mnt) static int do_move_mount(struct path *old_path, struct path *new_path, enum mnt_tree_flags_t flags) { - struct mount *p; struct mount *old = real_mount(old_path->mnt); int err; bool beneath = flags & MNT_TREE_BENEATH; @@ -3586,8 +3585,6 @@ static int do_move_mount(struct path *old_path, if (IS_ERR(mp.parent)) return PTR_ERR(mp.parent); - p = real_mount(new_path->mnt); - if (check_mnt(old)) { /* if the source is in our namespace... */ /* ... 
it should be detachable from parent */ @@ -3597,7 +3594,7 @@ static int do_move_mount(struct path *old_path, if (IS_MNT_SHARED(old->mnt_parent)) return -EINVAL; /* ... and the target should be in our namespace */ - if (!check_mnt(p)) + if (!check_mnt(mp.parent)) return -EINVAL; } else { /* @@ -3610,13 +3607,13 @@ static int do_move_mount(struct path *old_path, * subsequent checks would've rejected that, but they lose * some corner cases if we check it early. */ - if (old->mnt_ns == p->mnt_ns) + if (old->mnt_ns == mp.parent->mnt_ns) return -EINVAL; /* * Target should be either in our namespace or in an acceptable * anon namespace, sensu check_anonymous_mnt(). */ - if (!may_use_mount(p)) + if (!may_use_mount(mp.parent)) return -EINVAL; } @@ -3624,22 +3621,20 @@ static int do_move_mount(struct path *old_path, err = can_move_mount_beneath(old, new_path, mp.mp); if (err) return err; - - p = p->mnt_parent; } /* * Don't move a mount tree containing unbindable mounts to a destination * mount which is shared. */ - if (IS_MNT_SHARED(p) && tree_contains_unbindable(old)) + if (IS_MNT_SHARED(mp.parent) && tree_contains_unbindable(old)) return -EINVAL; if (!check_for_nsfs_mounts(old)) return -ELOOP; - if (mount_is_ancestor(old, p)) + if (mount_is_ancestor(old, mp.parent)) return -ELOOP; - return attach_recursive_mnt(old, p, mp.mp); + return attach_recursive_mnt(old, mp.parent, mp.mp); } static int do_move_mount_old(struct path *path, const char *old_name) -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v3 30/65] do_add_mount(): switch to passing pinned_mountpoint instead of mountpoint + path 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (28 preceding siblings ...) 2025-09-03 4:54 ` [PATCH v3 29/65] do_move_mount(): use the parent mount returned by do_lock_mount() Al Viro @ 2025-09-03 4:54 ` Al Viro 2025-09-03 4:54 ` [PATCH v3 31/65] graft_tree(), attach_recursive_mnt() - pass pinned_mountpoint Al Viro ` (44 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:54 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Both callers pass it a mountpoint reference picked from pinned_mountpoint and path it corresponds to. First of all, path->dentry is equal to mp.mp->m_dentry. Furthermore, path->mnt is &mp.parent->mnt, making struct path contents redundant. Pass it the address of that pinned_mountpoint instead; what's more, if we teach it to treat ERR_PTR(error) in ->parent as "bail out with that error" we can simplify the callers even more - do_add_mount() will do the right thing even when called after lock_mount() failure. 
Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 32 +++++++++++++++----------------- 1 file changed, 15 insertions(+), 17 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index d57e727962da..b236536bbbc9 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3657,10 +3657,13 @@ static int do_move_mount_old(struct path *path, const char *old_name) /* * add a mount into a namespace's mount tree */ -static int do_add_mount(struct mount *newmnt, struct mountpoint *mp, - const struct path *path, int mnt_flags) +static int do_add_mount(struct mount *newmnt, const struct pinned_mountpoint *mp, + int mnt_flags) { - struct mount *parent = real_mount(path->mnt); + struct mount *parent = mp->parent; + + if (IS_ERR(parent)) + return PTR_ERR(parent); mnt_flags &= ~MNT_INTERNAL_FLAGS; @@ -3674,14 +3677,15 @@ static int do_add_mount(struct mount *newmnt, struct mountpoint *mp, } /* Refuse the same filesystem on the same mount point */ - if (path->mnt->mnt_sb == newmnt->mnt.mnt_sb && path_mounted(path)) + if (parent->mnt.mnt_sb == newmnt->mnt.mnt_sb && + parent->mnt.mnt_root == mp->mp->m_dentry) return -EBUSY; if (d_is_symlink(newmnt->mnt.mnt_root)) return -EINVAL; newmnt->mnt.mnt_flags = mnt_flags; - return graft_tree(newmnt, parent, mp); + return graft_tree(newmnt, parent, mp->mp); } static bool mount_too_revealing(const struct super_block *sb, int *new_mnt_flags); @@ -3711,14 +3715,9 @@ static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint, mnt_warn_timestamp_expiry(mountpoint, mnt); LOCK_MOUNT(mp, mountpoint); - if (IS_ERR(mp.parent)) { - return PTR_ERR(mp.parent); - } else { - error = do_add_mount(real_mount(mnt), mp.mp, - mountpoint, mnt_flags); - if (!error) - retain_and_null_ptr(mnt); // consumed on success - } + error = do_add_mount(real_mount(mnt), &mp, mnt_flags); + if (!error) + retain_and_null_ptr(mnt); // consumed on success return error; } @@ -3824,11 +3823,10 @@ int 
finish_automount(struct vfsmount *__m, const struct path *path) * got", not "try to mount it on top". */ LOCK_MOUNT_EXACT(mp, path); - if (IS_ERR(mp.parent)) - return mp.parent == ERR_PTR(-EBUSY) ? 0 : PTR_ERR(mp.parent); + if (mp.parent == ERR_PTR(-EBUSY)) + return 0; - err = do_add_mount(mnt, mp.mp, path, - path->mnt->mnt_flags | MNT_SHRINKABLE); + err = do_add_mount(mnt, &mp, path->mnt->mnt_flags | MNT_SHRINKABLE); if (likely(!err)) retain_and_null_ptr(m); return err; -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
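For anyone reading along without the tree at hand, the "ERR_PTR(error) in ->parent means bail out with that error" convention from 30/65 can be sketched in user space roughly as below. This is only an illustration: err_ptr()/is_err()/ptr_err() are local stand-ins for the kernel's ERR_PTR()/IS_ERR()/PTR_ERR(), and toy_mount, toy_pinned, toy_lock() and toy_add() are hypothetical names, not anything in fs/namespace.c.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* User-space stand-ins for the kernel's ERR_PTR()/IS_ERR()/PTR_ERR(). */
#define MAX_ERRNO 4095

static inline void *err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

static inline int is_err(const void *p)
{
	/* the top MAX_ERRNO addresses are reserved for encoded errors */
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

static inline long ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

/* Hypothetical miniature of struct mount / struct pinned_mountpoint. */
struct toy_mount {
	int id;
};

struct toy_pinned {
	struct toy_mount *parent;	/* valid mount, or err_ptr(-E...) */
};

/*
 * Hypothetical lock step: it returns nothing and records a failure
 * in ->parent, the way the lock_mount() family reports errors.
 */
static void toy_lock(struct toy_pinned *res, struct toy_mount *target,
		     int fail_with)
{
	res->parent = fail_with ? err_ptr(fail_with) : target;
}

/*
 * The callee checks ->parent itself, so a caller may invoke it right
 * after a failed toy_lock() and simply propagate the result.
 */
static int toy_add(const struct toy_pinned *mp)
{
	if (is_err(mp->parent))
		return (int)ptr_err(mp->parent);
	/* ... the real work would go here ... */
	return 0;
}
```

With that shape the caller collapses to "lock, call, return the result", which is what do_new_mount_fc() looks like after the patch. A caller that wants to treat one particular error specially (as finish_automount() does with -EBUSY) still has to test for it before the call.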
* [PATCH v3 31/65] graft_tree(), attach_recursive_mnt() - pass pinned_mountpoint 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (29 preceding siblings ...) 2025-09-03 4:54 ` [PATCH v3 30/65] do_add_mount(): switch to passing pinned_mountpoint instead of mountpoint + path Al Viro @ 2025-09-03 4:54 ` Al Viro 2025-09-03 4:54 ` [PATCH v3 32/65] pivot_root(2): use old_mp.mp->m_dentry instead of old.dentry Al Viro ` (43 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:54 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds parent and mountpoint always come from the same struct pinned_mountpoint now. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index b236536bbbc9..18d6ad0f4f76 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2549,8 +2549,7 @@ enum mnt_tree_flags_t { /** * attach_recursive_mnt - attach a source mount tree * @source_mnt: mount tree to be attached - * @dest_mnt: mount that @source_mnt will be mounted on - * @dest_mp: the mountpoint @source_mnt will be mounted at + * @dest: the context for mounting at the place where the tree should go * * NOTE: in the table below explains the semantics when a source mount * of a given type is attached to a destination mount of a given type. @@ -2613,10 +2612,11 @@ enum mnt_tree_flags_t { * Otherwise a negative error code is returned. 
*/ static int attach_recursive_mnt(struct mount *source_mnt, - struct mount *dest_mnt, - struct mountpoint *dest_mp) + const struct pinned_mountpoint *dest) { struct user_namespace *user_ns = current->nsproxy->mnt_ns->user_ns; + struct mount *dest_mnt = dest->parent; + struct mountpoint *dest_mp = dest->mp; HLIST_HEAD(tree_list); struct mnt_namespace *ns = dest_mnt->mnt_ns; struct pinned_mountpoint root = {}; @@ -2864,16 +2864,16 @@ static inline void unlock_mount(struct pinned_mountpoint *m) struct pinned_mountpoint mp __cleanup(unlock_mount) = {}; \ lock_mount_exact((path), &mp) -static int graft_tree(struct mount *mnt, struct mount *p, struct mountpoint *mp) +static int graft_tree(struct mount *mnt, const struct pinned_mountpoint *mp) { if (mnt->mnt.mnt_sb->s_flags & SB_NOUSER) return -EINVAL; - if (d_is_dir(mp->m_dentry) != + if (d_is_dir(mp->mp->m_dentry) != d_is_dir(mnt->mnt.mnt_root)) return -ENOTDIR; - return attach_recursive_mnt(mnt, p, mp); + return attach_recursive_mnt(mnt, mp); } static int may_change_propagation(const struct mount *m) @@ -3055,7 +3055,7 @@ static int do_loopback(struct path *path, const char *old_name, if (IS_ERR(mnt)) return PTR_ERR(mnt); - err = graft_tree(mnt, mp.parent, mp.mp); + err = graft_tree(mnt, &mp); if (err) { lock_mount_hash(); umount_tree(mnt, UMOUNT_SYNC); @@ -3634,7 +3634,7 @@ static int do_move_mount(struct path *old_path, if (mount_is_ancestor(old, mp.parent)) return -ELOOP; - return attach_recursive_mnt(old, mp.parent, mp.mp); + return attach_recursive_mnt(old, &mp); } static int do_move_mount_old(struct path *path, const char *old_name) @@ -3685,7 +3685,7 @@ static int do_add_mount(struct mount *newmnt, const struct pinned_mountpoint *mp return -EINVAL; newmnt->mnt.mnt_flags = mnt_flags; - return graft_tree(newmnt, parent, mp->mp); + return graft_tree(newmnt, mp); } static bool mount_too_revealing(const struct super_block *sb, int *new_mnt_flags); -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages 
in thread
* [PATCH v3 32/65] pivot_root(2): use old_mp.mp->m_dentry instead of old.dentry 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (30 preceding siblings ...) 2025-09-03 4:54 ` [PATCH v3 31/65] graft_tree(), attach_recursive_mnt() - pass pinned_mountpoint Al Viro @ 2025-09-03 4:54 ` Al Viro 2025-09-03 4:54 ` [PATCH v3 33/65] don't bother passing new_path->dentry to can_move_mount_beneath() Al Viro ` (42 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:54 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds That kills the last place where callers of lock_mount(path, &mp) used path->dentry. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/namespace.c b/fs/namespace.c index 18d6ad0f4f76..02bc5294071a 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -4675,7 +4675,7 @@ SYSCALL_DEFINE2(pivot_root, const char __user *, new_root, if (!mnt_has_parent(new_mnt)) return -EINVAL; /* absolute root */ /* make sure we can reach put_old from new_root */ - if (!is_path_reachable(old_mnt, old.dentry, &new)) + if (!is_path_reachable(old_mnt, old_mp.mp->m_dentry, &new)) return -EINVAL; /* make certain new is below the root */ if (!is_path_reachable(new_mnt, new.dentry, &root)) -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v3 33/65] don't bother passing new_path->dentry to can_move_mount_beneath() 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (31 preceding siblings ...) 2025-09-03 4:54 ` [PATCH v3 32/65] pivot_root(2): use old_mp.mp->m_dentry instead of old.dentry Al Viro @ 2025-09-03 4:54 ` Al Viro 2025-09-03 4:54 ` [PATCH v3 34/65] new helper: topmost_overmount() Al Viro ` (41 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:54 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 02bc5294071a..b81677a4232f 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3450,8 +3450,8 @@ static bool mount_is_ancestor(const struct mount *p1, const struct mount *p2) /** * can_move_mount_beneath - check that we can mount beneath the top mount * @mnt_from: mount we are trying to move - * @to: mount under which to mount - * @mp: mountpoint of @to + * @mnt_to: mount under which to mount + * @mp: mountpoint of @mnt_to * * - Make sure that nothing can be mounted beneath the caller's current * root or the rootfs of the namespace. @@ -3467,11 +3467,10 @@ static bool mount_is_ancestor(const struct mount *p1, const struct mount *p2) * Return: On success 0, and on error a negative error code is returned. 
*/ static int can_move_mount_beneath(struct mount *mnt_from, - const struct path *to, + struct mount *mnt_to, const struct mountpoint *mp) { - struct mount *mnt_to = real_mount(to->mnt), - *parent_mnt_to = mnt_to->mnt_parent; + struct mount *parent_mnt_to = mnt_to->mnt_parent; if (IS_MNT_LOCKED(mnt_to)) return -EINVAL; @@ -3618,7 +3617,9 @@ static int do_move_mount(struct path *old_path, } if (beneath) { - err = can_move_mount_beneath(old, new_path, mp.mp); + struct mount *over = real_mount(new_path->mnt); + + err = can_move_mount_beneath(old, over, mp.mp); if (err) return err; } -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v3 34/65] new helper: topmost_overmount() 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (32 preceding siblings ...) 2025-09-03 4:54 ` [PATCH v3 33/65] don't bother passing new_path->dentry to can_move_mount_beneath() Al Viro @ 2025-09-03 4:54 ` Al Viro 2025-09-03 4:54 ` [PATCH v3 35/65] do_lock_mount(): don't modify path Al Viro ` (40 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:54 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Returns the final (topmost) mount in the chain of overmounts starting at given mount. Same locking rules as for any mount tree traversal - either the spinlock side of mount_lock, or rcu + sample the seqcount side of mount_lock before the call and recheck afterwards. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/mount.h | 7 +++++++ fs/namespace.c | 9 +++------ 2 files changed, 10 insertions(+), 6 deletions(-) diff --git a/fs/mount.h b/fs/mount.h index ed8c83ba836a..04d0eadc4c10 100644 --- a/fs/mount.h +++ b/fs/mount.h @@ -235,4 +235,11 @@ static inline void mnt_notify_add(struct mount *m) } #endif +static inline struct mount *topmost_overmount(struct mount *m) +{ + while (m->overmount) + m = m->overmount; + return m; +} + struct mnt_namespace *mnt_ns_from_dentry(struct dentry *dentry); diff --git a/fs/namespace.c b/fs/namespace.c index b81677a4232f..23ef2e56808b 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2696,10 +2696,9 @@ static int attach_recursive_mnt(struct mount *source_mnt, child->mnt_mountpoint); commit_tree(child); if (q) { + struct mount *r = topmost_overmount(child); struct mountpoint *mp = root.mp; - struct mount *r = child; - while (unlikely(r->overmount)) - r = r->overmount; + if (unlikely(shorter) && child != source_mnt) mp = shorter; mnt_change_mountpoint(r, mp, q); @@ -6173,9 +6172,7 @@ bool current_chrooted(void) 
guard(mount_locked_reader)(); - root = current->nsproxy->mnt_ns->root; - while (unlikely(root->overmount)) - root = root->overmount; + root = topmost_overmount(current->nsproxy->mnt_ns->root); return fs_root.mnt != &root->mnt || !path_mounted(&fs_root); } -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
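A stand-alone sketch of what the helper in 34/65 computes, for illustration only. struct toy_mount and toy_topmost_overmount() are hypothetical user-space names; the only thing taken from the patch is the loop itself. Locking is deliberately not modelled here, so none of the mount_lock rules from the commit message are exercised.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct mount; only the overmount link is kept. */
struct toy_mount {
	const char *name;
	struct toy_mount *overmount;	/* mount sitting on our root, if any */
};

/* Follow the chain of overmounts from m to the last (topmost) mount. */
static struct toy_mount *toy_topmost_overmount(struct toy_mount *m)
{
	while (m->overmount)
		m = m->overmount;
	return m;
}
```

For a pile a <- b <- c (b mounted on the root of a, c on the root of b) the call returns c when started from either a or b, and a mount with nothing on top of it is returned unchanged.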
* [PATCH v3 35/65] do_lock_mount(): don't modify path. 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (33 preceding siblings ...) 2025-09-03 4:54 ` [PATCH v3 34/65] new helper: topmost_overmount() Al Viro @ 2025-09-03 4:54 ` Al Viro 2025-09-03 4:54 ` [PATCH v3 36/65] constify check_mnt() Al Viro ` (39 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:54 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Currently do_lock_mount() has the target path switched to whatever might be overmounting it. We _do_ want to have the parent mount/mountpoint chosen on top of the overmounting pile; however, the way it's done has unpleasant races - if umount propagation removes the overmount while we'd been trying to set the environment up, we might end up failing if our target path strays into that overmount just before the overmount gets kicked out. Users of do_lock_mount() do not need the target path changed - they have all information in res->{parent,mp}; only one place (in do_move_mount()) currently uses the resulting path->mnt, and that value is trivial to reconstruct by the original value of path->mnt + chosen parent mount. 
Let's keep the target path unchanged; it avoids a bunch of subtle races and it's not hard to do:

	do
		as mount_locked_reader
			find the prospective parent mount/mountpoint dentry
			grab references if it's not the original target
		lock the prospective mountpoint dentry
		take namespace_sem exclusive
		if prospective parent/mountpoint would be different now
			err = -EAGAIN
		else if location has been unmounted
			err = -ENOENT
		else if mountpoint dentry is not allowed to be mounted on
			err = -ENOENT
		else if beneath and the top of the pile was the absolute root
			err = -EINVAL
		else
			try to get struct mountpoint (by dentry), set
			err to 0 on success and -ENO{MEM,ENT} on failure
		if err != 0
			res->parent = ERR_PTR(err)
			drop locks
		else
			res->parent = prospective parent
		drop temporary references
	while err == -EAGAIN

A somewhat subtle part is that dropping temporary references is allowed. Neither mounts nor dentries should be evicted by a thread that holds namespace_sem. On success we are dropping those references under namespace_sem, so we need to be sure that these are not the last references remaining. However, on success we'd already verified (under namespace_sem) that original target is still mounted and that mount and dentry we are about to drop are still reachable from it via the mount tree. That guarantees that we are not about to drop the last remaining references.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 122 ++++++++++++++++++++++++++----------------------- 1 file changed, 65 insertions(+), 57 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 23ef2e56808b..c2e074f66bd1 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2727,6 +2727,27 @@ static int attach_recursive_mnt(struct mount *source_mnt, return err; } +static inline struct mount *where_to_mount(const struct path *path, + struct dentry **dentry, + bool beneath) +{ + struct mount *m; + + if (unlikely(beneath)) { + m = topmost_overmount(real_mount(path->mnt)); + *dentry = m->mnt_mountpoint; + return m->mnt_parent; + } + m = __lookup_mnt(path->mnt, path->dentry); + if (unlikely(m)) { + m = topmost_overmount(m); + *dentry = m->mnt.mnt_root; + return m; + } + *dentry = path->dentry; + return real_mount(path->mnt); +} + /** * do_lock_mount - acquire environment for mounting * @path: target path @@ -2758,84 +2779,69 @@ static int attach_recursive_mnt(struct mount *source_mnt, * case we also require the location to be at the root of a mount * that has a parent (i.e. is not a root of some namespace). 
*/ -static void do_lock_mount(struct path *path, struct pinned_mountpoint *res, bool beneath) +static void do_lock_mount(const struct path *path, + struct pinned_mountpoint *res, + bool beneath) { - struct vfsmount *mnt = path->mnt; - struct dentry *dentry; - struct path under = {}; - int err = -ENOENT; + int err; if (unlikely(beneath) && !path_mounted(path)) { res->parent = ERR_PTR(-EINVAL); return; } - for (;;) { - struct mount *m = real_mount(mnt); - - if (beneath) { - path_put(&under); - read_seqlock_excl(&mount_lock); - if (unlikely(!mnt_has_parent(m))) { - read_sequnlock_excl(&mount_lock); - res->parent = ERR_PTR(-EINVAL); - return; + do { + struct dentry *dentry, *d; + struct mount *m, *n; + + scoped_guard(mount_locked_reader) { + m = where_to_mount(path, &dentry, beneath); + if (&m->mnt != path->mnt) { + mntget(&m->mnt); + dget(dentry); } - under.mnt = mntget(&m->mnt_parent->mnt); - under.dentry = dget(m->mnt_mountpoint); - read_sequnlock_excl(&mount_lock); - dentry = under.dentry; - } else { - dentry = path->dentry; } inode_lock(dentry->d_inode); namespace_lock(); - if (unlikely(cant_mount(dentry) || !is_mounted(mnt))) - break; // not to be mounted on + // check if the chain of mounts (if any) has changed. 
+ scoped_guard(mount_locked_reader) + n = where_to_mount(path, &d, beneath); - if (beneath && unlikely(m->mnt_mountpoint != dentry || - &m->mnt_parent->mnt != under.mnt)) { - namespace_unlock(); - inode_unlock(dentry->d_inode); - continue; // got moved - } + if (unlikely(n != m || dentry != d)) + err = -EAGAIN; // something moved, retry + else if (unlikely(cant_mount(dentry) || !is_mounted(path->mnt))) + err = -ENOENT; // not to be mounted on + else if (beneath && &m->mnt == path->mnt && !m->overmount) + err = -EINVAL; + else + err = get_mountpoint(dentry, res); - mnt = lookup_mnt(path); - if (unlikely(mnt)) { + if (unlikely(err)) { + res->parent = ERR_PTR(err); namespace_unlock(); inode_unlock(dentry->d_inode); - path_put(path); - path->mnt = mnt; - path->dentry = dget(mnt->mnt_root); - continue; // got overmounted + } else { + res->parent = m; } - err = get_mountpoint(dentry, res); - if (err) - break; - if (beneath) { - /* - * @under duplicates the references that will stay - * at least until namespace_unlock(), so the path_put() - * below is safe (and OK to do under namespace_lock - - * we are not dropping the final references here). - */ - path_put(&under); - res->parent = real_mount(path->mnt)->mnt_parent; - return; + /* + * Drop the temporary references. This is subtle - on success + * we are doing that under namespace_sem, which would normally + * be forbidden. However, in that case we are guaranteed that + * refcounts won't reach zero, since we know that path->mnt + * is mounted and thus all mounts reachable from it are pinned + * and stable, along with their mountpoints and roots. 
+ */ + if (&m->mnt != path->mnt) { + dput(dentry); + mntput(&m->mnt); } - res->parent = real_mount(path->mnt); - return; - } - namespace_unlock(); - inode_unlock(dentry->d_inode); - if (beneath) - path_put(&under); - res->parent = ERR_PTR(err); + } while (err == -EAGAIN); } -static inline void lock_mount(struct path *path, struct pinned_mountpoint *m) +static inline void lock_mount(const struct path *path, + struct pinned_mountpoint *m) { do_lock_mount(path, m, false); } @@ -3618,6 +3624,8 @@ static int do_move_mount(struct path *old_path, if (beneath) { struct mount *over = real_mount(new_path->mnt); + if (mp.parent != over->mnt_parent) + over = mp.parent->overmount; err = can_move_mount_beneath(old, over, mp.mp); if (err) return err; -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
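The control flow of the new do_lock_mount() in 35/65 (sample without the sleeping locks, take the locks, sample again, retry if the two disagree) can be sketched as a single-threaded toy like the one below. Everything in it is an assumption made for illustration: toy_state, toy_sample(), toy_race_window() and toy_lock_mount() are hypothetical, the "locks" are a plain flag, and the concurrent change is simulated by a counter rather than by another thread. Unlike the real function, which returns void and stores the error in res->parent, the toy returns the error directly.

```c
#include <errno.h>
#include <stddef.h>

struct toy_state {
	int top;		/* stands in for the current top of the pile */
	int mounted;		/* nonzero while the target is still mounted */
	int pending_moves;	/* simulated concurrent changes still to fire */
	int locked;		/* stands in for inode lock + namespace_sem */
};

/* Stands in for where_to_mount() called as mount_locked_reader. */
static int toy_sample(const struct toy_state *s)
{
	return s->top;
}

/* Simulates somebody else changing the pile before we get the locks. */
static void toy_race_window(struct toy_state *s)
{
	if (s->pending_moves > 0) {
		s->pending_moves--;
		s->top++;
	}
}

/*
 * Returns 0 with the "locks" held and *parent set, or a negative
 * error with the "locks" dropped.  *attempts counts loop iterations.
 */
static int toy_lock_mount(struct toy_state *s, int *parent, int *attempts)
{
	int err;

	*attempts = 0;
	do {
		int before = toy_sample(s);	/* first, unlocked sample */

		toy_race_window(s);		/* things may change here */
		s->locked = 1;			/* take the locks */
		(*attempts)++;

		if (toy_sample(s) != before)
			err = -EAGAIN;		/* something moved, retry */
		else if (!s->mounted)
			err = -ENOENT;		/* not to be mounted on */
		else
			err = 0;

		if (err)
			s->locked = 0;		/* failure: drop the locks */
		else
			*parent = before;	/* success: locks stay held */
	} while (err == -EAGAIN);

	return err;
}
```

The point of the shape is that only -EAGAIN loops; any other error leaves the loop with the locks already dropped, and success leaves it with the locks held and the sampled value known to be current. The reference counting that makes the first sample safe to use after mount_lock is released is not modelled at all.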
* [PATCH v3 36/65] constify check_mnt() 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (34 preceding siblings ...) 2025-09-03 4:54 ` [PATCH v3 35/65] do_lock_mount(): don't modify path Al Viro @ 2025-09-03 4:54 ` Al Viro 2025-09-03 4:54 ` [PATCH v3 37/65] do_mount_setattr(): constify path argument Al Viro ` (38 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:54 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/namespace.c b/fs/namespace.c index c2e074f66bd1..511e49fd7c27 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -1010,7 +1010,7 @@ static void unpin_mountpoint(struct pinned_mountpoint *m) } } -static inline int check_mnt(struct mount *mnt) +static inline int check_mnt(const struct mount *mnt) { return mnt->mnt_ns == current->nsproxy->mnt_ns; } -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v3 37/65] do_mount_setattr(): constify path argument 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (35 preceding siblings ...) 2025-09-03 4:54 ` [PATCH v3 36/65] constify check_mnt() Al Viro @ 2025-09-03 4:54 ` Al Viro 2025-09-03 4:55 ` [PATCH v3 38/65] do_set_group(): constify path arguments Al Viro ` (37 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:54 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/namespace.c b/fs/namespace.c index 511e49fd7c27..f74a0523194a 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -4865,7 +4865,7 @@ static void mount_setattr_commit(struct mount_kattr *kattr, struct mount *mnt) touch_mnt_namespace(mnt->mnt_ns); } -static int do_mount_setattr(struct path *path, struct mount_kattr *kattr) +static int do_mount_setattr(const struct path *path, struct mount_kattr *kattr) { struct mount *mnt = real_mount(path->mnt); int err = 0; -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v3 38/65] do_set_group(): constify path arguments 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (36 preceding siblings ...) 2025-09-03 4:54 ` [PATCH v3 37/65] do_mount_setattr(): constify path argument Al Viro @ 2025-09-03 4:55 ` Al Viro 2025-09-03 4:55 ` [PATCH v3 39/65] drop_collected_paths(): constify arguments Al Viro ` (36 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:55 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/namespace.c b/fs/namespace.c index f74a0523194a..7da3a589c775 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3359,7 +3359,7 @@ static inline int tree_contains_unbindable(struct mount *mnt) return 0; } -static int do_set_group(struct path *from_path, struct path *to_path) +static int do_set_group(const struct path *from_path, const struct path *to_path) { struct mount *from = real_mount(from_path->mnt); struct mount *to = real_mount(to_path->mnt); -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v3 39/65] drop_collected_paths(): constify arguments 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (37 preceding siblings ...) 2025-09-03 4:55 ` [PATCH v3 38/65] do_set_group(): constify path arguments Al Viro @ 2025-09-03 4:55 ` Al Viro 2025-09-03 4:55 ` [PATCH v3 40/65] collect_paths(): constify the return value Al Viro ` (35 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:55 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds ... and use that to constify the pointers in callers Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 4 ++-- include/linux/mount.h | 2 +- kernel/audit_tree.c | 12 ++++++------ 3 files changed, 9 insertions(+), 9 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 7da3a589c775..704eff14735d 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2334,9 +2334,9 @@ struct path *collect_paths(const struct path *path, return res; } -void drop_collected_paths(struct path *paths, struct path *prealloc) +void drop_collected_paths(const struct path *paths, struct path *prealloc) { - for (struct path *p = paths; p->mnt; p++) + for (const struct path *p = paths; p->mnt; p++) path_put(p); if (paths != prealloc) kfree(paths); diff --git a/include/linux/mount.h b/include/linux/mount.h index 5f9c053b0897..c09032463b36 100644 --- a/include/linux/mount.h +++ b/include/linux/mount.h @@ -105,7 +105,7 @@ extern int may_umount(struct vfsmount *); int do_mount(const char *, const char __user *, const char *, unsigned long, void *); extern struct path *collect_paths(const struct path *, struct path *, unsigned); -extern void drop_collected_paths(struct path *, struct path *); +extern void drop_collected_paths(const struct path *, struct path *); extern void kern_unmount_array(struct vfsmount *mnt[], unsigned int num); extern int cifs_root_data(char **dev, char 
**opts); diff --git a/kernel/audit_tree.c b/kernel/audit_tree.c index b0eae2a3c895..32007edf0e55 100644 --- a/kernel/audit_tree.c +++ b/kernel/audit_tree.c @@ -678,7 +678,7 @@ void audit_trim_trees(void) struct audit_tree *tree; struct path path; struct audit_node *node; - struct path *paths; + const struct path *paths; struct path array[16]; int err; @@ -701,7 +701,7 @@ void audit_trim_trees(void) struct audit_chunk *chunk = find_chunk(node); /* this could be NULL if the watch is dying else where... */ node->index |= 1U<<31; - for (struct path *p = paths; p->dentry; p++) { + for (const struct path *p = paths; p->dentry; p++) { struct inode *inode = p->dentry->d_inode; if (inode_to_key(inode) == chunk->key) { node->index &= ~(1U<<31); @@ -740,9 +740,9 @@ void audit_put_tree(struct audit_tree *tree) put_tree(tree); } -static int tag_mounts(struct path *paths, struct audit_tree *tree) +static int tag_mounts(const struct path *paths, struct audit_tree *tree) { - for (struct path *p = paths; p->dentry; p++) { + for (const struct path *p = paths; p->dentry; p++) { int err = tag_chunk(p->dentry->d_inode, tree); if (err) return err; @@ -805,7 +805,7 @@ int audit_add_tree_rule(struct audit_krule *rule) struct audit_tree *seed = rule->tree, *tree; struct path path; struct path array[16]; - struct path *paths; + const struct path *paths; int err; rule->tree = NULL; @@ -877,7 +877,7 @@ int audit_tag_tree(char *old, char *new) int failed = 0; struct path path1, path2; struct path array[16]; - struct path *paths; + const struct path *paths; int err; err = kern_path(new, 0, &path2); -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v3 40/65] collect_paths(): constify the return value 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (38 preceding siblings ...) 2025-09-03 4:55 ` [PATCH v3 39/65] drop_collected_paths(): constify arguments Al Viro @ 2025-09-03 4:55 ` Al Viro 2025-09-03 4:55 ` [PATCH v3 41/65] do_move_mount(), vfs_move_mount(), do_move_mount_old(): constify struct path argument(s) Al Viro ` (34 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:55 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds callers have no business modifying the paths they get Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 4 ++-- include/linux/mount.h | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 704eff14735d..759bfd24d1a0 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2300,7 +2300,7 @@ static inline bool extend_array(struct path **res, struct path **to_free, return p; } -struct path *collect_paths(const struct path *path, +const struct path *collect_paths(const struct path *path, struct path *prealloc, unsigned count) { struct mount *root = real_mount(path->mnt); @@ -2334,7 +2334,7 @@ struct path *collect_paths(const struct path *path, return res; } -void drop_collected_paths(const struct path *paths, struct path *prealloc) +void drop_collected_paths(const struct path *paths, const struct path *prealloc) { for (const struct path *p = paths; p->mnt; p++) path_put(p); diff --git a/include/linux/mount.h b/include/linux/mount.h index c09032463b36..18e4b97f8a98 100644 --- a/include/linux/mount.h +++ b/include/linux/mount.h @@ -104,8 +104,8 @@ extern int may_umount_tree(struct vfsmount *); extern int may_umount(struct vfsmount *); int do_mount(const char *, const char __user *, const char *, unsigned long, void *); -extern struct path 
*collect_paths(const struct path *, struct path *, unsigned); -extern void drop_collected_paths(const struct path *, struct path *); +extern const struct path *collect_paths(const struct path *, struct path *, unsigned); +extern void drop_collected_paths(const struct path *, const struct path *); extern void kern_unmount_array(struct vfsmount *mnt[], unsigned int num); extern int cifs_root_data(char **dev, char **opts); -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v3 41/65] do_move_mount(), vfs_move_mount(), do_move_mount_old(): constify struct path argument(s) 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (39 preceding siblings ...) 2025-09-03 4:55 ` [PATCH v3 40/65] collect_paths(): constify the return value Al Viro @ 2025-09-03 4:55 ` Al Viro 2025-09-03 4:55 ` [PATCH v3 42/65] mnt_warn_timestamp_expiry(): constify struct path argument Al Viro ` (33 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:55 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 759bfd24d1a0..dcaf50e920af 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3572,8 +3572,9 @@ static inline bool may_use_mount(struct mount *mnt) return check_anonymous_mnt(mnt); } -static int do_move_mount(struct path *old_path, - struct path *new_path, enum mnt_tree_flags_t flags) +static int do_move_mount(const struct path *old_path, + const struct path *new_path, + enum mnt_tree_flags_t flags) { struct mount *old = real_mount(old_path->mnt); int err; @@ -3645,7 +3646,7 @@ static int do_move_mount(struct path *old_path, return attach_recursive_mnt(old, &mp); } -static int do_move_mount_old(struct path *path, const char *old_name) +static int do_move_mount_old(const struct path *path, const char *old_name) { struct path old_path; int err; @@ -4475,7 +4476,8 @@ SYSCALL_DEFINE3(fsmount, int, fs_fd, unsigned int, flags, return ret; } -static inline int vfs_move_mount(struct path *from_path, struct path *to_path, +static inline int vfs_move_mount(const struct path *from_path, + const struct path *to_path, enum mnt_tree_flags_t mflags) { int ret; -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v3 42/65] mnt_warn_timestamp_expiry(): constify struct path argument 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (40 preceding siblings ...) 2025-09-03 4:55 ` [PATCH v3 41/65] do_move_mount(), vfs_move_mount(), do_move_mount_old(): constify struct path argument(s) Al Viro @ 2025-09-03 4:55 ` Al Viro 2025-09-03 4:55 ` [PATCH v3 43/65] do_new_mount{,_fc}(): " Al Viro ` (32 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:55 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/fs/namespace.c b/fs/namespace.c index dcaf50e920af..be3aecc5a9c0 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3230,7 +3230,8 @@ static void set_mount_attributes(struct mount *mnt, unsigned int mnt_flags) touch_mnt_namespace(mnt->mnt_ns); } -static void mnt_warn_timestamp_expiry(struct path *mountpoint, struct vfsmount *mnt) +static void mnt_warn_timestamp_expiry(const struct path *mountpoint, + struct vfsmount *mnt) { struct super_block *sb = mnt->mnt_sb; -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v3 43/65] do_new_mount{,_fc}(): constify struct path argument 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (41 preceding siblings ...) 2025-09-03 4:55 ` [PATCH v3 42/65] mnt_warn_timestamp_expiry(): constify struct path argument Al Viro @ 2025-09-03 4:55 ` Al Viro 2025-09-03 4:55 ` [PATCH v3 44/65] do_{loopback,change_type,remount,reconfigure_mnt}(): " Al Viro ` (31 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:55 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index be3aecc5a9c0..f3f26125444d 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3704,7 +3704,7 @@ static bool mount_too_revealing(const struct super_block *sb, int *new_mnt_flags * Create a new mount using a superblock configuration and request it * be added to the namespace tree. */ -static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint, +static int do_new_mount_fc(struct fs_context *fc, const struct path *mountpoint, unsigned int mnt_flags) { struct super_block *sb; @@ -3735,8 +3735,9 @@ static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint, * create a new mount for userspace and request it to be added into the * namespace's tree */ -static int do_new_mount(struct path *path, const char *fstype, int sb_flags, - int mnt_flags, const char *name, void *data) +static int do_new_mount(const struct path *path, const char *fstype, + int sb_flags, int mnt_flags, + const char *name, void *data) { struct file_system_type *type; struct fs_context *fc; -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v3 44/65] do_{loopback,change_type,remount,reconfigure_mnt}(): constify struct path argument 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (42 preceding siblings ...) 2025-09-03 4:55 ` [PATCH v3 43/65] do_new_mount{,_fc}(): " Al Viro @ 2025-09-03 4:55 ` Al Viro 2025-09-03 4:55 ` [PATCH v3 45/65] path_mount(): " Al Viro ` (30 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:55 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index f3f26125444d..894631bcbdbd 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2914,7 +2914,7 @@ static int flags_to_propagation_type(int ms_flags) /* * recursively change the type of the mountpoint. */ -static int do_change_type(struct path *path, int ms_flags) +static int do_change_type(const struct path *path, int ms_flags) { struct mount *m; struct mount *mnt = real_mount(path->mnt); @@ -3034,8 +3034,8 @@ static struct mount *__do_loopback(struct path *old_path, int recurse) /* * do loopback mount. */ -static int do_loopback(struct path *path, const char *old_name, - int recurse) +static int do_loopback(const struct path *path, const char *old_name, + int recurse) { struct path old_path __free(path_put) = {}; struct mount *mnt = NULL; @@ -3265,7 +3265,7 @@ static void mnt_warn_timestamp_expiry(const struct path *mountpoint, * superblock it refers to. This is triggered by specifying MS_REMOUNT|MS_BIND * to mount(2). 
*/ -static int do_reconfigure_mnt(struct path *path, unsigned int mnt_flags) +static int do_reconfigure_mnt(const struct path *path, unsigned int mnt_flags) { struct super_block *sb = path->mnt->mnt_sb; struct mount *mnt = real_mount(path->mnt); @@ -3302,7 +3302,7 @@ static int do_reconfigure_mnt(struct path *path, unsigned int mnt_flags) * If you've mounted a non-root directory somewhere and want to do remount * on it - tough luck. */ -static int do_remount(struct path *path, int ms_flags, int sb_flags, +static int do_remount(const struct path *path, int ms_flags, int sb_flags, int mnt_flags, void *data) { int err; -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v3 45/65] path_mount(): constify struct path argument 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (43 preceding siblings ...) 2025-09-03 4:55 ` [PATCH v3 44/65] do_{loopback,change_type,remount,reconfigure_mnt}(): " Al Viro @ 2025-09-03 4:55 ` Al Viro 2025-09-03 4:55 ` [PATCH v3 46/65] may_copy_tree(), __do_loopback(): " Al Viro ` (29 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:55 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds now it finally can be done. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/internal.h | 2 +- fs/namespace.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/internal.h b/fs/internal.h index 38e8aab27bbd..fe88563b4822 100644 --- a/fs/internal.h +++ b/fs/internal.h @@ -84,7 +84,7 @@ void mnt_put_write_access_file(struct file *file); extern void dissolve_on_fput(struct vfsmount *); extern bool may_mount(void); -int path_mount(const char *dev_name, struct path *path, +int path_mount(const char *dev_name, const struct path *path, const char *type_page, unsigned long flags, void *data_page); int path_umount(struct path *path, int flags); diff --git a/fs/namespace.c b/fs/namespace.c index 894631bcbdbd..3a9db3e84a92 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -4018,7 +4018,7 @@ static char *copy_mount_string(const void __user *data) * Therefore, if this magic number is present, it carries no information * and must be discarded. */ -int path_mount(const char *dev_name, struct path *path, +int path_mount(const char *dev_name, const struct path *path, const char *type_page, unsigned long flags, void *data_page) { unsigned int mnt_flags = 0, sb_flags; -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
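A note on why this one had to wait for 41/65..44/65: a caller can take a const pointer only once every callee it hands that pointer to takes const as well. The userspace sketch below illustrates that, assuming a made-up two-pointer structure (struct path_model, struct mnt_model and all function names are hypothetical, not kernel code). It also shows that const on the path does not make the mount it points to const, which is presumably why real_mount(path->mnt) keeps working unchanged in the converted functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature of struct path: two pointers, nothing else. */
struct mnt_model {
	int flags;
};

struct path_model {
	struct mnt_model *mnt;
	int *dentry;
};

/* Leaf helper: only reads the path, so it can take it as const. */
static int leaf_flags(const struct path_model *path)
{
	return path->mnt->flags;
}

/*
 * const applies to the two pointers stored in the path, not to what
 * they point at: path->mnt is still a 'struct mnt_model *', so the
 * mount itself may be modified through it.
 */
static void leaf_set_flag(const struct path_model *path, int flag)
{
	path->mnt->flags |= flag;
}

/*
 * Top-level entry point.  Taking const here is free of
 * discarded-qualifier diagnostics only because both callees
 * already take const.
 */
static int top_level(const struct path_model *path, int flag)
{
	leaf_set_flag(path, flag);
	return leaf_flags(path);
}
```

Dropping const from either leaf in this model makes the call from top_level() discard a qualifier, which is the same bottom-up ordering constraint the series follows.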
* [PATCH v3 46/65] may_copy_tree(), __do_loopback(): constify struct path argument 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (44 preceding siblings ...) 2025-09-03 4:55 ` [PATCH v3 45/65] path_mount(): " Al Viro @ 2025-09-03 4:55 ` Al Viro 2025-09-03 4:55 ` [PATCH v3 47/65] path_umount(): " Al Viro ` (28 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:55 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 3a9db3e84a92..4ed3d16534bb 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2990,7 +2990,7 @@ static int do_change_type(const struct path *path, int ms_flags) * * Returns true if the mount tree can be copied, false otherwise. */ -static inline bool may_copy_tree(struct path *path) +static inline bool may_copy_tree(const struct path *path) { struct mount *mnt = real_mount(path->mnt); const struct dentry_operations *d_op; @@ -3012,7 +3012,7 @@ static inline bool may_copy_tree(struct path *path) } -static struct mount *__do_loopback(struct path *old_path, int recurse) +static struct mount *__do_loopback(const struct path *old_path, int recurse) { struct mount *old = real_mount(old_path->mnt); -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v3 47/65] path_umount(): constify struct path argument 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (45 preceding siblings ...) 2025-09-03 4:55 ` [PATCH v3 46/65] may_copy_tree(), __do_loopback(): " Al Viro @ 2025-09-03 4:55 ` Al Viro 2025-09-03 4:55 ` [PATCH v3 48/65] constify can_move_mount_beneath() arguments Al Viro ` (27 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:55 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/internal.h | 2 +- fs/namespace.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/internal.h b/fs/internal.h index fe88563b4822..549e6bd453b0 100644 --- a/fs/internal.h +++ b/fs/internal.h @@ -86,7 +86,7 @@ extern bool may_mount(void); int path_mount(const char *dev_name, const struct path *path, const char *type_page, unsigned long flags, void *data_page); -int path_umount(struct path *path, int flags); +int path_umount(const struct path *path, int flags); int show_path(struct seq_file *m, struct dentry *root); diff --git a/fs/namespace.c b/fs/namespace.c index 4ed3d16534bb..20c409852f6d 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2084,7 +2084,7 @@ static int can_umount(const struct path *path, int flags) } // caller is responsible for flags being sane -int path_umount(struct path *path, int flags) +int path_umount(const struct path *path, int flags) { struct mount *mnt = real_mount(path->mnt); int ret; -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v3 48/65] constify can_move_mount_beneath() arguments 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (46 preceding siblings ...) 2025-09-03 4:55 ` [PATCH v3 47/65] path_umount(): " Al Viro @ 2025-09-03 4:55 ` Al Viro 2025-09-03 4:55 ` [PATCH v3 49/65] do_move_mount_old(): use __free(path_put) Al Viro ` (26 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:55 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 20c409852f6d..18229a6e045d 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3472,8 +3472,8 @@ static bool mount_is_ancestor(const struct mount *p1, const struct mount *p2) * Context: This function expects namespace_lock() to be held. * Return: On success 0, and on error a negative error code is returned. */ -static int can_move_mount_beneath(struct mount *mnt_from, - struct mount *mnt_to, +static int can_move_mount_beneath(const struct mount *mnt_from, + const struct mount *mnt_to, const struct mountpoint *mp) { struct mount *parent_mnt_to = mnt_to->mnt_parent; -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v3 49/65] do_move_mount_old(): use __free(path_put) 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (47 preceding siblings ...) 2025-09-03 4:55 ` [PATCH v3 48/65] constify can_move_mount_beneath() arguments Al Viro @ 2025-09-03 4:55 ` Al Viro 2025-09-03 4:55 ` [PATCH v3 50/65] do_mount(): " Al Viro ` (25 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:55 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 18229a6e045d..5372b71a8d7a 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3649,7 +3649,7 @@ static int do_move_mount(const struct path *old_path, static int do_move_mount_old(const struct path *path, const char *old_name) { - struct path old_path; + struct path old_path __free(path_put) = {}; int err; if (!old_name || !*old_name) @@ -3659,9 +3659,7 @@ static int do_move_mount_old(const struct path *path, const char *old_name) if (err) return err; - err = do_move_mount(&old_path, path, 0); - path_put(&old_path); - return err; + return do_move_mount(&old_path, path, 0); } /* -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
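For reviewers less used to cleanup.h: the conversion above relies on the compiler's cleanup attribute, which runs a handler when the variable goes out of scope, after the return value has been computed. A minimal userspace sketch of that pattern follows, assuming GCC or Clang; struct fake_path, fake_lookup(), fake_path_put() and the FREE_FAKE_PATH macro are hypothetical stand-ins, not the kernel's definitions. The point it illustrates is why the variable has to be initialized to an empty value: the handler also runs on the early-return path, before the lookup has filled anything in.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct path: one counted resource. */
struct fake_path {
	int *ref;	/* NULL means "nothing to drop" */
};

static int live_refs;	/* number of references currently held */

static int fake_lookup(const char *name, struct fake_path *p)
{
	if (!name || !*name)
		return -1;
	p->ref = malloc(sizeof(*p->ref));
	if (!p->ref)
		return -2;
	live_refs++;
	return 0;
}

/* Cleanup handler: must be a no-op on a zero-initialized object. */
static void fake_path_put(struct fake_path *p)
{
	if (p->ref) {
		free(p->ref);
		p->ref = NULL;
		live_refs--;
	}
}

/* Hypothetical macro playing the role of __free(path_put). */
#define FREE_FAKE_PATH __attribute__((cleanup(fake_path_put)))

static int consume(const struct fake_path *p)
{
	return p->ref ? 0 : -3;
}

static int move_old_style(const char *name)
{
	struct fake_path old FREE_FAKE_PATH = { NULL };
	int err;

	err = fake_lookup(name, &old);
	if (err)
		return err;	/* handler runs on the empty object: no-op */

	/* consume() is evaluated first, then the handler drops the ref */
	return consume(&old);
}
```

With that in place the explicit "err = ...; put; return err;" tail collapses into a single return, which is the shape of the change in this patch.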
* [PATCH v3 50/65] do_mount(): use __free(path_put) 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (48 preceding siblings ...) 2025-09-03 4:55 ` [PATCH v3 49/65] do_move_mount_old(): use __free(path_put) Al Viro @ 2025-09-03 4:55 ` Al Viro 2025-09-03 4:55 ` [PATCH v3 51/65] umount_tree(): take all victims out of propagation graph at once Al Viro ` (24 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:55 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 5372b71a8d7a..f977438b4d6e 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -4098,15 +4098,13 @@ int path_mount(const char *dev_name, const struct path *path, int do_mount(const char *dev_name, const char __user *dir_name, const char *type_page, unsigned long flags, void *data_page) { - struct path path; + struct path path __free(path_put) = {}; int ret; ret = user_path_at(AT_FDCWD, dir_name, LOOKUP_FOLLOW, &path); if (ret) return ret; - ret = path_mount(dev_name, &path, type_page, flags, data_page); - path_put(&path); - return ret; + return path_mount(dev_name, &path, type_page, flags, data_page); } static struct ucounts *inc_mnt_namespaces(struct user_namespace *ns) -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v3 51/65] umount_tree(): take all victims out of propagation graph at once 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (49 preceding siblings ...) 2025-09-03 4:55 ` [PATCH v3 50/65] do_mount(): " Al Viro @ 2025-09-03 4:55 ` Al Viro 2025-09-03 4:55 ` [PATCH v3 52/65] ecryptfs: get rid of pointless mount references in ecryptfs dentries Al Viro ` (23 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:55 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds For each removed mount we need to calculate where the slaves will end up. To avoid duplicating that work, do it for all mounts to be removed at once, taking the mounts themselves out of propagation graph as we go, then do all transfers; the duplicate work on finding destinations is avoided since if we run into a mount that already had destination found, we don't need to trace the rest of the way. That's guaranteed O(removed mounts) for finding destinations and removing from propagation graph and O(surviving mounts that have master removed) for transfers. 
Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 3 ++- fs/pnode.c | 67 +++++++++++++++++++++++++++++++++++++++----------- fs/pnode.h | 1 + 3 files changed, 55 insertions(+), 16 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index f977438b4d6e..0900fd7456a9 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -1846,6 +1846,8 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how) if (how & UMOUNT_PROPAGATE) propagate_umount(&tmp_list); + bulk_make_private(&tmp_list); + while (!list_empty(&tmp_list)) { struct mnt_namespace *ns; bool disconnect; @@ -1870,7 +1872,6 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how) umount_mnt(p); } } - change_mnt_propagation(p, MS_PRIVATE); if (disconnect) hlist_add_head(&p->mnt_umount, &unmounted); diff --git a/fs/pnode.c b/fs/pnode.c index edaf9d9d0eaf..5d91c3e58d2a 100644 --- a/fs/pnode.c +++ b/fs/pnode.c @@ -71,19 +71,6 @@ static inline bool will_be_unmounted(struct mount *m) return m->mnt.mnt_flags & MNT_UMOUNT; } -static struct mount *propagation_source(struct mount *mnt) -{ - do { - struct mount *m; - for (m = next_peer(mnt); m != mnt; m = next_peer(m)) { - if (!will_be_unmounted(m)) - return m; - } - mnt = mnt->mnt_master; - } while (mnt && will_be_unmounted(mnt)); - return mnt; -} - static void transfer_propagation(struct mount *mnt, struct mount *to) { struct hlist_node *p = NULL, *n; @@ -112,11 +99,10 @@ void change_mnt_propagation(struct mount *mnt, int type) return; } if (IS_MNT_SHARED(mnt)) { - if (type == MS_SLAVE || !hlist_empty(&mnt->mnt_slave_list)) - m = propagation_source(mnt); if (list_empty(&mnt->mnt_share)) { mnt_release_group_id(mnt); } else { + m = next_peer(mnt); list_del_init(&mnt->mnt_share); mnt->mnt_group_id = 0; } @@ -137,6 +123,57 @@ void change_mnt_propagation(struct mount *mnt, int type) } } +static struct mount *trace_transfers(struct mount *m) +{ + while (1) { + struct mount 
*next = next_peer(m); + + if (next != m) { + list_del_init(&m->mnt_share); + m->mnt_group_id = 0; + m->mnt_master = next; + } else { + if (IS_MNT_SHARED(m)) + mnt_release_group_id(m); + next = m->mnt_master; + } + hlist_del_init(&m->mnt_slave); + CLEAR_MNT_SHARED(m); + SET_MNT_MARK(m); + + if (!next || !will_be_unmounted(next)) + return next; + if (IS_MNT_MARKED(next)) + return next->mnt_master; + m = next; + } +} + +static void set_destinations(struct mount *m, struct mount *master) +{ + struct mount *next; + + while ((next = m->mnt_master) != master) { + m->mnt_master = master; + m = next; + } +} + +void bulk_make_private(struct list_head *set) +{ + struct mount *m; + + list_for_each_entry(m, set, mnt_list) + if (!IS_MNT_MARKED(m)) + set_destinations(m, trace_transfers(m)); + + list_for_each_entry(m, set, mnt_list) { + transfer_propagation(m, m->mnt_master); + m->mnt_master = NULL; + CLEAR_MNT_MARK(m); + } +} + static struct mount *__propagation_next(struct mount *m, struct mount *origin) { diff --git a/fs/pnode.h b/fs/pnode.h index 00ab153e3e9d..b029db225f33 100644 --- a/fs/pnode.h +++ b/fs/pnode.h @@ -42,6 +42,7 @@ static inline bool peers(const struct mount *m1, const struct mount *m2) } void change_mnt_propagation(struct mount *, int); +void bulk_make_private(struct list_head *); int propagate_mnt(struct mount *, struct mountpoint *, struct mount *, struct hlist_head *); void propagate_umount(struct list_head *); -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
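The two-pass structure of bulk_make_private() may be easier to follow on a reduced model. The sketch below is a simplification under stated assumptions: it ignores peer groups entirely and keeps only the master chain, and struct node, trace(), set_destinations() and bulk_resolve() are illustrative names for a userspace model, not the kernel functions. It shows the part that gives the linear bound: each doomed node is marked the first time it is walked, and a later walk that runs into a marked node reuses the destination already stored in it instead of tracing the rest of the chain again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified mount: master pointer only, no peer groups. */
struct node {
	struct node *master;
	bool doomed;	/* will be unmounted */
	bool marked;	/* destination already traced */
};

/* Walk up from m until a survivor, NULL, or an already traced node. */
static struct node *trace(struct node *m)
{
	for (;;) {
		struct node *next = m->master;

		m->marked = true;
		if (!next || !next->doomed)
			return next;		/* reached a survivor */
		if (next->marked)
			return next->master;	/* reuse earlier result */
		m = next;
	}
}

/* Point every node on the freshly walked chain at dest. */
static void set_destinations(struct node *m, struct node *dest)
{
	struct node *next;

	while ((next = m->master) != dest) {
		m->master = dest;
		m = next;
	}
}

static void bulk_resolve(struct node **set, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (!set[i]->marked)
			set_destinations(set[i], trace(set[i]));
}

/* S <- A <- B <- C and B <- D, with A..D doomed; returns 1 on success. */
static int demo_all_point_to_survivor(void)
{
	struct node s = { NULL, false, false };
	struct node a = { &s, true, false };
	struct node b = { &a, true, false };
	struct node c = { &b, true, false };
	struct node d = { &b, true, false };
	struct node *set[] = { &c, &d, &a, &b };

	bulk_resolve(set, 4);
	return a.master == &s && b.master == &s &&
	       c.master == &s && d.master == &s;
}
```

In this model the walk from D stops after one step, because B was already marked and rewritten while C was being handled; the real code additionally has to pick a surviving peer where one exists and then transfer the slave lists in the second pass.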
* [PATCH v3 52/65] ecryptfs: get rid of pointless mount references in ecryptfs dentries 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (50 preceding siblings ...) 2025-09-03 4:55 ` [PATCH v3 51/65] umount_tree(): take all victims out of propagation graph at once Al Viro @ 2025-09-03 4:55 ` Al Viro 2025-09-03 4:55 ` [PATCH v3 53/65] fs/namespace.c: sanitize descriptions for {__,}lookup_mnt() Al Viro ` (22 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:55 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds ->lower_path.mnt has the same value for all dentries on given ecryptfs instance and if somebody goes for mountpoint-crossing variant where that would not be true, we can deal with that when it happens (and _not_ with duplicating these references into each dentry). As it is, we are better off just sticking a reference into ecryptfs-private part of superblock and keeping it pinned until ->kill_sb(). That way we can stick a reference to underlying dentry right into ->d_fsdata of ecryptfs one, getting rid of indirection through struct ecryptfs_dentry_info, along with the entire struct ecryptfs_dentry_info machinery. 
Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/ecryptfs/dentry.c | 14 +------------- fs/ecryptfs/ecryptfs_kernel.h | 27 +++++++++++---------------- fs/ecryptfs/file.c | 15 +++++++-------- fs/ecryptfs/inode.c | 19 +++++-------------- fs/ecryptfs/main.c | 24 ++++++------------------ 5 files changed, 30 insertions(+), 69 deletions(-) diff --git a/fs/ecryptfs/dentry.c b/fs/ecryptfs/dentry.c index 1dfd5b81d831..6648a924e31a 100644 --- a/fs/ecryptfs/dentry.c +++ b/fs/ecryptfs/dentry.c @@ -59,14 +59,6 @@ static int ecryptfs_d_revalidate(struct inode *dir, const struct qstr *name, return rc; } -struct kmem_cache *ecryptfs_dentry_info_cache; - -static void ecryptfs_dentry_free_rcu(struct rcu_head *head) -{ - kmem_cache_free(ecryptfs_dentry_info_cache, - container_of(head, struct ecryptfs_dentry_info, rcu)); -} - /** * ecryptfs_d_release * @dentry: The ecryptfs dentry @@ -75,11 +67,7 @@ static void ecryptfs_dentry_free_rcu(struct rcu_head *head) */ static void ecryptfs_d_release(struct dentry *dentry) { - struct ecryptfs_dentry_info *p = dentry->d_fsdata; - if (p) { - path_put(&p->lower_path); - call_rcu(&p->rcu, ecryptfs_dentry_free_rcu); - } + dput(dentry->d_fsdata); } const struct dentry_operations ecryptfs_dops = { diff --git a/fs/ecryptfs/ecryptfs_kernel.h b/fs/ecryptfs/ecryptfs_kernel.h index 1f562e75d0e4..9e6ab0b41337 100644 --- a/fs/ecryptfs/ecryptfs_kernel.h +++ b/fs/ecryptfs/ecryptfs_kernel.h @@ -258,13 +258,6 @@ struct ecryptfs_inode_info { struct ecryptfs_crypt_stat crypt_stat; }; -/* dentry private data. Each dentry must keep track of a lower - * vfsmount too. */ -struct ecryptfs_dentry_info { - struct path lower_path; - struct rcu_head rcu; -}; - /** * ecryptfs_global_auth_tok - A key used to encrypt all new files under the mountpoint * @flags: Status flags @@ -348,6 +341,7 @@ struct ecryptfs_mount_crypt_stat { /* superblock private data. 
*/ struct ecryptfs_sb_info { struct super_block *wsi_sb; + struct vfsmount *lower_mnt; struct ecryptfs_mount_crypt_stat mount_crypt_stat; }; @@ -494,22 +488,25 @@ ecryptfs_set_superblock_lower(struct super_block *sb, } static inline void -ecryptfs_set_dentry_private(struct dentry *dentry, - struct ecryptfs_dentry_info *dentry_info) +ecryptfs_set_dentry_lower(struct dentry *dentry, + struct dentry *lower_dentry) { - dentry->d_fsdata = dentry_info; + dentry->d_fsdata = lower_dentry; } static inline struct dentry * ecryptfs_dentry_to_lower(struct dentry *dentry) { - return ((struct ecryptfs_dentry_info *)dentry->d_fsdata)->lower_path.dentry; + return dentry->d_fsdata; } -static inline const struct path * -ecryptfs_dentry_to_lower_path(struct dentry *dentry) +static inline struct path +ecryptfs_lower_path(struct dentry *dentry) { - return &((struct ecryptfs_dentry_info *)dentry->d_fsdata)->lower_path; + return (struct path){ + .mnt = ecryptfs_superblock_to_private(dentry->d_sb)->lower_mnt, + .dentry = ecryptfs_dentry_to_lower(dentry) + }; } #define ecryptfs_printk(type, fmt, arg...) 
\ @@ -532,7 +529,6 @@ extern unsigned int ecryptfs_number_of_users; extern struct kmem_cache *ecryptfs_auth_tok_list_item_cache; extern struct kmem_cache *ecryptfs_file_info_cache; -extern struct kmem_cache *ecryptfs_dentry_info_cache; extern struct kmem_cache *ecryptfs_inode_info_cache; extern struct kmem_cache *ecryptfs_sb_info_cache; extern struct kmem_cache *ecryptfs_header_cache; @@ -557,7 +553,6 @@ int ecryptfs_encrypt_and_encode_filename( size_t *encoded_name_size, struct ecryptfs_mount_crypt_stat *mount_crypt_stat, const char *name, size_t name_size); -struct dentry *ecryptfs_lower_dentry(struct dentry *this_dentry); void ecryptfs_dump_hex(char *data, int bytes); int virt_to_scatterlist(const void *addr, int size, struct scatterlist *sg, int sg_size); diff --git a/fs/ecryptfs/file.c b/fs/ecryptfs/file.c index 5f8f96da09fe..7929411837cf 100644 --- a/fs/ecryptfs/file.c +++ b/fs/ecryptfs/file.c @@ -33,13 +33,12 @@ static ssize_t ecryptfs_read_update_atime(struct kiocb *iocb, struct iov_iter *to) { ssize_t rc; - const struct path *path; struct file *file = iocb->ki_filp; rc = generic_file_read_iter(iocb, to); if (rc >= 0) { - path = ecryptfs_dentry_to_lower_path(file->f_path.dentry); - touch_atime(path); + struct path path = ecryptfs_lower_path(file->f_path.dentry); + touch_atime(&path); } return rc; } @@ -59,12 +58,11 @@ static ssize_t ecryptfs_splice_read_update_atime(struct file *in, loff_t *ppos, size_t len, unsigned int flags) { ssize_t rc; - const struct path *path; rc = filemap_splice_read(in, ppos, pipe, len, flags); if (rc >= 0) { - path = ecryptfs_dentry_to_lower_path(in->f_path.dentry); - touch_atime(path); + struct path path = ecryptfs_lower_path(in->f_path.dentry); + touch_atime(&path); } return rc; } @@ -283,6 +281,7 @@ static int ecryptfs_dir_open(struct inode *inode, struct file *file) * ecryptfs_lookup() */ struct ecryptfs_file_info *file_info; struct file *lower_file; + struct path path; /* Released in ecryptfs_release or end of function if 
failure */ file_info = kmem_cache_zalloc(ecryptfs_file_info_cache, GFP_KERNEL); @@ -292,8 +291,8 @@ static int ecryptfs_dir_open(struct inode *inode, struct file *file) "Error attempting to allocate memory\n"); return -ENOMEM; } - lower_file = dentry_open(ecryptfs_dentry_to_lower_path(ecryptfs_dentry), - file->f_flags, current_cred()); + path = ecryptfs_lower_path(ecryptfs_dentry); + lower_file = dentry_open(&path, file->f_flags, current_cred()); if (IS_ERR(lower_file)) { printk(KERN_ERR "%s: Error attempting to initialize " "the lower file for the dentry with name " diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c index 72fbe1316ab8..d2b262dc485d 100644 --- a/fs/ecryptfs/inode.c +++ b/fs/ecryptfs/inode.c @@ -327,24 +327,15 @@ static int ecryptfs_i_size_read(struct dentry *dentry, struct inode *inode) static struct dentry *ecryptfs_lookup_interpose(struct dentry *dentry, struct dentry *lower_dentry) { - const struct path *path = ecryptfs_dentry_to_lower_path(dentry->d_parent); + struct dentry *lower_parent = ecryptfs_dentry_to_lower(dentry->d_parent); struct inode *inode, *lower_inode; - struct ecryptfs_dentry_info *dentry_info; int rc = 0; - dentry_info = kmem_cache_alloc(ecryptfs_dentry_info_cache, GFP_KERNEL); - if (!dentry_info) { - dput(lower_dentry); - return ERR_PTR(-ENOMEM); - } - fsstack_copy_attr_atime(d_inode(dentry->d_parent), - d_inode(path->dentry)); + d_inode(lower_parent)); BUG_ON(!d_count(lower_dentry)); - ecryptfs_set_dentry_private(dentry, dentry_info); - dentry_info->lower_path.mnt = mntget(path->mnt); - dentry_info->lower_path.dentry = lower_dentry; + ecryptfs_set_dentry_lower(dentry, lower_dentry); /* * negative dentry can go positive under us here - its parent is not @@ -1022,10 +1013,10 @@ static int ecryptfs_getattr(struct mnt_idmap *idmap, { struct dentry *dentry = path->dentry; struct kstat lower_stat; + struct path lower_path = ecryptfs_lower_path(dentry); int rc; - rc = vfs_getattr_nosec(ecryptfs_dentry_to_lower_path(dentry), - 
&lower_stat, request_mask, flags); + rc = vfs_getattr_nosec(&lower_path, &lower_stat, request_mask, flags); if (!rc) { fsstack_copy_attr_all(d_inode(dentry), ecryptfs_inode_to_lower(d_inode(dentry))); diff --git a/fs/ecryptfs/main.c b/fs/ecryptfs/main.c index eab1beb846d3..2afbcbbd9546 100644 --- a/fs/ecryptfs/main.c +++ b/fs/ecryptfs/main.c @@ -106,15 +106,14 @@ static int ecryptfs_init_lower_file(struct dentry *dentry, struct file **lower_file) { const struct cred *cred = current_cred(); - const struct path *path = ecryptfs_dentry_to_lower_path(dentry); + struct path path = ecryptfs_lower_path(dentry); int rc; - rc = ecryptfs_privileged_open(lower_file, path->dentry, path->mnt, - cred); + rc = ecryptfs_privileged_open(lower_file, path.dentry, path.mnt, cred); if (rc) { printk(KERN_ERR "Error opening lower file " "for lower_dentry [0x%p] and lower_mnt [0x%p]; " - "rc = [%d]\n", path->dentry, path->mnt, rc); + "rc = [%d]\n", path.dentry, path.mnt, rc); (*lower_file) = NULL; } return rc; @@ -437,7 +436,6 @@ static int ecryptfs_get_tree(struct fs_context *fc) struct ecryptfs_fs_context *ctx = fc->fs_private; struct ecryptfs_sb_info *sbi = fc->s_fs_info; struct ecryptfs_mount_crypt_stat *mount_crypt_stat; - struct ecryptfs_dentry_info *root_info; const char *err = "Getting sb failed"; struct inode *inode; struct path path; @@ -543,14 +541,8 @@ static int ecryptfs_get_tree(struct fs_context *fc) goto out_free; } - rc = -ENOMEM; - root_info = kmem_cache_zalloc(ecryptfs_dentry_info_cache, GFP_KERNEL); - if (!root_info) - goto out_free; - - /* ->kill_sb() will take care of root_info */ - ecryptfs_set_dentry_private(s->s_root, root_info); - root_info->lower_path = path; + ecryptfs_set_dentry_lower(s->s_root, path.dentry); + sbi->lower_mnt = path.mnt; s->s_flags |= SB_ACTIVE; fc->root = dget(s->s_root); @@ -580,6 +572,7 @@ static void ecryptfs_kill_block_super(struct super_block *sb) kill_anon_super(sb); if (!sb_info) return; + mntput(sb_info->lower_mnt); 
ecryptfs_destroy_mount_crypt_stat(&sb_info->mount_crypt_stat); kmem_cache_free(ecryptfs_sb_info_cache, sb_info); } @@ -667,11 +660,6 @@ static struct ecryptfs_cache_info { .name = "ecryptfs_file_cache", .size = sizeof(struct ecryptfs_file_info), }, - { - .cache = &ecryptfs_dentry_info_cache, - .name = "ecryptfs_dentry_info_cache", - .size = sizeof(struct ecryptfs_dentry_info), - }, { .cache = &ecryptfs_inode_info_cache, .name = "ecryptfs_inode_cache", -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
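The shape of the ecryptfs change can be summarized with a small userspace model. This is only a sketch under assumptions: the *_model structures and the sb_attach(), sb_kill(), set_lower() and lower_path() helpers are hypothetical and mirror only the ownership layout described in the commit message. One mount reference lives in the superblock-private data, each upper dentry stores a bare pointer to its lower dentry, and a path is assembled by value whenever a caller needs one.

```c
#include <assert.h>
#include <stddef.h>

struct mnt_model {
	int refs;
};

struct sb_model {
	struct mnt_model *lower_mnt;	/* the only counted mount reference */
};

struct dentry_model {
	struct sb_model *sb;
	void *fsdata;			/* points at the lower dentry */
};

struct path_model {
	struct mnt_model *mnt;
	struct dentry_model *dentry;
};

/* Mount time: pin the lower mount once, for the whole superblock. */
static void sb_attach(struct sb_model *sb, struct mnt_model *m)
{
	m->refs++;
	sb->lower_mnt = m;
}

/* Superblock teardown: drop that single reference. */
static void sb_kill(struct sb_model *sb)
{
	sb->lower_mnt->refs--;
	sb->lower_mnt = NULL;
}

static void set_lower(struct dentry_model *d, struct dentry_model *lower)
{
	d->fsdata = lower;
}

/* Build the path on demand; nothing is allocated and no count changes. */
static struct path_model lower_path(const struct dentry_model *d)
{
	return (struct path_model){
		.mnt = d->sb->lower_mnt,
		.dentry = d->fsdata,
	};
}
```

Callers that used to receive a pointer into per-dentry storage now hold a short-lived local copy and pass its address, which is what the touch_atime(), dentry_open() and vfs_getattr_nosec() call sites in the diff switch to.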
* [PATCH v3 53/65] fs/namespace.c: sanitize descriptions for {__,}lookup_mnt() 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (51 preceding siblings ...) 2025-09-03 4:55 ` [PATCH v3 52/65] ecryptfs: get rid of pointless mount references in ecryptfs dentries Al Viro @ 2025-09-03 4:55 ` Al Viro 2025-09-03 4:55 ` [PATCH v3 54/63] open_detached_copy(): don't bother with mount_lock_hash() Al Viro ` (21 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:55 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Comments regarding "shadow mounts" were stale - no such thing anymore. Document the locking requirements for __lookup_mnt(). Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 41 ++++++++++++----------------------------- 1 file changed, 12 insertions(+), 29 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 0900fd7456a9..a195e25a5d61 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -825,24 +825,16 @@ static bool legitimize_mnt(struct vfsmount *bastard, unsigned seq) } /** - * __lookup_mnt - find first child mount + * __lookup_mnt - mount hash lookup * @mnt: parent mount - * @dentry: mountpoint + * @dentry: dentry of mountpoint * - * If @mnt has a child mount @c mounted @dentry find and return it. + * If @mnt has a child mount @c mounted on @dentry find and return it. + * Caller must either hold the spinlock component of @mount_lock or + * hold rcu_read_lock(), sample the seqcount component before the call + * and recheck it afterwards. * - * Note that the child mount @c need not be unique. There are cases - * where shadow mounts are created. For example, during mount - * propagation when a source mount @mnt whose root got overmounted by a - * mount @o after path lookup but before @namespace_sem could be - * acquired gets copied and propagated. 
So @mnt gets copied including - * @o. When @mnt is propagated to a destination mount @d that already - * has another mount @n mounted at the same mountpoint then the source - * mount @mnt will be tucked beneath @n, i.e., @n will be mounted on - * @mnt and @mnt mounted on @d. Now both @n and @o are mounted at @mnt - * on @dentry. - * - * Return: The first child of @mnt mounted @dentry or NULL. + * Return: The child of @mnt mounted on @dentry or %NULL. */ struct mount *__lookup_mnt(struct vfsmount *mnt, struct dentry *dentry) { @@ -855,21 +847,12 @@ struct mount *__lookup_mnt(struct vfsmount *mnt, struct dentry *dentry) return NULL; } -/* - * lookup_mnt - Return the first child mount mounted at path - * - * "First" means first mounted chronologically. If you create the - * following mounts: - * - * mount /dev/sda1 /mnt - * mount /dev/sda2 /mnt - * mount /dev/sda3 /mnt - * - * Then lookup_mnt() on the base /mnt dentry in the root mount will - * return successively the root dentry and vfsmount of /dev/sda1, then - * /dev/sda2, then /dev/sda3, then NULL. +/** + * lookup_mnt - Return the child mount mounted at given location + * @path: location in the namespace * - * lookup_mnt takes a reference to the found vfsmount. + * Acquires and returns a new reference to mount at given location + * or %NULL if nothing is mounted there. */ struct vfsmount *lookup_mnt(const struct path *path) { -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
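The second locking option documented above for __lookup_mnt() (sample the seqcount, do the lookup, recheck) can be sketched in userspace C. This is only an illustration under stated assumptions: every demo_* name is hypothetical, C11 atomics stand in for the kernel seqcount primitives, and writers are assumed to be serialized by some external lock. It is not the code in fs/namespace.c.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a seqcount: an odd value means "writer active". */
struct demo_seqcount {
	atomic_uint seq;
};

/* Hypothetical stand-in for the hash: a single slot covered by the seqcount. */
struct demo_slot {
	struct demo_seqcount sc;
	atomic_int child;
};

static void demo_slot_init(struct demo_slot *s)
{
	atomic_init(&s->sc.seq, 0);
	atomic_init(&s->child, 0);
}

/* Sample the sequence number before looking anything up. */
static unsigned int demo_read_begin(struct demo_seqcount *sc)
{
	return atomic_load(&sc->seq);
}

/* True if a writer was active at sampling time or has run since then. */
static bool demo_read_retry(struct demo_seqcount *sc, unsigned int start)
{
	return (start & 1) || atomic_load(&sc->seq) != start;
}

/* Writers are assumed to be serialized by an external lock. */
static void demo_update(struct demo_slot *s, int child)
{
	assert(!(atomic_load(&s->sc.seq) & 1));
	atomic_fetch_add(&s->sc.seq, 1);	/* becomes odd: update in progress */
	atomic_store(&s->child, child);
	atomic_fetch_add(&s->sc.seq, 1);	/* becomes even: update finished */
}

/* Lockless reader: repeat the lookup until the sample is stable. */
static int demo_lookup(struct demo_slot *s)
{
	unsigned int seq;
	int child;

	do {
		seq = demo_read_begin(&s->sc);
		child = atomic_load(&s->child);
	} while (demo_read_retry(&s->sc, seq));
	return child;
}
```

A caller that cannot afford to loop could instead treat a failed recheck as "result unusable" and fall back to the locked variant; the sketch loops only because that is the shortest way to show the sample/recheck pairing.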
* [PATCH v3 54/63] open_detached_copy(): don't bother with mount_lock_hash() 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (52 preceding siblings ...) 2025-09-03 4:55 ` [PATCH v3 53/65] fs/namespace.c: sanitize descriptions for {__,}lookup_mnt() Al Viro @ 2025-09-03 4:55 ` Al Viro 2025-09-03 4:55 ` [PATCH v3 54/65] path_has_submounts(): use guard(mount_locked_reader) Al Viro ` (20 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:55 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds we are holding namespace_sem and a reference to root of tree; iterating through that tree does not need mount_lock. Neither does the insertion into the rbtree of new namespace or incrementing the mount count of that namespace. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 2e35f5eb4f81..425c33377770 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3086,14 +3086,12 @@ static struct file *open_detached_copy(struct path *path, bool recursive) return ERR_CAST(mnt); } - lock_mount_hash(); for (p = mnt; p; p = next_mnt(p, mnt)) { mnt_add_to_ns(ns, p); ns->nr_mounts++; } ns->root = mnt; mntget(&mnt->mnt); - unlock_mount_hash(); namespace_unlock(); mntput(path->mnt); -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v3 54/65] path_has_submounts(): use guard(mount_locked_reader) 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (53 preceding siblings ...) 2025-09-03 4:55 ` [PATCH v3 54/63] open_detached_copy(): don't bother with mount_lock_hash() Al Viro @ 2025-09-03 4:55 ` Al Viro 2025-09-03 4:55 ` [PATCH v3 55/65] open_detached_copy(): don't bother with mount_lock_hash() Al Viro ` (19 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:55 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Needed there since the callback passed to d_walk() (path_check_mount()) is using __path_is_mountpoint(), which uses __lookup_mnt(). Has to be taken in the caller - d_walk() might take rename_lock spinlock component and that nests inside mount_lock. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/dcache.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/dcache.c b/fs/dcache.c index 60046ae23d51..ab21a8402db0 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -1390,6 +1390,7 @@ struct check_mount { unsigned int mounted; }; +/* locks: mount_locked_reader && dentry->d_lock */ static enum d_walk_ret path_check_mount(void *data, struct dentry *dentry) { struct check_mount *info = data; @@ -1416,9 +1417,8 @@ int path_has_submounts(const struct path *parent) { struct check_mount data = { .mnt = parent->mnt, .mounted = 0 }; - read_seqlock_excl(&mount_lock); + guard(mount_locked_reader)(); d_walk(parent->dentry, &data, path_check_mount); - read_sequnlock_excl(&mount_lock); return data.mounted; } -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
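For readers unfamiliar with scope-based guards, the shape of the conversion above can be sketched in userspace C with the GCC/Clang cleanup attribute. Everything here is a hypothetical stand-in (demo_lock, DEMO_GUARD, demo_find); the kernel's guard() comes from cleanup.h and is not reproduced. The point illustrated is the one from the commit message: the lock is taken in the caller, and the callback runs with it already held.

```c
#include <assert.h>

/* Hypothetical stand-in for a lock; 'held' makes the effect observable. */
struct demo_lock {
	int held;
};

struct demo_guard {
	struct demo_lock *lock;
};

static struct demo_guard demo_guard_enter(struct demo_lock *lock)
{
	assert(!lock->held);	/* the stand-in is not recursive */
	lock->held = 1;
	return (struct demo_guard){ .lock = lock };
}

/* Runs automatically when the guard variable goes out of scope. */
static void demo_guard_exit(struct demo_guard *g)
{
	g->lock->held = 0;
}

#define DEMO_GUARD(lock)						\
	struct demo_guard demo_guard_var				\
	__attribute__((cleanup(demo_guard_exit), unused)) =		\
		demo_guard_enter(lock)

static struct demo_lock demo_mount_lock;

/* Callback that relies on the caller holding the lock. */
static int demo_is_negative(int v)
{
	assert(demo_mount_lock.held);
	return v < 0;
}

/* Walker: takes the lock once, in the caller, around the whole walk. */
static int demo_find(const int *items, int n, int (*match)(int))
{
	DEMO_GUARD(&demo_mount_lock);

	for (int i = 0; i < n; i++)
		if (match(items[i]))
			return i;	/* early return still releases the lock */
	return -1;
}
```

The early return inside the loop is where a guard pays off: there is no unlock call to forget on that path.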
* [PATCH v3 55/65] open_detached_copy(): don't bother with mount_lock_hash() 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (54 preceding siblings ...) 2025-09-03 4:55 ` [PATCH v3 54/65] path_has_submounts(): use guard(mount_locked_reader) Al Viro @ 2025-09-03 4:55 ` Al Viro 2025-09-03 4:55 ` [PATCH v3 55/63] open_detached_copy(): separate creation of namespace into helper Al Viro ` (18 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:55 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds we are holding namespace_sem and a reference to root of tree; iterating through that tree does not need mount_lock. Neither does the insertion into the rbtree of new namespace or incrementing the mount count of that namespace. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index a195e25a5d61..69ef608b8c3a 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3086,14 +3086,12 @@ static struct file *open_detached_copy(struct path *path, bool recursive) return ERR_CAST(mnt); } - lock_mount_hash(); for (p = mnt; p; p = next_mnt(p, mnt)) { mnt_add_to_ns(ns, p); ns->nr_mounts++; } ns->root = mnt; mntget(&mnt->mnt); - unlock_mount_hash(); namespace_unlock(); mntput(path->mnt); -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v3 55/63] open_detached_copy(): separate creation of namespace into helper 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (55 preceding siblings ...) 2025-09-03 4:55 ` [PATCH v3 55/65] open_detached_copy(): don't bother with mount_lock_hash() Al Viro @ 2025-09-03 4:55 ` Al Viro 2025-09-03 4:55 ` [PATCH v3 56/63] mnt_ns_tree_remove(): DTRT if mnt_ns had never been added to mnt_ns_list Al Viro ` (17 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:55 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds ... and convert the helper to use of a guard(namespace_excl) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 24 +++++++++++++++--------- 1 file changed, 15 insertions(+), 9 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 425c33377770..c324800e770c 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3053,18 +3053,17 @@ static int do_loopback(const struct path *path, const char *old_name, return err; } -static struct file *open_detached_copy(struct path *path, bool recursive) +static struct mnt_namespace *get_detached_copy(const struct path *path, bool recursive) { struct mnt_namespace *ns, *mnt_ns = current->nsproxy->mnt_ns, *src_mnt_ns; struct user_namespace *user_ns = mnt_ns->user_ns; struct mount *mnt, *p; - struct file *file; ns = alloc_mnt_ns(user_ns, true); if (IS_ERR(ns)) - return ERR_CAST(ns); + return ns; - namespace_lock(); + guard(namespace_excl)(); /* * Record the sequence number of the source mount namespace. 
@@ -3081,8 +3080,7 @@ static struct file *open_detached_copy(struct path *path, bool recursive) mnt = __do_loopback(path, recursive); if (IS_ERR(mnt)) { - namespace_unlock(); - free_mnt_ns(ns); + emptied_ns = ns; return ERR_CAST(mnt); } @@ -3091,11 +3089,19 @@ static struct file *open_detached_copy(struct path *path, bool recursive) ns->nr_mounts++; } ns->root = mnt; - mntget(&mnt->mnt); - namespace_unlock(); + return ns; +} + +static struct file *open_detached_copy(struct path *path, bool recursive) +{ + struct mnt_namespace *ns = get_detached_copy(path, recursive); + struct file *file; + + if (IS_ERR(ns)) + return ERR_CAST(ns); mntput(path->mnt); - path->mnt = &mnt->mnt; + path->mnt = mntget(&ns->root->mnt); file = dentry_open(path, O_PATH, current_cred()); if (IS_ERR(file)) dissolve_on_fput(path->mnt); -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v3 56/63] mnt_ns_tree_remove(): DTRT if mnt_ns had never been added to mnt_ns_list 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (56 preceding siblings ...) 2025-09-03 4:55 ` [PATCH v3 55/63] open_detached_copy(): separate creation of namespace into helper Al Viro @ 2025-09-03 4:55 ` Al Viro 2025-09-03 4:55 ` [PATCH v3 56/65] open_detached_copy(): separate creation of namespace into helper Al Viro ` (16 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:55 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Actual removal is done under the lock, but for checking if need to bother the lockless list_empty() is safe - either that namespace had never been added to mnt_ns_tree, in which case the list will stay empty, or whoever had allocated it has called mnt_ns_tree_add() and it has already run to completion. After that point list_empty() will become false and will remain false, no matter what we do with the neighbors in mnt_ns_list. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/namespace.c b/fs/namespace.c index c324800e770c..daa72292ea58 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -195,7 +195,7 @@ static void mnt_ns_release_rcu(struct rcu_head *rcu) static void mnt_ns_tree_remove(struct mnt_namespace *ns) { /* remove from global mount namespace list */ - if (!is_anon_ns(ns)) { + if (!list_empty(&ns->mnt_ns_list)) { mnt_ns_tree_write_lock(); rb_erase(&ns->mnt_ns_tree_node, &mnt_ns_tree); list_bidir_del_rcu(&ns->mnt_ns_list); -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
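The argument in this commit message rests on one property of circular lists: a node that was initialised to point at itself and never linked in stays "empty", while a linked node's next pointer refers to a neighbour or to the list head, never to itself. A minimal userspace sketch of that check follows. All demo_* names are hypothetical, the registry lock is reduced to a comment, and the kernel's list and RCU helpers are not used.

```c
#include <assert.h>
#include <stdbool.h>

struct demo_list {
	struct demo_list *next, *prev;
};

static void demo_list_init(struct demo_list *node)
{
	node->next = node->prev = node;
}

/* A node linked into a list never points at itself. */
static bool demo_list_empty(const struct demo_list *node)
{
	return node->next == node;
}

static void demo_list_add(struct demo_list *node, struct demo_list *head)
{
	node->next = head->next;
	node->prev = head;
	head->next->prev = node;
	head->next = node;
}

static void demo_list_del(struct demo_list *node)
{
	node->prev->next = node->next;
	node->next->prev = node->prev;
}

/* Hypothetical object that may or may not ever get registered. */
struct demo_ns {
	struct demo_list link;
};

static struct demo_list demo_registry = { &demo_registry, &demo_registry };
static int demo_registered;

static void demo_ns_init(struct demo_ns *ns)
{
	demo_list_init(&ns->link);
}

static void demo_ns_register(struct demo_ns *ns)
{
	assert(demo_list_empty(&ns->link));	/* register at most once */
	demo_list_add(&ns->link, &demo_registry);
	demo_registered++;
}

/* Returns true if there was anything to undo. */
static bool demo_ns_unregister(struct demo_ns *ns)
{
	if (demo_list_empty(&ns->link))
		return false;	/* never registered: nothing to undo */
	/* a real implementation would take the registry lock here */
	demo_list_del(&ns->link);
	demo_registered--;
	return true;
}
```

With this shape, the failure path of an allocator can call the same teardown function whether or not registration ever happened, which is what lets later patches drop open-coded cleanup.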
* [PATCH v3 56/65] open_detached_copy(): separate creation of namespace into helper 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (57 preceding siblings ...) 2025-09-03 4:55 ` [PATCH v3 56/63] mnt_ns_tree_remove(): DTRT if mnt_ns had never been added to mnt_ns_list Al Viro @ 2025-09-03 4:55 ` Al Viro 2025-09-03 4:55 ` [PATCH v3 57/63] copy_mnt_ns(): use the regular mechanism for freeing empty mnt_ns on failure Al Viro ` (15 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:55 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds ... and convert the helper to use of a guard(namespace_excl) Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 24 +++++++++++++++--------- 1 file changed, 15 insertions(+), 9 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 69ef608b8c3a..5b802cd33058 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -3053,18 +3053,17 @@ static int do_loopback(const struct path *path, const char *old_name, return err; } -static struct file *open_detached_copy(struct path *path, bool recursive) +static struct mnt_namespace *get_detached_copy(const struct path *path, bool recursive) { struct mnt_namespace *ns, *mnt_ns = current->nsproxy->mnt_ns, *src_mnt_ns; struct user_namespace *user_ns = mnt_ns->user_ns; struct mount *mnt, *p; - struct file *file; ns = alloc_mnt_ns(user_ns, true); if (IS_ERR(ns)) - return ERR_CAST(ns); + return ns; - namespace_lock(); + guard(namespace_excl)(); /* * Record the sequence number of the source mount namespace. 
@@ -3081,8 +3080,7 @@ static struct file *open_detached_copy(struct path *path, bool recursive) mnt = __do_loopback(path, recursive); if (IS_ERR(mnt)) { - namespace_unlock(); - free_mnt_ns(ns); + emptied_ns = ns; return ERR_CAST(mnt); } @@ -3091,11 +3089,19 @@ static struct file *open_detached_copy(struct path *path, bool recursive) ns->nr_mounts++; } ns->root = mnt; - mntget(&mnt->mnt); - namespace_unlock(); + return ns; +} + +static struct file *open_detached_copy(struct path *path, bool recursive) +{ + struct mnt_namespace *ns = get_detached_copy(path, recursive); + struct file *file; + + if (IS_ERR(ns)) + return ERR_CAST(ns); mntput(path->mnt); - path->mnt = &mnt->mnt; + path->mnt = mntget(&ns->root->mnt); file = dentry_open(path, O_PATH, current_cred()); if (IS_ERR(file)) dissolve_on_fput(path->mnt); -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v3 57/63] copy_mnt_ns(): use the regular mechanism for freeing empty mnt_ns on failure 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (58 preceding siblings ...) 2025-09-03 4:55 ` [PATCH v3 56/65] open_detached_copy(): separate creation of namespace into helper Al Viro @ 2025-09-03 4:55 ` Al Viro 2025-09-03 4:55 ` [PATCH v3 57/65] mnt_ns_tree_remove(): DTRT if mnt_ns had never been added to mnt_ns_list Al Viro ` (14 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:55 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Now that free_mnt_ns() works prior to mnt_ns_tree_add(), there's no need for an open-coded analogue of free_mnt_ns() there - yes, we do avoid one call_rcu() use per failing call of clone() or unshare(), if they fail due to OOM in that particular spot, but it's not really worth bothering. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index daa72292ea58..a418555586ef 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -4190,10 +4190,8 @@ struct mnt_namespace *copy_mnt_ns(unsigned long flags, struct mnt_namespace *ns, copy_flags |= CL_SLAVE; new = copy_tree(old, old->mnt.mnt_root, copy_flags); if (IS_ERR(new)) { + emptied_ns = new_ns; namespace_unlock(); - ns_free_inum(&new_ns->ns); - dec_mnt_namespaces(new_ns->ucounts); - mnt_ns_release(new_ns); return ERR_CAST(new); } if (user_ns != ns->user_ns) { -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v3 57/65] mnt_ns_tree_remove(): DTRT if mnt_ns had never been added to mnt_ns_list 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (59 preceding siblings ...) 2025-09-03 4:55 ` [PATCH v3 57/63] copy_mnt_ns(): use the regular mechanism for freeing empty mnt_ns on failure Al Viro @ 2025-09-03 4:55 ` Al Viro 2025-09-03 4:55 ` [PATCH v3 58/63] copy_mnt_ns(): use guards Al Viro ` (13 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:55 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Actual removal is done under the lock, but for checking if need to bother the lockless list_empty() is safe - either that namespace had never been added to mnt_ns_tree, in which case the list will stay empty, or whoever had allocated it has called mnt_ns_tree_add() and it has already run to completion. After that point list_empty() will become false and will remain false, no matter what we do with the neighbors in mnt_ns_list. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/namespace.c b/fs/namespace.c index 5b802cd33058..c175536cc7b5 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -195,7 +195,7 @@ static void mnt_ns_release_rcu(struct rcu_head *rcu) static void mnt_ns_tree_remove(struct mnt_namespace *ns) { /* remove from global mount namespace list */ - if (!is_anon_ns(ns)) { + if (!list_empty(&ns->mnt_ns_list)) { mnt_ns_tree_write_lock(); rb_erase(&ns->mnt_ns_tree_node, &mnt_ns_tree); list_bidir_del_rcu(&ns->mnt_ns_list); -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v3 58/63] copy_mnt_ns(): use guards 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (60 preceding siblings ...) 2025-09-03 4:55 ` [PATCH v3 57/65] mnt_ns_tree_remove(): DTRT if mnt_ns had never been added to mnt_ns_list Al Viro @ 2025-09-03 4:55 ` Al Viro 2025-09-03 4:55 ` [PATCH v3 58/65] copy_mnt_ns(): use the regular mechanism for freeing empty mnt_ns on failure Al Viro ` (12 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:55 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds * mntput() of rootmnt and pwdmnt done via __free(mntput) * mnt_ns_tree_add() can be done within namespace_excl scope. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 17 ++++------------- 1 file changed, 4 insertions(+), 13 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index a418555586ef..9e16231d4561 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -4164,7 +4164,8 @@ struct mnt_namespace *copy_mnt_ns(unsigned long flags, struct mnt_namespace *ns, struct user_namespace *user_ns, struct fs_struct *new_fs) { struct mnt_namespace *new_ns; - struct vfsmount *rootmnt = NULL, *pwdmnt = NULL; + struct vfsmount *rootmnt __free(mntput) = NULL; + struct vfsmount *pwdmnt __free(mntput) = NULL; struct mount *p, *q; struct mount *old; struct mount *new; @@ -4183,7 +4184,7 @@ struct mnt_namespace *copy_mnt_ns(unsigned long flags, struct mnt_namespace *ns, if (IS_ERR(new_ns)) return new_ns; - namespace_lock(); + guard(namespace_excl)(); /* First pass: copy the tree topology */ copy_flags = CL_COPY_UNBINDABLE | CL_EXPIRE; if (user_ns != ns->user_ns) @@ -4191,13 +4192,11 @@ struct mnt_namespace *copy_mnt_ns(unsigned long flags, struct mnt_namespace *ns, new = copy_tree(old, old->mnt.mnt_root, copy_flags); if (IS_ERR(new)) { emptied_ns = new_ns; - namespace_unlock(); return ERR_CAST(new); } if (user_ns != ns->user_ns) { - lock_mount_hash(); + 
guard(mount_writer)(); lock_mnt_tree(new); - unlock_mount_hash(); } new_ns->root = new; @@ -4229,14 +4228,6 @@ struct mnt_namespace *copy_mnt_ns(unsigned long flags, struct mnt_namespace *ns, while (p->mnt.mnt_root != q->mnt.mnt_root) p = next_mnt(skip_mnt_tree(p), old); } - namespace_unlock(); - - if (rootmnt) - mntput(rootmnt); - if (pwdmnt) - mntput(pwdmnt); - - mnt_ns_tree_add(new_ns); return new_ns; } -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v3 58/65] copy_mnt_ns(): use the regular mechanism for freeing empty mnt_ns on failure 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (61 preceding siblings ...) 2025-09-03 4:55 ` [PATCH v3 58/63] copy_mnt_ns(): use guards Al Viro @ 2025-09-03 4:55 ` Al Viro 2025-09-03 4:55 ` [PATCH v3 59/65] copy_mnt_ns(): use guards Al Viro ` (11 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:55 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Now that free_mnt_ns() works prior to mnt_ns_tree_add(), there's no need for an open-coded analogue of free_mnt_ns() there - yes, we do avoid one call_rcu() use per failing call of clone() or unshare(), if they fail due to OOM in that particular spot, but it's not really worth bothering. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index c175536cc7b5..0cd62478ff36 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -4190,10 +4190,8 @@ struct mnt_namespace *copy_mnt_ns(unsigned long flags, struct mnt_namespace *ns, copy_flags |= CL_SLAVE; new = copy_tree(old, old->mnt.mnt_root, copy_flags); if (IS_ERR(new)) { + emptied_ns = new_ns; namespace_unlock(); - ns_free_inum(&new_ns->ns); - dec_mnt_namespaces(new_ns->ucounts); - mnt_ns_release(new_ns); return ERR_CAST(new); } if (user_ns != ns->user_ns) { -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v3 59/65] copy_mnt_ns(): use guards 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (62 preceding siblings ...) 2025-09-03 4:55 ` [PATCH v3 58/65] copy_mnt_ns(): use the regular mechanism for freeing empty mnt_ns on failure Al Viro @ 2025-09-03 4:55 ` Al Viro 2025-09-03 4:55 ` [PATCH v3 59/63] simplify the callers of mnt_unhold_writers() Al Viro ` (10 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:55 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds * mntput() of rootmnt and pwdmnt done via __free(mntput) * mnt_ns_tree_add() can be done within namespace_excl scope. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 16 ++++------------ 1 file changed, 4 insertions(+), 12 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 0cd62478ff36..3bb9f7ac4be6 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -4164,7 +4164,8 @@ struct mnt_namespace *copy_mnt_ns(unsigned long flags, struct mnt_namespace *ns, struct user_namespace *user_ns, struct fs_struct *new_fs) { struct mnt_namespace *new_ns; - struct vfsmount *rootmnt = NULL, *pwdmnt = NULL; + struct vfsmount *rootmnt __free(mntput) = NULL; + struct vfsmount *pwdmnt __free(mntput) = NULL; struct mount *p, *q; struct mount *old; struct mount *new; @@ -4183,7 +4184,7 @@ struct mnt_namespace *copy_mnt_ns(unsigned long flags, struct mnt_namespace *ns, if (IS_ERR(new_ns)) return new_ns; - namespace_lock(); + guard(namespace_excl)(); /* First pass: copy the tree topology */ copy_flags = CL_COPY_UNBINDABLE | CL_EXPIRE; if (user_ns != ns->user_ns) @@ -4191,13 +4192,11 @@ struct mnt_namespace *copy_mnt_ns(unsigned long flags, struct mnt_namespace *ns, new = copy_tree(old, old->mnt.mnt_root, copy_flags); if (IS_ERR(new)) { emptied_ns = new_ns; - namespace_unlock(); return ERR_CAST(new); } if (user_ns != ns->user_ns) { - lock_mount_hash(); + guard(mount_writer)(); 
lock_mnt_tree(new); - unlock_mount_hash(); } new_ns->root = new; @@ -4229,13 +4228,6 @@ struct mnt_namespace *copy_mnt_ns(unsigned long flags, struct mnt_namespace *ns, while (p->mnt.mnt_root != q->mnt.mnt_root) p = next_mnt(skip_mnt_tree(p), old); } - namespace_unlock(); - - if (rootmnt) - mntput(rootmnt); - if (pwdmnt) - mntput(pwdmnt); - mnt_ns_tree_add(new_ns); return new_ns; } -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
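The __free(mntput) part of the conversion above follows a general pattern: a pointer variable starts out NULL, may take over a reference somewhere in the function, and has that reference dropped automatically on every exit path. A userspace sketch under stated assumptions: demo_ref, demo_get(), demo_put() and DEMO_FREE_PUT are hypothetical names, the GCC/Clang cleanup attribute stands in for the kernel's __free(), and nothing here is the vfsmount refcounting code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical refcounted object. */
struct demo_ref {
	int count;
};

static struct demo_ref *demo_get(struct demo_ref *r)
{
	if (r)
		r->count++;
	return r;
}

static void demo_put(struct demo_ref *r)
{
	if (!r)
		return;		/* NULL is fine: nothing was taken over */
	assert(r->count > 0);
	r->count--;
}

/* The cleanup attribute hands us a pointer to the annotated variable. */
static void demo_put_cleanup(struct demo_ref **p)
{
	demo_put(*p);
}

#define DEMO_FREE_PUT __attribute__((cleanup(demo_put_cleanup)))

/*
 * Replace the object referenced by *slot with new_root.  The reference
 * previously held by the slot is dropped when 'old' leaves scope.
 */
static int demo_switch_root(struct demo_ref **slot, struct demo_ref *new_root,
			    int fail)
{
	struct demo_ref *old DEMO_FREE_PUT = NULL;

	if (fail)
		return -1;	/* early exit: cleanup sees NULL, drops nothing */

	old = *slot;		/* take over the slot's reference */
	*slot = demo_get(new_root);
	return 0;		/* 'old' is put automatically here */
}
```

Declaring the variable with a NULL initialiser is what makes the early-exit path safe; an uninitialised pointer with a cleanup handler would be dropped with garbage in it.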
* [PATCH v3 59/63] simplify the callers of mnt_unhold_writers() 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (63 preceding siblings ...) 2025-09-03 4:55 ` [PATCH v3 59/65] copy_mnt_ns(): use guards Al Viro @ 2025-09-03 4:55 ` Al Viro 2025-09-03 4:55 ` [PATCH v3 60/63] setup_mnt(): primitive for connecting a mount to filesystem Al Viro ` (9 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:55 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds The cleanup-on-failure logic in mount_setattr_prepare() is simplified by having the mnt_hold_writers() failure followed by advancing m to the next node in the tree before leaving the loop. And since all calls of mnt_unhold_writers() are preceded by the same check that the flag has been set and the function is inlined, let's just shift the check into it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 34 ++++++++++------------------------ 1 file changed, 10 insertions(+), 24 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 9e16231d4561..d8df1046e2f9 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -714,13 +714,14 @@ static inline int mnt_hold_writers(struct mount *mnt) * Stop preventing write access to @mnt allowing callers to gain write access * to @mnt again. * - * This function can only be called after a successful call to - * mnt_hold_writers(). + * This function can only be called after a call to mnt_hold_writers(). * * Context: This function expects lock_mount_hash() to be held. */ static inline void mnt_unhold_writers(struct mount *mnt) { + if (!(mnt->mnt.mnt_flags & MNT_WRITE_HOLD)) + return; /* * MNT_READONLY must become visible before ~MNT_WRITE_HOLD, so writers * that become unheld will see MNT_READONLY. 
@@ -4773,8 +4774,10 @@ static int mount_setattr_prepare(struct mount_kattr *kattr, struct mount *mnt) if (!mnt_allow_writers(kattr, m)) { err = mnt_hold_writers(m); - if (err) + if (err) { + m = next_mnt(m, mnt); break; + } } if (!(kattr->kflags & MOUNT_KATTR_RECURSE)) @@ -4782,25 +4785,9 @@ static int mount_setattr_prepare(struct mount_kattr *kattr, struct mount *mnt) } if (err) { - struct mount *p; - - /* - * If we had to call mnt_hold_writers() MNT_WRITE_HOLD will - * be set in @mnt_flags. The loop unsets MNT_WRITE_HOLD for all - * mounts and needs to take care to include the first mount. - */ - for (p = mnt; p; p = next_mnt(p, mnt)) { - /* If we had to hold writers unblock them. */ - if (p->mnt.mnt_flags & MNT_WRITE_HOLD) - mnt_unhold_writers(p); - - /* - * We're done once the first mount we changed got - * MNT_WRITE_HOLD unset. - */ - if (p == m) - break; - } + /* undo all mnt_hold_writers() we'd done */ + for (struct mount *p = mnt; p != m; p = next_mnt(p, mnt)) + mnt_unhold_writers(p); } return err; } @@ -4831,8 +4818,7 @@ static void mount_setattr_commit(struct mount_kattr *kattr, struct mount *mnt) WRITE_ONCE(m->mnt.mnt_flags, flags); /* If we had to hold writers unblock them. */ - if (m->mnt.mnt_flags & MNT_WRITE_HOLD) - mnt_unhold_writers(m); + mnt_unhold_writers(m); if (kattr->propagation) change_mnt_propagation(m, kattr->propagation); -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
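The simplification above turns the undo loop into a walk over a half-open range: on failure the cursor is advanced past the failing element, so "everything before the cursor" includes it, and the unhold step is a no-op for elements that were never held. A userspace sketch of that shape, with an array in place of the mount tree. All demo_* names and the needs_hold/busy fields are hypothetical; the assumption carried over from the diff is that a failed hold leaves its mark set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_item {
	bool needs_hold;	/* stand-in for "writers must be blocked" */
	bool busy;		/* stand-in for "someone has write access" */
	bool held;		/* stand-in for the hold flag */
};

/* Marks the item first; the mark stays set even when we report failure. */
static int demo_hold(struct demo_item *it)
{
	it->held = true;
	return it->busy ? -1 : 0;
}

/* Safe to call on any item: the check lives inside the helper. */
static void demo_unhold(struct demo_item *it)
{
	if (!it->held)
		return;
	it->held = false;
}

static int demo_prepare(struct demo_item *items, size_t n)
{
	size_t end;
	int err = 0;

	for (end = 0; end < n; end++) {
		if (!items[end].needs_hold)
			continue;
		err = demo_hold(&items[end]);
		if (err) {
			end++;	/* step past the failing item before leaving */
			break;
		}
	}

	if (err) {
		assert(end <= n);
		/* undo every hold we'd done, failing item included */
		for (size_t i = 0; i < end; i++)
			demo_unhold(&items[i]);
	}
	return err;
}
```

Without the extra step past the failing element the undo loop would need a special case for "the last one we touched", which is what the removed kernel comment was describing.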
* [PATCH v3 60/63] setup_mnt(): primitive for connecting a mount to filesystem 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (64 preceding siblings ...) 2025-09-03 4:55 ` [PATCH v3 59/63] simplify the callers of mnt_unhold_writers() Al Viro @ 2025-09-03 4:55 ` Al Viro 2025-09-03 4:55 ` [PATCH v3 60/65] simplify the callers of mnt_unhold_writers() Al Viro ` (8 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:55 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Take the identical logic in vfs_create_mount() and clone_mnt() into a new helper that takes an empty struct mount and attaches it to the given dentry (sub)tree. Should be called once in the lifetime of every mount, prior to making it visible in any data structures. After that point ->mnt_root and ->mnt_sb never change; ->mnt_root is a counting reference to dentry and ->mnt_sb - an active reference to superblock. Mount remains associated with that dentry tree all the way until the call of cleanup_mnt(), when the refcount eventually drops to zero.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 34 +++++++++++++++++----------------- 1 file changed, 17 insertions(+), 17 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index d8df1046e2f9..c769fc4051e0 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -1196,6 +1196,21 @@ static void commit_tree(struct mount *mnt) touch_mnt_namespace(n); } +static void setup_mnt(struct mount *m, struct dentry *root) +{ + struct super_block *s = root->d_sb; + + atomic_inc(&s->s_active); + m->mnt.mnt_sb = s; + m->mnt.mnt_root = dget(root); + m->mnt_mountpoint = m->mnt.mnt_root; + m->mnt_parent = m; + + lock_mount_hash(); + list_add_tail(&m->mnt_instance, &s->s_mounts); + unlock_mount_hash(); +} + /** * vfs_create_mount - Create a mount for a configured superblock * @fc: The configuration context with the superblock attached @@ -1219,15 +1234,8 @@ struct vfsmount *vfs_create_mount(struct fs_context *fc) if (fc->sb_flags & SB_KERNMOUNT) mnt->mnt.mnt_flags = MNT_INTERNAL; - atomic_inc(&fc->root->d_sb->s_active); - mnt->mnt.mnt_sb = fc->root->d_sb; - mnt->mnt.mnt_root = dget(fc->root); - mnt->mnt_mountpoint = mnt->mnt.mnt_root; - mnt->mnt_parent = mnt; + setup_mnt(mnt, fc->root); - lock_mount_hash(); - list_add_tail(&mnt->mnt_instance, &mnt->mnt.mnt_sb->s_mounts); - unlock_mount_hash(); return &mnt->mnt; } EXPORT_SYMBOL(vfs_create_mount); @@ -1285,7 +1293,6 @@ EXPORT_SYMBOL_GPL(vfs_kern_mount); static struct mount *clone_mnt(struct mount *old, struct dentry *root, int flag) { - struct super_block *sb = old->mnt.mnt_sb; struct mount *mnt; int err; @@ -1310,16 +1317,9 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root, if (mnt->mnt_group_id) set_mnt_shared(mnt); - atomic_inc(&sb->s_active); mnt->mnt.mnt_idmap = mnt_idmap_get(mnt_idmap(&old->mnt)); - mnt->mnt.mnt_sb = sb; - mnt->mnt.mnt_root = dget(root); - mnt->mnt_mountpoint = mnt->mnt.mnt_root; - mnt->mnt_parent = mnt; - lock_mount_hash(); - list_add_tail(&mnt->mnt_instance, 
&sb->s_mounts); - unlock_mount_hash(); + setup_mnt(mnt, root); if (flag & CL_PRIVATE) // we are done with it return mnt; -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v3 60/65] simplify the callers of mnt_unhold_writers() 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (65 preceding siblings ...) 2025-09-03 4:55 ` [PATCH v3 60/63] setup_mnt(): primitive for connecting a mount to filesystem Al Viro @ 2025-09-03 4:55 ` Al Viro 2025-09-03 4:55 ` [PATCH v3 61/63] preparations to taking MNT_WRITE_HOLD out of ->mnt_flags Al Viro ` (7 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:55 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds The cleanup-on-failure logic in mount_setattr_prepare() is simplified by having the mnt_hold_writers() failure followed by advancing m to the next node in the tree before leaving the loop. And since all calls of mnt_unhold_writers() are preceded by the same check that the flag has been set and the function is inlined, let's just shift the check into it. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 34 ++++++++++------------------------ 1 file changed, 10 insertions(+), 24 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index 3bb9f7ac4be6..b4d287c0af4a 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -714,13 +714,14 @@ static inline int mnt_hold_writers(struct mount *mnt) * Stop preventing write access to @mnt allowing callers to gain write access * to @mnt again. * - * This function can only be called after a successful call to - * mnt_hold_writers(). + * This function can only be called after a call to mnt_hold_writers(). * * Context: This function expects lock_mount_hash() to be held. */ static inline void mnt_unhold_writers(struct mount *mnt) { + if (!(mnt->mnt.mnt_flags & MNT_WRITE_HOLD)) + return; /* * MNT_READONLY must become visible before ~MNT_WRITE_HOLD, so writers * that become unheld will see MNT_READONLY. 
@@ -4774,8 +4775,10 @@ static int mount_setattr_prepare(struct mount_kattr *kattr, struct mount *mnt) if (!mnt_allow_writers(kattr, m)) { err = mnt_hold_writers(m); - if (err) + if (err) { + m = next_mnt(m, mnt); break; + } } if (!(kattr->kflags & MOUNT_KATTR_RECURSE)) @@ -4783,25 +4786,9 @@ static int mount_setattr_prepare(struct mount_kattr *kattr, struct mount *mnt) } if (err) { - struct mount *p; - - /* - * If we had to call mnt_hold_writers() MNT_WRITE_HOLD will - * be set in @mnt_flags. The loop unsets MNT_WRITE_HOLD for all - * mounts and needs to take care to include the first mount. - */ - for (p = mnt; p; p = next_mnt(p, mnt)) { - /* If we had to hold writers unblock them. */ - if (p->mnt.mnt_flags & MNT_WRITE_HOLD) - mnt_unhold_writers(p); - - /* - * We're done once the first mount we changed got - * MNT_WRITE_HOLD unset. - */ - if (p == m) - break; - } + /* undo all mnt_hold_writers() we'd done */ + for (struct mount *p = mnt; p != m; p = next_mnt(p, mnt)) + mnt_unhold_writers(p); } return err; } @@ -4832,8 +4819,7 @@ static void mount_setattr_commit(struct mount_kattr *kattr, struct mount *mnt) WRITE_ONCE(m->mnt.mnt_flags, flags); /* If we had to hold writers unblock them. */ - if (m->mnt.mnt_flags & MNT_WRITE_HOLD) - mnt_unhold_writers(m); + mnt_unhold_writers(m); if (kattr->propagation) change_mnt_propagation(m, kattr->propagation); -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v3 61/63] preparations to taking MNT_WRITE_HOLD out of ->mnt_flags 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (66 preceding siblings ...) 2025-09-03 4:55 ` [PATCH v3 60/65] simplify the callers of mnt_unhold_writers() Al Viro @ 2025-09-03 4:55 ` Al Viro 2025-09-03 4:55 ` [PATCH v3 61/65] setup_mnt(): primitive for connecting a mount to filesystem Al Viro ` (6 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:55 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds We have an unpleasant wart in accessibility rules for struct mount. There are per-superblock lists of mounts, used by sb_prepare_remount_readonly() to check if any of those is currently claimed for write access and to block further attempts to get write access on those until we are done. As soon as it is attached to a filesystem, mount becomes reachable via that list. Only sb_prepare_remount_readonly() traverses it and it only accesses a few members of struct mount. Unfortunately, ->mnt_flags is one of those and it is modified - MNT_WRITE_HOLD set and then cleared. It is done under mount_lock, so from the locking rules POV everything's fine. However, it has easily overlooked implications - once mount has been attached to a filesystem, it has to be treated as globally visible. In particular, initializing ->mnt_flags *must* be done either prior to that point or under mount_lock. All other members are still private at that point. Life gets simpler if we move that bit (and that's *all* that can get touched by access via this list) out of ->mnt_flags. It's not even hard to do - currently the list is implemented as list_head one, anchored in super_block->s_mounts and linked via mount->mnt_instance. 
As the first step, switch it to hlist-like open-coded structure - address of the first mount in the set is stored in ->s_mounts and ->mnt_instance replaced with ->mnt_next_for_sb and ->mnt_pprev_for_sb - the former either NULL or pointing to the next mount in set, the latter - address of either ->s_mounts or ->mnt_next_for_sb in the previous element of the set. In the next commit we'll steal the LSB of ->mnt_pprev_for_sb as replacement for MNT_WRITE_HOLD. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/mount.h | 4 +++- fs/namespace.c | 38 +++++++++++++++++++++++++++++--------- fs/super.c | 3 +-- include/linux/fs.h | 4 +++- 4 files changed, 36 insertions(+), 13 deletions(-) diff --git a/fs/mount.h b/fs/mount.h index 04d0eadc4c10..b208f69f69d7 100644 --- a/fs/mount.h +++ b/fs/mount.h @@ -64,7 +64,9 @@ struct mount { #endif struct list_head mnt_mounts; /* list of children, anchored here */ struct list_head mnt_child; /* and going through their mnt_child */ - struct list_head mnt_instance; /* mount instance on sb->s_mounts */ + struct mount *mnt_next_for_sb; /* the next two fields are hlist_node, */ + struct mount * __aligned(1) *mnt_pprev_for_sb; + /* except that LSB of pprev will be stolen */ const char *mnt_devname; /* Name of device e.g. 
/dev/dsk/hda1 */ struct list_head mnt_list; struct list_head mnt_expire; /* link in fs-specific expiry list */ diff --git a/fs/namespace.c b/fs/namespace.c index c769fc4051e0..06be5b65b559 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -730,6 +730,27 @@ static inline void mnt_unhold_writers(struct mount *mnt) mnt->mnt.mnt_flags &= ~MNT_WRITE_HOLD; } +static inline void mnt_del_instance(struct mount *m) +{ + struct mount **p = m->mnt_pprev_for_sb; + struct mount *next = m->mnt_next_for_sb; + + if (next) + next->mnt_pprev_for_sb = p; + *p = next; +} + +static inline void mnt_add_instance(struct mount *m, struct super_block *s) +{ + struct mount *first = s->s_mounts; + + if (first) + first->mnt_pprev_for_sb = &m->mnt_next_for_sb; + m->mnt_next_for_sb = first; + m->mnt_pprev_for_sb = &s->s_mounts; + s->s_mounts = m; +} + static int mnt_make_readonly(struct mount *mnt) { int ret; @@ -743,7 +764,6 @@ static int mnt_make_readonly(struct mount *mnt) int sb_prepare_remount_readonly(struct super_block *sb) { - struct mount *mnt; int err = 0; /* Racy optimization. 
Recheck the counter under MNT_WRITE_HOLD */ @@ -751,9 +771,9 @@ int sb_prepare_remount_readonly(struct super_block *sb) return -EBUSY; lock_mount_hash(); - list_for_each_entry(mnt, &sb->s_mounts, mnt_instance) { - if (!(mnt->mnt.mnt_flags & MNT_READONLY)) { - err = mnt_hold_writers(mnt); + for (struct mount *m = sb->s_mounts; m; m = m->mnt_next_for_sb) { + if (!(m->mnt.mnt_flags & MNT_READONLY)) { + err = mnt_hold_writers(m); if (err) break; } @@ -763,9 +783,9 @@ int sb_prepare_remount_readonly(struct super_block *sb) if (!err) sb_start_ro_state_change(sb); - list_for_each_entry(mnt, &sb->s_mounts, mnt_instance) { - if (mnt->mnt.mnt_flags & MNT_WRITE_HOLD) - mnt->mnt.mnt_flags &= ~MNT_WRITE_HOLD; + for (struct mount *m = sb->s_mounts; m; m = m->mnt_next_for_sb) { + if (m->mnt.mnt_flags & MNT_WRITE_HOLD) + m->mnt.mnt_flags &= ~MNT_WRITE_HOLD; } unlock_mount_hash(); @@ -1207,7 +1227,7 @@ static void setup_mnt(struct mount *m, struct dentry *root) m->mnt_parent = m; lock_mount_hash(); - list_add_tail(&m->mnt_instance, &s->s_mounts); + mnt_add_instance(m, s); unlock_mount_hash(); } @@ -1425,7 +1445,7 @@ static void mntput_no_expire(struct mount *mnt) mnt->mnt.mnt_flags |= MNT_DOOMED; rcu_read_unlock(); - list_del(&mnt->mnt_instance); + mnt_del_instance(mnt); if (unlikely(!list_empty(&mnt->mnt_expire))) list_del(&mnt->mnt_expire); diff --git a/fs/super.c b/fs/super.c index 7f876f32343a..3b0f49e1b817 100644 --- a/fs/super.c +++ b/fs/super.c @@ -323,7 +323,6 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags, if (!s) return NULL; - INIT_LIST_HEAD(&s->s_mounts); s->s_user_ns = get_user_ns(user_ns); init_rwsem(&s->s_umount); lockdep_set_class(&s->s_umount, &type->s_umount_key); @@ -408,7 +407,7 @@ static void __put_super(struct super_block *s) list_del_init(&s->s_list); WARN_ON(s->s_dentry_lru.node); WARN_ON(s->s_inode_lru.node); - WARN_ON(!list_empty(&s->s_mounts)); + WARN_ON(s->s_mounts); call_rcu(&s->rcu, destroy_super_rcu); } } diff --git 
a/include/linux/fs.h b/include/linux/fs.h index d7ab4f96d705..0e9c7f1460dc 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1324,6 +1324,8 @@ struct sb_writers { struct percpu_rw_semaphore rw_sem[SB_FREEZE_LEVELS]; }; +struct mount; + struct super_block { struct list_head s_list; /* Keep this first */ dev_t s_dev; /* search index; _not_ kdev_t */ @@ -1358,7 +1360,7 @@ struct super_block { __u16 s_encoding_flags; #endif struct hlist_bl_head s_roots; /* alternate root dentries for NFS */ - struct list_head s_mounts; /* list of mounts; _not_ for fs use */ + struct mount *s_mounts; /* list of mounts; _not_ for fs use */ struct block_device *s_bdev; /* can go away once we use an accessor for @s_bdev_file */ struct file *s_bdev_file; struct backing_dev_info *s_bdi; -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
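For readers less familiar with the hlist-style layout described in the commit message above, here is a small userspace sketch of the same next/pprev arrangement. It is only an illustration under assumptions: struct owner and struct node are hypothetical stand-ins for struct super_block and struct mount, and node_add()/node_del() mirror mnt_add_instance()/mnt_del_instance() from the patch without any of the locking.

```c
#include <assert.h>
#include <stddef.h>

struct node;

/* Hypothetical stand-in for struct super_block. */
struct owner {
	struct node *first;	/* plays the role of ->s_mounts */
};

/* Hypothetical stand-in for struct mount. */
struct node {
	int id;
	struct node *next;	/* plays the role of ->mnt_next_for_sb */
	struct node **pprev;	/* plays the role of ->mnt_pprev_for_sb */
};

/* Insert at the head, the way mnt_add_instance() does in the patch. */
static void node_add(struct node *n, struct owner *o)
{
	struct node *first = o->first;

	if (first)
		first->pprev = &n->next;
	n->next = first;
	n->pprev = &o->first;
	o->first = n;
}

/* Unlink without knowing the owner: *pprev is whatever points at us. */
static void node_del(struct node *n)
{
	struct node **p = n->pprev;
	struct node *next = n->next;

	assert(p != NULL);
	if (next)
		next->pprev = p;
	*p = next;
}

/* Forward traversal only ever follows ->next. */
static int count_nodes(const struct owner *o)
{
	int n = 0;

	for (const struct node *p = o->first; p; p = p->next)
		n++;
	return n;
}
```

Since traversal never reads pprev, that field is only touched on insertion and removal, which is what leaves its low bit free for other uses.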
* [PATCH v3 61/65] setup_mnt(): primitive for connecting a mount to filesystem 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (67 preceding siblings ...) 2025-09-03 4:55 ` [PATCH v3 61/63] preparations to taking MNT_WRITE_HOLD out of ->mnt_flags Al Viro @ 2025-09-03 4:55 ` Al Viro 2025-09-03 4:55 ` [PATCH v3 62/65] preparations to taking MNT_WRITE_HOLD out of ->mnt_flags Al Viro ` (5 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:55 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds Take the identical logic in vfs_create_mount() and clone_mnt() into a new helper that takes an empty struct mount and attaches it to the given dentry (sub)tree. Should be called once in the lifetime of every mount, prior to making it visible in any data structures. After that point ->mnt_root and ->mnt_sb never change; ->mnt_root is a counting reference to the dentry and ->mnt_sb is an active reference to the superblock. Mount remains associated with that dentry tree all the way until the call of cleanup_mnt(), when the refcount eventually drops to zero.
Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/namespace.c | 34 +++++++++++++++++----------------- 1 file changed, 17 insertions(+), 17 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index b4d287c0af4a..b7c317c23f69 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -1196,6 +1196,21 @@ static void commit_tree(struct mount *mnt) touch_mnt_namespace(n); } +static void setup_mnt(struct mount *m, struct dentry *root) +{ + struct super_block *s = root->d_sb; + + atomic_inc(&s->s_active); + m->mnt.mnt_sb = s; + m->mnt.mnt_root = dget(root); + m->mnt_mountpoint = m->mnt.mnt_root; + m->mnt_parent = m; + + lock_mount_hash(); + list_add_tail(&m->mnt_instance, &s->s_mounts); + unlock_mount_hash(); +} + /** * vfs_create_mount - Create a mount for a configured superblock * @fc: The configuration context with the superblock attached @@ -1219,15 +1234,8 @@ struct vfsmount *vfs_create_mount(struct fs_context *fc) if (fc->sb_flags & SB_KERNMOUNT) mnt->mnt.mnt_flags = MNT_INTERNAL; - atomic_inc(&fc->root->d_sb->s_active); - mnt->mnt.mnt_sb = fc->root->d_sb; - mnt->mnt.mnt_root = dget(fc->root); - mnt->mnt_mountpoint = mnt->mnt.mnt_root; - mnt->mnt_parent = mnt; + setup_mnt(mnt, fc->root); - lock_mount_hash(); - list_add_tail(&mnt->mnt_instance, &mnt->mnt.mnt_sb->s_mounts); - unlock_mount_hash(); return &mnt->mnt; } EXPORT_SYMBOL(vfs_create_mount); @@ -1285,7 +1293,6 @@ EXPORT_SYMBOL_GPL(vfs_kern_mount); static struct mount *clone_mnt(struct mount *old, struct dentry *root, int flag) { - struct super_block *sb = old->mnt.mnt_sb; struct mount *mnt; int err; @@ -1310,16 +1317,9 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root, if (mnt->mnt_group_id) set_mnt_shared(mnt); - atomic_inc(&sb->s_active); mnt->mnt.mnt_idmap = mnt_idmap_get(mnt_idmap(&old->mnt)); - mnt->mnt.mnt_sb = sb; - mnt->mnt.mnt_root = dget(root); - mnt->mnt_mountpoint = mnt->mnt.mnt_root; - mnt->mnt_parent = mnt; - 
lock_mount_hash(); - list_add_tail(&mnt->mnt_instance, &sb->s_mounts); - unlock_mount_hash(); + setup_mnt(mnt, root); if (flag & CL_PRIVATE) // we are done with it return mnt; -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v3 62/65] preparations to taking MNT_WRITE_HOLD out of ->mnt_flags 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (68 preceding siblings ...) 2025-09-03 4:55 ` [PATCH v3 61/65] setup_mnt(): primitive for connecting a mount to filesystem Al Viro @ 2025-09-03 4:55 ` Al Viro 2025-09-03 4:55 ` [PATCH v3 62/63] struct mount: relocate MNT_WRITE_HOLD bit Al Viro ` (4 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:55 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds We have an unpleasant wart in accessibility rules for struct mount. There are per-superblock lists of mounts, used by sb_prepare_remount_readonly() to check if any of those is currently claimed for write access and to block further attempts to get write access on those until we are done. As soon as it is attached to a filesystem, mount becomes reachable via that list. Only sb_prepare_remount_readonly() traverses it and it only accesses a few members of struct mount. Unfortunately, ->mnt_flags is one of those and it is modified - MNT_WRITE_HOLD set and then cleared. It is done under mount_lock, so from the locking rules POV everything's fine. However, it has easily overlooked implications - once mount has been attached to a filesystem, it has to be treated as globally visible. In particular, initializing ->mnt_flags *must* be done either prior to that point or under mount_lock. All other members are still private at that point. Life gets simpler if we move that bit (and that's *all* that can get touched by access via this list) out of ->mnt_flags. It's not even hard to do - currently the list is implemented as list_head one, anchored in super_block->s_mounts and linked via mount->mnt_instance. 
As the first step, switch it to hlist-like open-coded structure - address of the first mount in the set is stored in ->s_mounts and ->mnt_instance replaced with ->mnt_next_for_sb and ->mnt_pprev_for_sb - the former either NULL or pointing to the next mount in set, the latter - address of either ->s_mounts or ->mnt_next_for_sb in the previous element of the set. In the next commit we'll steal the LSB of ->mnt_pprev_for_sb as replacement for MNT_WRITE_HOLD. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/mount.h | 4 +++- fs/namespace.c | 38 +++++++++++++++++++++++++++++--------- fs/super.c | 3 +-- include/linux/fs.h | 4 +++- 4 files changed, 36 insertions(+), 13 deletions(-) diff --git a/fs/mount.h b/fs/mount.h index 04d0eadc4c10..b208f69f69d7 100644 --- a/fs/mount.h +++ b/fs/mount.h @@ -64,7 +64,9 @@ struct mount { #endif struct list_head mnt_mounts; /* list of children, anchored here */ struct list_head mnt_child; /* and going through their mnt_child */ - struct list_head mnt_instance; /* mount instance on sb->s_mounts */ + struct mount *mnt_next_for_sb; /* the next two fields are hlist_node, */ + struct mount * __aligned(1) *mnt_pprev_for_sb; + /* except that LSB of pprev will be stolen */ const char *mnt_devname; /* Name of device e.g. 
/dev/dsk/hda1 */ struct list_head mnt_list; struct list_head mnt_expire; /* link in fs-specific expiry list */ diff --git a/fs/namespace.c b/fs/namespace.c index b7c317c23f69..eb1b557e9f6d 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -730,6 +730,27 @@ static inline void mnt_unhold_writers(struct mount *mnt) mnt->mnt.mnt_flags &= ~MNT_WRITE_HOLD; } +static inline void mnt_del_instance(struct mount *m) +{ + struct mount **p = m->mnt_pprev_for_sb; + struct mount *next = m->mnt_next_for_sb; + + if (next) + next->mnt_pprev_for_sb = p; + *p = next; +} + +static inline void mnt_add_instance(struct mount *m, struct super_block *s) +{ + struct mount *first = s->s_mounts; + + if (first) + first->mnt_pprev_for_sb = &m->mnt_next_for_sb; + m->mnt_next_for_sb = first; + m->mnt_pprev_for_sb = &s->s_mounts; + s->s_mounts = m; +} + static int mnt_make_readonly(struct mount *mnt) { int ret; @@ -743,7 +764,6 @@ static int mnt_make_readonly(struct mount *mnt) int sb_prepare_remount_readonly(struct super_block *sb) { - struct mount *mnt; int err = 0; /* Racy optimization. 
Recheck the counter under MNT_WRITE_HOLD */ @@ -751,9 +771,9 @@ int sb_prepare_remount_readonly(struct super_block *sb) return -EBUSY; lock_mount_hash(); - list_for_each_entry(mnt, &sb->s_mounts, mnt_instance) { - if (!(mnt->mnt.mnt_flags & MNT_READONLY)) { - err = mnt_hold_writers(mnt); + for (struct mount *m = sb->s_mounts; m; m = m->mnt_next_for_sb) { + if (!(m->mnt.mnt_flags & MNT_READONLY)) { + err = mnt_hold_writers(m); if (err) break; } @@ -763,9 +783,9 @@ int sb_prepare_remount_readonly(struct super_block *sb) if (!err) sb_start_ro_state_change(sb); - list_for_each_entry(mnt, &sb->s_mounts, mnt_instance) { - if (mnt->mnt.mnt_flags & MNT_WRITE_HOLD) - mnt->mnt.mnt_flags &= ~MNT_WRITE_HOLD; + for (struct mount *m = sb->s_mounts; m; m = m->mnt_next_for_sb) { + if (m->mnt.mnt_flags & MNT_WRITE_HOLD) + m->mnt.mnt_flags &= ~MNT_WRITE_HOLD; } unlock_mount_hash(); @@ -1207,7 +1227,7 @@ static void setup_mnt(struct mount *m, struct dentry *root) m->mnt_parent = m; lock_mount_hash(); - list_add_tail(&m->mnt_instance, &s->s_mounts); + mnt_add_instance(m, s); unlock_mount_hash(); } @@ -1425,7 +1445,7 @@ static void mntput_no_expire(struct mount *mnt) mnt->mnt.mnt_flags |= MNT_DOOMED; rcu_read_unlock(); - list_del(&mnt->mnt_instance); + mnt_del_instance(mnt); if (unlikely(!list_empty(&mnt->mnt_expire))) list_del(&mnt->mnt_expire); diff --git a/fs/super.c b/fs/super.c index 7f876f32343a..3b0f49e1b817 100644 --- a/fs/super.c +++ b/fs/super.c @@ -323,7 +323,6 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags, if (!s) return NULL; - INIT_LIST_HEAD(&s->s_mounts); s->s_user_ns = get_user_ns(user_ns); init_rwsem(&s->s_umount); lockdep_set_class(&s->s_umount, &type->s_umount_key); @@ -408,7 +407,7 @@ static void __put_super(struct super_block *s) list_del_init(&s->s_list); WARN_ON(s->s_dentry_lru.node); WARN_ON(s->s_inode_lru.node); - WARN_ON(!list_empty(&s->s_mounts)); + WARN_ON(s->s_mounts); call_rcu(&s->rcu, destroy_super_rcu); } } diff --git 
a/include/linux/fs.h b/include/linux/fs.h index d7ab4f96d705..0e9c7f1460dc 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1324,6 +1324,8 @@ struct sb_writers { struct percpu_rw_semaphore rw_sem[SB_FREEZE_LEVELS]; }; +struct mount; + struct super_block { struct list_head s_list; /* Keep this first */ dev_t s_dev; /* search index; _not_ kdev_t */ @@ -1358,7 +1360,7 @@ struct super_block { __u16 s_encoding_flags; #endif struct hlist_bl_head s_roots; /* alternate root dentries for NFS */ - struct list_head s_mounts; /* list of mounts; _not_ for fs use */ + struct mount *s_mounts; /* list of mounts; _not_ for fs use */ struct block_device *s_bdev; /* can go away once we use an accessor for @s_bdev_file */ struct file *s_bdev_file; struct backing_dev_info *s_bdi; -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v3 62/63] struct mount: relocate MNT_WRITE_HOLD bit 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (69 preceding siblings ...) 2025-09-03 4:55 ` [PATCH v3 62/65] preparations to taking MNT_WRITE_HOLD out of ->mnt_flags Al Viro @ 2025-09-03 4:55 ` Al Viro 2025-09-03 4:55 ` [PATCH v3 63/65] " Al Viro ` (3 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:55 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds ... from ->mnt_flags to LSB of ->mnt_pprev_for_sb. This is safe - we always set and clear it within the same mount_lock scope, so we won't interfere with list operations: traversals are always forward, so they don't even look at ->mnt_pprev_for_sb, and both insertions and removals are in mount_lock scopes of their own, so that bit will be clear in *all* mount instances during those. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/mount.h | 25 ++++++++++++++++++++++++- fs/namespace.c | 34 +++++++++++++++++----------------- include/linux/mount.h | 3 +-- 3 files changed, 42 insertions(+), 20 deletions(-) diff --git a/fs/mount.h b/fs/mount.h index b208f69f69d7..40cf16544317 100644 --- a/fs/mount.h +++ b/fs/mount.h @@ -66,7 +66,8 @@ struct mount { struct list_head mnt_child; /* and going through their mnt_child */ struct mount *mnt_next_for_sb; /* the next two fields are hlist_node, */ struct mount * __aligned(1) *mnt_pprev_for_sb; - /* except that LSB of pprev is stolen */ + /* except that LSB of pprev is stolen */ +#define WRITE_HOLD 1 /* ... for use by mnt_hold_writers() */ const char *mnt_devname; /* Name of device e.g.
/dev/dsk/hda1 */ struct list_head mnt_list; struct list_head mnt_expire; /* link in fs-specific expiry list */ @@ -244,4 +245,26 @@ static inline struct mount *topmost_overmount(struct mount *m) return m; } +static inline bool __test_write_hold(struct mount * __aligned(1) *val) +{ + return (unsigned long)val & WRITE_HOLD; +} + +static inline bool test_write_hold(const struct mount *m) +{ + return __test_write_hold(m->mnt_pprev_for_sb); +} + +static inline void set_write_hold(struct mount *m) +{ + m->mnt_pprev_for_sb = (void *)((unsigned long)m->mnt_pprev_for_sb + | WRITE_HOLD); +} + +static inline void clear_write_hold(struct mount *m) +{ + m->mnt_pprev_for_sb = (void *)((unsigned long)m->mnt_pprev_for_sb + & ~WRITE_HOLD); +} + struct mnt_namespace *mnt_ns_from_dentry(struct dentry *dentry); diff --git a/fs/namespace.c b/fs/namespace.c index 06be5b65b559..8e6b6523d3e8 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -509,20 +509,20 @@ int mnt_get_write_access(struct vfsmount *m) mnt_inc_writers(mnt); /* * The store to mnt_inc_writers must be visible before we pass - * MNT_WRITE_HOLD loop below, so that the slowpath can see our - * incremented count after it has set MNT_WRITE_HOLD. + * WRITE_HOLD loop below, so that the slowpath can see our + * incremented count after it has set WRITE_HOLD. */ smp_mb(); might_lock(&mount_lock.lock); - while (READ_ONCE(mnt->mnt.mnt_flags) & MNT_WRITE_HOLD) { + while (__test_write_hold(READ_ONCE(mnt->mnt_pprev_for_sb))) { if (!IS_ENABLED(CONFIG_PREEMPT_RT)) { cpu_relax(); } else { /* * This prevents priority inversion, if the task - * setting MNT_WRITE_HOLD got preempted on a remote + * setting WRITE_HOLD got preempted on a remote * CPU, and it prevents life lock if the task setting - * MNT_WRITE_HOLD has a lower priority and is bound to + * WRITE_HOLD has a lower priority and is bound to * the same CPU as the task that is spinning here. 
*/ preempt_enable(); @@ -533,7 +533,7 @@ int mnt_get_write_access(struct vfsmount *m) } /* * The barrier pairs with the barrier sb_start_ro_state_change() making - * sure that if we see MNT_WRITE_HOLD cleared, we will also see + * sure that if we see WRITE_HOLD cleared, we will also see * s_readonly_remount set (or even SB_RDONLY / MNT_READONLY flags) in * mnt_is_readonly() and bail in case we are racing with remount * read-only. @@ -672,15 +672,15 @@ EXPORT_SYMBOL(mnt_drop_write_file); * @mnt. * * Context: This function expects lock_mount_hash() to be held serializing - * setting MNT_WRITE_HOLD. + * setting WRITE_HOLD. * Return: On success 0 is returned. * On error, -EBUSY is returned. */ static inline int mnt_hold_writers(struct mount *mnt) { - mnt->mnt.mnt_flags |= MNT_WRITE_HOLD; + set_write_hold(mnt); /* - * After storing MNT_WRITE_HOLD, we'll read the counters. This store + * After storing WRITE_HOLD, we'll read the counters. This store * should be visible before we do. */ smp_mb(); @@ -696,9 +696,9 @@ static inline int mnt_hold_writers(struct mount *mnt) * sum up each counter, if we read a counter before it is incremented, * but then read another CPU's count which it has been subsequently * decremented from -- we would see more decrements than we should. - * MNT_WRITE_HOLD protects against this scenario, because + * WRITE_HOLD protects against this scenario, because * mnt_want_write first increments count, then smp_mb, then spins on - * MNT_WRITE_HOLD, so it can't be decremented by another CPU while + * WRITE_HOLD, so it can't be decremented by another CPU while * we're counting up here. 
*/ if (mnt_get_writers(mnt) > 0) @@ -720,14 +720,14 @@ static inline int mnt_hold_writers(struct mount *mnt) */ static inline void mnt_unhold_writers(struct mount *mnt) { - if (!(mnt->mnt_flags & MNT_WRITE_HOLD)) + if (!test_write_hold(mnt)) return; /* - * MNT_READONLY must become visible before ~MNT_WRITE_HOLD, so writers + * MNT_READONLY must become visible before ~WRITE_HOLD, so writers * that become unheld will see MNT_READONLY. */ smp_wmb(); - mnt->mnt.mnt_flags &= ~MNT_WRITE_HOLD; + clear_write_hold(mnt); } static inline void mnt_del_instance(struct mount *m) @@ -766,7 +766,7 @@ int sb_prepare_remount_readonly(struct super_block *sb) { int err = 0; - /* Racy optimization. Recheck the counter under MNT_WRITE_HOLD */ + /* Racy optimization. Recheck the counter under WRITE_HOLD */ if (atomic_long_read(&sb->s_remove_count)) return -EBUSY; @@ -784,8 +784,8 @@ int sb_prepare_remount_readonly(struct super_block *sb) if (!err) sb_start_ro_state_change(sb); for (struct mount *m = sb->s_mounts; m; m = m->mnt_next_for_sb) { - if (m->mnt.mnt_flags & MNT_WRITE_HOLD) - m->mnt.mnt_flags &= ~MNT_WRITE_HOLD; + if (test_write_hold(m)) + clear_write_hold(m); } unlock_mount_hash(); diff --git a/include/linux/mount.h b/include/linux/mount.h index 18e4b97f8a98..85e97b9340ff 100644 --- a/include/linux/mount.h +++ b/include/linux/mount.h @@ -33,7 +33,6 @@ enum mount_flags { MNT_NOSYMFOLLOW = 0x80, MNT_SHRINKABLE = 0x100, - MNT_WRITE_HOLD = 0x200, MNT_INTERNAL = 0x4000, @@ -52,7 +51,7 @@ enum mount_flags { | MNT_READONLY | MNT_NOSYMFOLLOW, MNT_ATIME_MASK = MNT_NOATIME | MNT_NODIRATIME | MNT_RELATIME, - MNT_INTERNAL_FLAGS = MNT_WRITE_HOLD | MNT_INTERNAL | MNT_DOOMED | + MNT_INTERNAL_FLAGS = MNT_INTERNAL | MNT_DOOMED | MNT_SYNC_UMOUNT | MNT_LOCKED }; -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
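The trick of stealing the low bit of the back pointer can be sketched in userspace as follows. This is an illustration under stated assumptions, not the patch's code: struct item, HOLD_BIT and the helper names are hypothetical, and unlike the patch (which keeps a pointer type marked __aligned(1)) the sketch stores the value as a uintptr_t, so that portable C never forms a misaligned pointer. It relies on pointer-to-pointer addresses being at least 2-byte aligned, which leaves bit 0 free.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HOLD_BIT ((uintptr_t)1)	/* hypothetical; the patch calls it WRITE_HOLD */

/* Hypothetical stand-in for struct mount. */
struct item {
	struct item *next;
	uintptr_t pprev_bits;	/* address of the previous link, flag in bit 0 */
};

/* Store a back pointer; the address must have bit 0 clear. */
static void set_pprev(struct item *it, struct item **p)
{
	assert(((uintptr_t)p & HOLD_BIT) == 0);
	it->pprev_bits = (uintptr_t)p;
}

/* Recover the real back pointer with the flag masked off. */
static struct item **get_pprev(const struct item *it)
{
	return (struct item **)(it->pprev_bits & ~HOLD_BIT);
}

static bool test_hold(const struct item *it)
{
	return (it->pprev_bits & HOLD_BIT) != 0;
}

static void set_hold(struct item *it)
{
	it->pprev_bits |= HOLD_BIT;
}

static void clear_hold(struct item *it)
{
	it->pprev_bits &= ~HOLD_BIT;
}
```

In the patch no masking helper like get_pprev() is needed, because the bit is always set and cleared within one mount_lock scope and list insertion/removal happen in scopes of their own, so they always see a clean pointer.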
* [PATCH v3 63/65] struct mount: relocate MNT_WRITE_HOLD bit 2025-09-03 4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro ` (70 preceding siblings ...) 2025-09-03 4:55 ` [PATCH v3 62/63] struct mount: relocate MNT_WRITE_HOLD bit Al Viro @ 2025-09-03 4:55 ` Al Viro 2025-09-03 4:55 ` [PATCH v3 63/63] WRITE_HOLD machinery: no need for to bump mount_lock seqcount Al Viro ` (2 subsequent siblings) 74 siblings, 0 replies; 321+ messages in thread From: Al Viro @ 2025-09-03 4:55 UTC (permalink / raw) To: linux-fsdevel; +Cc: brauner, jack, torvalds ... from ->mnt_flags to LSB of ->mnt_pprev_for_sb. This is safe - we always set and clear it within the same mount_lock scope, so we won't interfere with list operations: traversals are always forward, so they don't even look at ->mnt_pprev_for_sb, and both insertions and removals are in mount_lock scopes of their own, so that bit will be clear in *all* mount instances during those. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- fs/mount.h | 25 ++++++++++++++++++++++++- fs/namespace.c | 34 +++++++++++++++++----------------- include/linux/mount.h | 3 +-- 3 files changed, 42 insertions(+), 20 deletions(-) diff --git a/fs/mount.h b/fs/mount.h index b208f69f69d7..40cf16544317 100644 --- a/fs/mount.h +++ b/fs/mount.h @@ -66,7 +66,8 @@ struct mount { struct list_head mnt_child; /* and going through their mnt_child */ struct mount *mnt_next_for_sb; /* the next two fields are hlist_node, */ struct mount * __aligned(1) *mnt_pprev_for_sb; - /* except that LSB of pprev will be stolen */ + /* except that LSB of pprev is stolen */ +#define WRITE_HOLD 1 /* ... for use by mnt_hold_writers() */ const char *mnt_devname; /* Name of device e.g.
/dev/dsk/hda1 */ struct list_head mnt_list; struct list_head mnt_expire; /* link in fs-specific expiry list */ @@ -244,4 +245,26 @@ static inline struct mount *topmost_overmount(struct mount *m) return m; } +static inline bool __test_write_hold(struct mount * __aligned(1) *val) +{ + return (unsigned long)val & WRITE_HOLD; +} + +static inline bool test_write_hold(const struct mount *m) +{ + return __test_write_hold(m->mnt_pprev_for_sb); +} + +static inline void set_write_hold(struct mount *m) +{ + m->mnt_pprev_for_sb = (void *)((unsigned long)m->mnt_pprev_for_sb + | WRITE_HOLD); +} + +static inline void clear_write_hold(struct mount *m) +{ + m->mnt_pprev_for_sb = (void *)((unsigned long)m->mnt_pprev_for_sb + & ~WRITE_HOLD); +} + struct mnt_namespace *mnt_ns_from_dentry(struct dentry *dentry); diff --git a/fs/namespace.c b/fs/namespace.c index eb1b557e9f6d..64cbd8e8a1d3 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -509,20 +509,20 @@ int mnt_get_write_access(struct vfsmount *m) mnt_inc_writers(mnt); /* * The store to mnt_inc_writers must be visible before we pass - * MNT_WRITE_HOLD loop below, so that the slowpath can see our - * incremented count after it has set MNT_WRITE_HOLD. + * WRITE_HOLD loop below, so that the slowpath can see our + * incremented count after it has set WRITE_HOLD. */ smp_mb(); might_lock(&mount_lock.lock); - while (READ_ONCE(mnt->mnt.mnt_flags) & MNT_WRITE_HOLD) { + while (__test_write_hold(READ_ONCE(mnt->mnt_pprev_for_sb))) { if (!IS_ENABLED(CONFIG_PREEMPT_RT)) { cpu_relax(); } else { /* * This prevents priority inversion, if the task - * setting MNT_WRITE_HOLD got preempted on a remote + * setting WRITE_HOLD got preempted on a remote * CPU, and it prevents life lock if the task setting - * MNT_WRITE_HOLD has a lower priority and is bound to + * WRITE_HOLD has a lower priority and is bound to * the same CPU as the task that is spinning here. 
*/ preempt_enable(); @@ -533,7 +533,7 @@ int mnt_get_write_access(struct vfsmount *m) } /* * The barrier pairs with the barrier sb_start_ro_state_change() making - * sure that if we see MNT_WRITE_HOLD cleared, we will also see + * sure that if we see WRITE_HOLD cleared, we will also see * s_readonly_remount set (or even SB_RDONLY / MNT_READONLY flags) in * mnt_is_readonly() and bail in case we are racing with remount * read-only. @@ -672,15 +672,15 @@ EXPORT_SYMBOL(mnt_drop_write_file); * @mnt. * * Context: This function expects lock_mount_hash() to be held serializing - * setting MNT_WRITE_HOLD. + * setting WRITE_HOLD. * Return: On success 0 is returned. * On error, -EBUSY is returned. */ static inline int mnt_hold_writers(struct mount *mnt) { - mnt->mnt.mnt_flags |= MNT_WRITE_HOLD; + set_write_hold(mnt); /* - * After storing MNT_WRITE_HOLD, we'll read the counters. This store + * After storing WRITE_HOLD, we'll read the counters. This store * should be visible before we do. */ smp_mb(); @@ -696,9 +696,9 @@ static inline int mnt_hold_writers(struct mount *mnt) * sum up each counter, if we read a counter before it is incremented, * but then read another CPU's count which it has been subsequently * decremented from -- we would see more decrements than we should. - * MNT_WRITE_HOLD protects against this scenario, because + * WRITE_HOLD protects against this scenario, because * mnt_want_write first increments count, then smp_mb, then spins on - * MNT_WRITE_HOLD, so it can't be decremented by another CPU while + * WRITE_HOLD, so it can't be decremented by another CPU while * we're counting up here. 
*/ if (mnt_get_writers(mnt) > 0) @@ -720,14 +720,14 @@ static inline int mnt_hold_writers(struct mount *mnt) */ static inline void mnt_unhold_writers(struct mount *mnt) { - if (!(mnt->mnt_flags & MNT_WRITE_HOLD)) + if (!test_write_hold(mnt)) return; /* - * MNT_READONLY must become visible before ~MNT_WRITE_HOLD, so writers + * MNT_READONLY must become visible before ~WRITE_HOLD, so writers * that become unheld will see MNT_READONLY. */ smp_wmb(); - mnt->mnt.mnt_flags &= ~MNT_WRITE_HOLD; + clear_write_hold(mnt); } static inline void mnt_del_instance(struct mount *m) @@ -766,7 +766,7 @@ int sb_prepare_remount_readonly(struct super_block *sb) { int err = 0; - /* Racy optimization. Recheck the counter under MNT_WRITE_HOLD */ + /* Racy optimization. Recheck the counter under WRITE_HOLD */ if (atomic_long_read(&sb->s_remove_count)) return -EBUSY; @@ -784,8 +784,8 @@ int sb_prepare_remount_readonly(struct super_block *sb) if (!err) sb_start_ro_state_change(sb); for (struct mount *m = sb->s_mounts; m; m = m->mnt_next_for_sb) { - if (m->mnt.mnt_flags & MNT_WRITE_HOLD) - m->mnt.mnt_flags &= ~MNT_WRITE_HOLD; + if (test_write_hold(m)) + clear_write_hold(m); } unlock_mount_hash(); diff --git a/include/linux/mount.h b/include/linux/mount.h index 18e4b97f8a98..85e97b9340ff 100644 --- a/include/linux/mount.h +++ b/include/linux/mount.h @@ -33,7 +33,6 @@ enum mount_flags { MNT_NOSYMFOLLOW = 0x80, MNT_SHRINKABLE = 0x100, - MNT_WRITE_HOLD = 0x200, MNT_INTERNAL = 0x4000, @@ -52,7 +51,7 @@ enum mount_flags { | MNT_READONLY | MNT_NOSYMFOLLOW, MNT_ATIME_MASK = MNT_NOATIME | MNT_NODIRATIME | MNT_RELATIME, - MNT_INTERNAL_FLAGS = MNT_WRITE_HOLD | MNT_INTERNAL | MNT_DOOMED | + MNT_INTERNAL_FLAGS = MNT_INTERNAL | MNT_DOOMED | MNT_SYNC_UMOUNT | MNT_LOCKED }; -- 2.47.2 ^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v3 63/63] WRITE_HOLD machinery: no need for to bump mount_lock seqcount
  2025-09-03  4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro
                     ` (71 preceding siblings ...)
  2025-09-03  4:55   ` [PATCH v3 63/65] " Al Viro
@ 2025-09-03  4:55   ` Al Viro
  2025-09-03  4:55   ` [PATCH v3 64/65] " Al Viro
  2025-09-03  4:55   ` [PATCH v3 65/65] constify {__,}mnt_is_readonly() Al Viro
  74 siblings, 0 replies; 321+ messages in thread
From: Al Viro @ 2025-09-03  4:55 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: brauner, jack, torvalds

... neither for insertion into the list of instances, nor for
mnt_{un,}hold_writers(), nor for mnt_get_write_access() deciding
to be nice to RT during a busy-wait loop - all of that only needs
the spinlock side of mount_lock.

IOW, it's mount_locked_reader, not mount_writer.

Clarify the comment re locking rules for mnt_unhold_writers() - it's
not just that mount_lock needs to be held when calling that, it must
have been held all along since the matching mnt_hold_writers().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/namespace.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 8e6b6523d3e8..8f0900857822 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -526,8 +526,8 @@ int mnt_get_write_access(struct vfsmount *m)
 			 * the same CPU as the task that is spinning here.
 			 */
 			preempt_enable();
-			lock_mount_hash();
-			unlock_mount_hash();
+			read_seqlock_excl(&mount_lock);
+			read_sequnlock_excl(&mount_lock);
 			preempt_disable();
 		}
 	}
@@ -671,7 +671,7 @@ EXPORT_SYMBOL(mnt_drop_write_file);
  * a call to mnt_unhold_writers() in order to stop preventing write access to
  * @mnt.
  *
- * Context: This function expects lock_mount_hash() to be held serializing
+ * Context: This function expects to be in mount_locked_reader scope serializing
  *          setting WRITE_HOLD.
  * Return: On success 0 is returned.
  *	   On error, -EBUSY is returned.
@@ -716,7 +716,8 @@ static inline int mnt_hold_writers(struct mount *mnt)
  *
  * This function can only be called after a call to mnt_hold_writers().
  *
- * Context: This function expects lock_mount_hash() to be held.
+ * Context: This function expects to be in the same mount_locked_reader scope
+ * as the matching mnt_hold_writers().
  */
 static inline void mnt_unhold_writers(struct mount *mnt)
 {
@@ -770,7 +771,8 @@ int sb_prepare_remount_readonly(struct super_block *sb)
 	if (atomic_long_read(&sb->s_remove_count))
 		return -EBUSY;

-	lock_mount_hash();
+	guard(mount_locked_reader)();
+
 	for (struct mount *m = sb->s_mounts; m; m = m->mnt_next_for_sb) {
 		if (!(m->mnt.mnt_flags & MNT_READONLY)) {
 			err = mnt_hold_writers(m);
@@ -787,7 +789,6 @@ int sb_prepare_remount_readonly(struct super_block *sb)
 		if (test_write_hold(m))
 			clear_write_hold(m);
 	}
-	unlock_mount_hash();

 	return err;
 }
@@ -1226,9 +1227,8 @@ static void setup_mnt(struct mount *m, struct dentry *root)
 	m->mnt_mountpoint = m->mnt.mnt_root;
 	m->mnt_parent = m;

-	lock_mount_hash();
+	guard(mount_locked_reader)();
 	mnt_add_instance(m, s);
-	unlock_mount_hash();
 }

 /**
-- 
2.47.2

^ permalink raw reply related [flat|nested] 321+ messages in thread
* [PATCH v3 64/65] WRITE_HOLD machinery: no need for to bump mount_lock seqcount
  2025-09-03  4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro
                     ` (72 preceding siblings ...)
  2025-09-03  4:55   ` [PATCH v3 63/63] WRITE_HOLD machinery: no need for to bump mount_lock seqcount Al Viro
@ 2025-09-03  4:55   ` Al Viro
  2025-09-03  4:55   ` [PATCH v3 65/65] constify {__,}mnt_is_readonly() Al Viro
  74 siblings, 0 replies; 321+ messages in thread
From: Al Viro @ 2025-09-03  4:55 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: brauner, jack, torvalds

... neither for insertion into the list of instances, nor for
mnt_{un,}hold_writers(), nor for mnt_get_write_access() deciding
to be nice to RT during a busy-wait loop - all of that only needs
the spinlock side of mount_lock.

IOW, it's mount_locked_reader, not mount_writer.

Clarify the comment re locking rules for mnt_unhold_writers() - it's
not just that mount_lock needs to be held when calling that, it must
have been held all along since the matching mnt_hold_writers().

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/namespace.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 64cbd8e8a1d3..9eef4ca6d36a 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -526,8 +526,8 @@ int mnt_get_write_access(struct vfsmount *m)
 			 * the same CPU as the task that is spinning here.
 			 */
 			preempt_enable();
-			lock_mount_hash();
-			unlock_mount_hash();
+			read_seqlock_excl(&mount_lock);
+			read_sequnlock_excl(&mount_lock);
 			preempt_disable();
 		}
 	}
@@ -671,7 +671,7 @@ EXPORT_SYMBOL(mnt_drop_write_file);
  * a call to mnt_unhold_writers() in order to stop preventing write access to
  * @mnt.
  *
- * Context: This function expects lock_mount_hash() to be held serializing
+ * Context: This function expects to be in mount_locked_reader scope serializing
  *          setting WRITE_HOLD.
  * Return: On success 0 is returned.
  *	   On error, -EBUSY is returned.
@@ -716,7 +716,8 @@ static inline int mnt_hold_writers(struct mount *mnt)
  *
  * This function can only be called after a call to mnt_hold_writers().
  *
- * Context: This function expects lock_mount_hash() to be held.
+ * Context: This function expects to be in the same mount_locked_reader scope
+ * as the matching mnt_hold_writers().
  */
 static inline void mnt_unhold_writers(struct mount *mnt)
 {
@@ -770,7 +771,8 @@ int sb_prepare_remount_readonly(struct super_block *sb)
 	if (atomic_long_read(&sb->s_remove_count))
 		return -EBUSY;

-	lock_mount_hash();
+	guard(mount_locked_reader)();
+
 	for (struct mount *m = sb->s_mounts; m; m = m->mnt_next_for_sb) {
 		if (!(m->mnt.mnt_flags & MNT_READONLY)) {
 			err = mnt_hold_writers(m);
@@ -787,7 +789,6 @@ int sb_prepare_remount_readonly(struct super_block *sb)
 		if (test_write_hold(m))
 			clear_write_hold(m);
 	}
-	unlock_mount_hash();

 	return err;
 }
@@ -1226,9 +1227,8 @@ static void setup_mnt(struct mount *m, struct dentry *root)
 	m->mnt_mountpoint = m->mnt.mnt_root;
 	m->mnt_parent = m;

-	lock_mount_hash();
+	guard(mount_locked_reader)();
 	mnt_add_instance(m, s);
-	unlock_mount_hash();
 }

 /**
-- 
2.47.2

^ permalink raw reply related [flat|nested] 321+ messages in thread
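The distinction the commit message relies on ("the spinlock side of mount_lock" versus bumping the seqcount) can be illustrated with a toy, single-threaded model of a seqlock. Everything below is a hypothetical userspace stand-in: the demo_* names are invented, the "lock" is a plain flag rather than a spinlock, and the reader side is simplified compared to the kernel's seqlock API. It only shows that an exclusive-reader section leaves the sequence alone, while a writer section forces lockless readers to retry.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: a sequence counter plus an exclusion flag (no real locking). */
struct demo_seqlock {
	unsigned int seq;	/* even: stable, odd: write in progress */
	bool locked;		/* stands in for the spinlock half */
};

/* Writer side: takes the lock AND bumps the sequence. */
static void demo_write_seqlock(struct demo_seqlock *sl)
{
	assert(!sl->locked);
	sl->locked = true;
	sl->seq++;		/* now odd: lockless readers must retry */
}

static void demo_write_sequnlock(struct demo_seqlock *sl)
{
	sl->seq++;		/* even again, but different from before */
	sl->locked = false;
}

/* Exclusive reader: takes the lock only, sequence untouched. */
static void demo_read_seqlock_excl(struct demo_seqlock *sl)
{
	assert(!sl->locked);
	sl->locked = true;
}

static void demo_read_sequnlock_excl(struct demo_seqlock *sl)
{
	sl->locked = false;
}

/* Lockless reader: snapshot the sequence, later check whether to retry. */
static unsigned int demo_read_seqbegin(const struct demo_seqlock *sl)
{
	return sl->seq;
}

static bool demo_read_seqretry(const struct demo_seqlock *sl, unsigned int start)
{
	return (start & 1) || sl->seq != start;
}
```

In this model a lockless walker that sampled the sequence before a demo_read_seqlock_excl()/demo_read_sequnlock_excl() pair sees no change afterwards, whereas a demo_write_seqlock()/demo_write_sequnlock() pair invalidates its snapshot. That is the cost the patch avoids for operations that only need mutual exclusion.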
* [PATCH v3 65/65] constify {__,}mnt_is_readonly()
  2025-09-03  4:54 ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro
                     ` (73 preceding siblings ...)
  2025-09-03  4:55   ` [PATCH v3 64/65] " Al Viro
@ 2025-09-03  4:55   ` Al Viro
  74 siblings, 0 replies; 321+ messages in thread
From: Al Viro @ 2025-09-03  4:55 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: brauner, jack, torvalds

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/namespace.c        | 4 ++--
 include/linux/mount.h | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 9eef4ca6d36a..c88fe350b550 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -428,7 +428,7 @@ static struct mount *alloc_vfsmnt(const char *name)
  * mnt_want/drop_write() will _keep_ the filesystem
  * r/w.
  */
-bool __mnt_is_readonly(struct vfsmount *mnt)
+bool __mnt_is_readonly(const struct vfsmount *mnt)
 {
 	return (mnt->mnt_flags & MNT_READONLY) || sb_rdonly(mnt->mnt_sb);
 }
@@ -468,7 +468,7 @@ static unsigned int mnt_get_writers(struct mount *mnt)
 #endif
 }

-static int mnt_is_readonly(struct vfsmount *mnt)
+static int mnt_is_readonly(const struct vfsmount *mnt)
 {
 	if (READ_ONCE(mnt->mnt_sb->s_readonly_remount))
 		return 1;
diff --git a/include/linux/mount.h b/include/linux/mount.h
index 85e97b9340ff..acfe7ef86a1b 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -76,7 +76,7 @@ extern void mntput(struct vfsmount *mnt);
 extern struct vfsmount *mntget(struct vfsmount *mnt);
 extern void mnt_make_shortterm(struct vfsmount *mnt);
 extern struct vfsmount *mnt_clone_internal(const struct path *path);
-extern bool __mnt_is_readonly(struct vfsmount *mnt);
+extern bool __mnt_is_readonly(const struct vfsmount *mnt);
 extern bool mnt_may_suid(struct vfsmount *mnt);

 extern struct vfsmount *clone_private_mount(const struct path *path);
-- 
2.47.2

^ permalink raw reply related [flat|nested] 321+ messages in thread
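The patch carries no description, so here is a small sketch of what constifying a read-only predicate buys its callers: code that itself only holds a const pointer can call it without casting the qualifier away. All demo_* types, names and the flag value are hypothetical stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_READONLY 0x40u	/* hypothetical flag value for this sketch */

struct demo_sb {
	bool rdonly;
};

struct demo_vfsmount {
	unsigned int mnt_flags;
	const struct demo_sb *mnt_sb;
};

/* The predicate only reads, so it can take a pointer to const. */
static bool demo_mnt_is_readonly(const struct demo_vfsmount *mnt)
{
	return (mnt->mnt_flags & DEMO_READONLY) || mnt->mnt_sb->rdonly;
}

/* A caller that was handed const data can use the predicate directly. */
static int demo_count_readonly(const struct demo_vfsmount *v, int n)
{
	int count = 0;

	assert(n >= 0);
	for (int i = 0; i < n; i++)
		count += demo_mnt_is_readonly(&v[i]);
	return count;
}
```

With a non-const parameter, demo_count_readonly() would either have to drop its own const qualifier or cast, and either choice spreads upward through its callers.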
* Re: [PATCHES v3][RFC][CFT] mount-related stuff
  2025-09-03  4:54 ` [PATCHES v3][RFC][CFT] mount-related stuff Al Viro
  2025-09-03  4:54   ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro
@ 2025-09-03  5:08   ` Al Viro
  2025-09-03 14:47   ` Linus Torvalds
  2 siblings, 0 replies; 321+ messages in thread
From: Al Viro @ 2025-09-03  5:08 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Linus Torvalds, Christian Brauner, Jan Kara

On Wed, Sep 03, 2025 at 05:54:32AM +0100, Al Viro wrote:
> Branch force-pushed into
> git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git #work.mount
> (also visible as #v3.mount, #v[12].mount being the previous versions)
> Individual patches in followups.
>
> If nobody objects, this goes into #for-next.

PS: survives LTP, xfstests and mount-related selftests.

FWIW, I've spent the weekend trying to figure out what's going on with
generic/475. Turns out that it was not a regression - it goes back at
least to 6.12 and it's triggered by PREEMPT vs. PREEMPT_VOLUNTARY in
config. The former gives several kinds of failures, with total frequency
about 8%; the latter apparently works - if any similar failures happen,
the frequency is at least an order of magnitude lower.

One useful thing I've got out of that is a bunch of helpers for doing
bisect for configs - semi-manual decomposing the difference between two
configs into a series of small changes, allowing to do bisection on that.
Unfortunately, the change it has converged to (and repeating it alone on
the original config reproduces the effect) is not particularly useful -
some race gets triggered by a config change that affects timings all over
the place ;-/

^ permalink raw reply [flat|nested] 321+ messages in thread
* Re: [PATCHES v3][RFC][CFT] mount-related stuff
  2025-09-03  4:54 ` [PATCHES v3][RFC][CFT] mount-related stuff Al Viro
  2025-09-03  4:54   ` [PATCH v3 01/65] fs/namespace.c: fix the namespace_sem guard mess Al Viro
  2025-09-03  5:08   ` [PATCHES v3][RFC][CFT] mount-related stuff Al Viro
@ 2025-09-03 14:47   ` Linus Torvalds
  2025-09-03 18:14     ` Al Viro
  2 siblings, 1 reply; 321+ messages in thread
From: Linus Torvalds @ 2025-09-03 14:47 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, Christian Brauner, Jan Kara

On Tue, 2 Sept 2025 at 21:54, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> If nobody objects, this goes into #for-next.

Looks all sane to me.

What was the issue with generic/475? I have missed that context..

             Linus

^ permalink raw reply [flat|nested] 321+ messages in thread
* Re: [PATCHES v3][RFC][CFT] mount-related stuff
  2025-09-03 14:47   ` Linus Torvalds
@ 2025-09-03 18:14     ` Al Viro
  0 siblings, 0 replies; 321+ messages in thread
From: Al Viro @ 2025-09-03 18:14 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-fsdevel, Christian Brauner, Jan Kara

On Wed, Sep 03, 2025 at 07:47:18AM -0700, Linus Torvalds wrote:
> On Tue, 2 Sept 2025 at 21:54, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > If nobody objects, this goes into #for-next.
>
> Looks all sane to me.
>
> What was the issue with generic/475? I have missed that context..

At some point testing that branch has caught a failure in generic/475.
Unfortunately, it wouldn't trigger on every run, so there was a possibility
that it started earlier.

When I went digging, I've found it with trixie kernel (6.12.38 in that kvm,
at the time) rebuilt with my local config; the one used by debian didn't
trigger that. Bisection by config converged to PREEMPT_VOLUNTARY (no visible
failures) changed to PREEMPT (failures happen with odds a bit below 10%).

There are several failure modes; the most common is something like
...
echo '1' 2>&1 > /sys/fs/xfs/dm-0/error/fail_at_unmount
echo '0' 2>&1 > /sys/fs/xfs/dm-0/error/metadata/EIO/max_retries
echo '0' 2>&1 > /sys/fs/xfs/dm-0/error/metadata/EIO/retry_timeout_seconds
fsstress: check_cwd stat64() returned -1 with errno: 5 (Input/output error)
fsstress: check_cwd failure
fsstress: check_cwd stat64() returned -1 with errno: 5 (Input/output error)
fsstress: check_cwd failure
fsstress: check_cwd stat64() returned -1 with errno: 5 (Input/output error)
fsstress: check_cwd failure
fsstress: check_cwd stat64() returned -1 with errno: 5 (Input/output error)
fsstress: check_cwd failure
fsstress killed (pid 10824)
fsstress killed (pid 10826)
fsstress killed (pid 10827)
fsstress killed (pid 10828)
fsstress killed (pid 10829)
umount: /home/scratch: target is busy.
unmount failed
umount: /home/scratch: target is busy.
umount: /dev/sdb2: not mounted.

in the end of output (that's mainline v6.12); other variants include e.g.
quietly hanging udevadm wait (killable). It's bloody annoying to bisect -
100-iterations run takes about 2.5 hours and while usually a failure happens
in the first 40 minutes or so or not at all...

PREEMPT definitely is the main contributor to the failure odds... I'm doing
a bisection between v6.12 and v6.10 at the moment, will post when I get
something more useful...

^ permalink raw reply [flat|nested] 321+ messages in thread
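To put rough numbers on why a failure with odds below 10% per run is painful to bisect, here is a small worked sketch. It assumes runs are independent and uses 8% as a stand-in for the per-run failure odds; both are simplifications, and the function names are made up for illustration.

```c
#include <assert.h>

/* Probability that n independent runs all pass when each fails with odds p. */
static double demo_all_pass(double p, int n)
{
	double q = 1.0;

	assert(p >= 0.0 && p <= 1.0 && n >= 0);
	for (int i = 0; i < n; i++)
		q *= 1.0 - p;
	return q;
}

/*
 * Smallest number of clean runs after which the chance that a bad
 * kernel slipped through unnoticed is at most @risk.
 */
static int demo_runs_needed(double p, double risk)
{
	int n = 0;
	double q = 1.0;

	assert(p > 0.0 && p <= 1.0 && risk > 0.0);
	while (q > risk) {
		q *= 1.0 - p;
		n++;
	}
	return n;
}
```

Under those assumptions demo_runs_needed(0.08, 0.01) comes out at 56, i.e. more than fifty clean iterations are needed before calling a bisection step "good" with 99% confidence, which is why each step ends up costing a sizeable part of a 100-iteration run.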