linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/4] export mount info for "unmounted" mounts
@ 2025-10-02 12:48 Bhavik Sachdev
  2025-10-02 12:48 ` [PATCH 1/4] fs/namespace: add umount_mnt_ns mount namespace for unmounted mounts Bhavik Sachdev
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Bhavik Sachdev @ 2025-10-02 12:48 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner
  Cc: linux-fsdevel, linux-kernel, Aleksa Sarai, Bhavik Sachdev,
	Pavel Tikhomirov, Jan Kara, John Garry, Arnaldo Carvalho de Melo,
	Darrick J . Wong, Namhyung Kim, Ingo Molnar, Andrei Vagin,
	Alexander Mikhalitsyn

By "unmounted" mounts we mean mounts that have been unmounted using
umount2(mnt, MNT_DETACH) but we still have file descriptors to files on
that mount. We want to add the ability to handle such mounts in CRIU
(Checkpoint/Restore in Userspace).

Currently, we have no way to get mount info for these mounts as they do
not appear in /proc/<pid>/mountinfo and statmount does not work on them.

We solve this problem by introducing a new umount_mnt_ns for such
mounts. Instead of their namespace being NULL, they get added to this
umount_mnt_ns. This allows us to use statmount on such mounts and export
their mnt_ns_id through statx.

We use this patchset with CRIU to support checkpoint/restore of
"unmounted" mounts in this pull request:
https://github.com/checkpoint-restore/criu/pull/2754.

All these patches are also available in this branch on github:
https://github.com/bsach64/linux/tree/umount-mnt-ns-plus-statx

Bhavik Sachdev (2):
  statmount: allow for "unmounted" mounts
  fs/stat: export mnt_ns_id through statx

Pavel Tikhomirov (2):
  fs/namespace: add umount_mnt_ns mount namespace for unmounted mounts
  fs/namespace: add umounted mounts to umount_mnt_ns

 fs/d_path.c                     |  2 +-
 fs/mount.h                      | 10 +++++-
 fs/namespace.c                  | 62 ++++++++++++++++++++++++++++-----
 fs/stat.c                       | 15 ++++++--
 include/linux/proc_ns.h         |  1 +
 include/linux/stat.h            |  1 +
 include/uapi/linux/nsfs.h       |  1 +
 include/uapi/linux/stat.h       |  4 ++-
 tools/include/uapi/linux/stat.h |  4 ++-
 9 files changed, 85 insertions(+), 15 deletions(-)

-- 
2.51.0


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 1/4] fs/namespace: add umount_mnt_ns mount namespace for unmounted mounts
  2025-10-02 12:48 [PATCH 0/4] export mount info for "unmounted" mounts Bhavik Sachdev
@ 2025-10-02 12:48 ` Bhavik Sachdev
  2025-10-02 12:48 ` [PATCH 2/4] fs/namespace: add umounted mounts to umount_mnt_ns Bhavik Sachdev
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 9+ messages in thread
From: Bhavik Sachdev @ 2025-10-02 12:48 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner
  Cc: linux-fsdevel, linux-kernel, Aleksa Sarai, Bhavik Sachdev,
	Pavel Tikhomirov, Jan Kara, John Garry, Arnaldo Carvalho de Melo,
	Darrick J . Wong, Namhyung Kim, Ingo Molnar, Andrei Vagin,
	Alexander Mikhalitsyn

From: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

We would like the ability to get mount info for mounts that have been
"unmounted" but still have open fds (umount2(mnt, MNT_DETACH)).

This patch introduces a new umount_mnt_ns to which these "unmounted"
mounts will be moved to instead of their mount namespaces being NULL.
We add this umount_mnt_ns to init_userns so all "umounted" mounts are
accessible via root userns only.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
---
 fs/namespace.c            | 14 ++++++++++++++
 include/linux/proc_ns.h   |  1 +
 include/uapi/linux/nsfs.h |  1 +
 3 files changed, 16 insertions(+)

diff --git a/fs/namespace.c b/fs/namespace.c
index ae6d1312b184..70fe01d810df 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -107,6 +107,9 @@ struct mount_kattr {
 struct kobject *fs_kobj __ro_after_init;
 EXPORT_SYMBOL_GPL(fs_kobj);
 
+struct mnt_namespace *umount_mnt_ns __ro_after_init;
+EXPORT_SYMBOL_GPL(umount_mnt_ns);
+
 /*
  * vfsmount lock may be taken for read to prevent changes to the
  * vfsmount hash, ie. during mountpoint lookups or walking back
@@ -6121,6 +6124,17 @@ static void __init init_mount_tree(void)
 	set_fs_root(current->fs, &root);
 
 	mnt_ns_tree_add(ns);
+
+	umount_mnt_ns = alloc_mnt_ns(&init_user_ns, true);
+	if (IS_ERR(umount_mnt_ns)) {
+		free_mnt_ns(ns);
+		panic("Can't allocate initial umount namespace");
+	}
+	umount_mnt_ns->seq = atomic64_inc_return(&mnt_ns_seq);
+	umount_mnt_ns->seq_origin = ns->seq;
+	umount_mnt_ns->ns.inum = PROC_UMNT_INIT_INO;
+
+	mnt_ns_tree_add(umount_mnt_ns);
 }
 
 void __init mnt_init(void)
diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index 4b20375f3783..c1e8edba862d 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -48,6 +48,7 @@ enum {
 	PROC_TIME_INIT_INO	= TIME_NS_INIT_INO,
 	PROC_NET_INIT_INO	= NET_NS_INIT_INO,
 	PROC_MNT_INIT_INO	= MNT_NS_INIT_INO,
+	PROC_UMNT_INIT_INO	= UMNT_NS_INIT_INO,
 };
 
 #ifdef CONFIG_PROC_FS
diff --git a/include/uapi/linux/nsfs.h b/include/uapi/linux/nsfs.h
index 97d8d80d139f..8bb0df8954bb 100644
--- a/include/uapi/linux/nsfs.h
+++ b/include/uapi/linux/nsfs.h
@@ -51,6 +51,7 @@ enum init_ns_ino {
 	TIME_NS_INIT_INO	= 0xEFFFFFFAU,
 	NET_NS_INIT_INO		= 0xEFFFFFF9U,
 	MNT_NS_INIT_INO		= 0xEFFFFFF8U,
+	UMNT_NS_INIT_INO	= 0xEFFFFFF7U,
 };
 
 #endif /* __LINUX_NSFS_H */
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 2/4] fs/namespace: add umounted mounts to umount_mnt_ns
  2025-10-02 12:48 [PATCH 0/4] export mount info for "unmounted" mounts Bhavik Sachdev
  2025-10-02 12:48 ` [PATCH 1/4] fs/namespace: add umount_mnt_ns mount namespace for unmounted mounts Bhavik Sachdev
@ 2025-10-02 12:48 ` Bhavik Sachdev
  2025-10-02 16:34   ` Al Viro
  2025-10-02 12:48 ` [PATCH 3/4] statmount: allow for "unmounted" mounts Bhavik Sachdev
  2025-10-02 12:48 ` [PATCH 4/4] fs/stat: export mnt_ns_id through statx Bhavik Sachdev
  3 siblings, 1 reply; 9+ messages in thread
From: Bhavik Sachdev @ 2025-10-02 12:48 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner
  Cc: linux-fsdevel, linux-kernel, Aleksa Sarai, Bhavik Sachdev,
	Pavel Tikhomirov, Jan Kara, John Garry, Arnaldo Carvalho de Melo,
	Darrick J . Wong, Namhyung Kim, Ingo Molnar, Andrei Vagin,
	Alexander Mikhalitsyn

From: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

This patch add "unmounted" mounts to umount_mnt_ns instead of mount
namespace being NULL for such mounts. This will allow us to later use
statmount to get mount info about these mounts.

We also introduce proper checks so that "unmounted" mounts are still
detected correctly.

We delete mounts from umount_mnt_ns when no references to them exist.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
---
 fs/d_path.c    |  2 +-
 fs/mount.h     | 10 +++++++++-
 fs/namespace.c | 23 ++++++++++++++++++++---
 3 files changed, 30 insertions(+), 5 deletions(-)

diff --git a/fs/d_path.c b/fs/d_path.c
index bb365511066b..c6a4118899e1 100644
--- a/fs/d_path.c
+++ b/fs/d_path.c
@@ -119,7 +119,7 @@ static int __prepend_path(const struct dentry *dentry, const struct mount *mnt,
 			/* Global root */
 			mnt_ns = READ_ONCE(mnt->mnt_ns);
 			/* open-coded is_mounted() to use local mnt_ns */
-			if (!IS_ERR_OR_NULL(mnt_ns) && !is_anon_ns(mnt_ns))
+			if (!IS_ERR_OR_NULL(mnt_ns) && !is_anon_ns(mnt_ns) && !is_umount_ns(mnt_ns))
 				return 1;	// absolute root
 			else
 				return 2;	// detached or not attached yet
diff --git a/fs/mount.h b/fs/mount.h
index 97737051a8b9..03f8165939b4 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -122,10 +122,18 @@ static inline int mnt_has_parent(const struct mount *mnt)
 	return mnt != mnt->mnt_parent;
 }
 
+extern struct mnt_namespace *umount_mnt_ns;
+
+static inline bool is_umount_ns(struct mnt_namespace *ns)
+{
+	return ns == umount_mnt_ns;
+}
+
 static inline int is_mounted(struct vfsmount *mnt)
 {
+	struct mnt_namespace *ns = READ_ONCE(real_mount(mnt)->mnt_ns);
 	/* neither detached nor internal? */
-	return !IS_ERR_OR_NULL(real_mount(mnt)->mnt_ns);
+	return !IS_ERR_OR_NULL(ns) && !is_umount_ns(ns);
 }
 
 extern struct mount *__lookup_mnt(struct vfsmount *, struct dentry *);
diff --git a/fs/namespace.c b/fs/namespace.c
index 70fe01d810df..0b4be12c02de 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1016,7 +1016,7 @@ static inline bool check_anonymous_mnt(struct mount *mnt)
 {
 	u64 seq;
 
-	if (!is_anon_ns(mnt->mnt_ns))
+	if (!is_anon_ns(mnt->mnt_ns) || is_umount_ns(mnt->mnt_ns))
 		return false;
 
 	seq = mnt->mnt_ns->seq_origin;
@@ -1400,9 +1400,11 @@ static void mntput_no_expire(struct mount *mnt)
 {
 	LIST_HEAD(list);
 	int count;
+	struct mnt_namespace *ns;
 
 	rcu_read_lock();
-	if (likely(READ_ONCE(mnt->mnt_ns))) {
+	ns = READ_ONCE(mnt->mnt_ns);
+	if (likely(ns && !is_umount_ns(ns))) {
 		/*
 		 * Since we don't do lock_mount_hash() here,
 		 * ->mnt_ns can change under us.  However, if it's
@@ -1438,6 +1440,18 @@ static void mntput_no_expire(struct mount *mnt)
 	mnt->mnt.mnt_flags |= MNT_DOOMED;
 	rcu_read_unlock();
 
+	if (mnt_ns_attached(mnt)) {
+		struct mnt_namespace *ns;
+
+		move_from_ns(mnt);
+		ns = mnt->mnt_ns;
+		if (ns) {
+			ns->nr_mounts--;
+			__touch_mnt_namespace(ns);
+		}
+		mnt->mnt_ns = NULL;
+	}
+
 	list_del(&mnt->mnt_instance);
 	if (unlikely(!list_empty(&mnt->mnt_expire)))
 		list_del(&mnt->mnt_expire);
@@ -1885,6 +1899,9 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 		 * namespace, etc.
 		 */
 		mnt_notify_add(p);
+
+		mnt_add_to_ns(umount_mnt_ns, p);
+		umount_mnt_ns->nr_mounts++;
 	}
 }
 
@@ -4804,7 +4821,7 @@ static int can_idmap_mount(const struct mount_kattr *kattr, struct mount *mnt)
 		return -EPERM;
 
 	/* Mount has already been visible in the filesystem hierarchy. */
-	if (!is_anon_ns(mnt->mnt_ns))
+	if (!is_anon_ns(mnt->mnt_ns) || is_umount_ns(mnt->mnt_ns))
 		return -EINVAL;
 
 	return 0;
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 3/4] statmount: allow for "unmounted" mounts
  2025-10-02 12:48 [PATCH 0/4] export mount info for "unmounted" mounts Bhavik Sachdev
  2025-10-02 12:48 ` [PATCH 1/4] fs/namespace: add umount_mnt_ns mount namespace for unmounted mounts Bhavik Sachdev
  2025-10-02 12:48 ` [PATCH 2/4] fs/namespace: add umounted mounts to umount_mnt_ns Bhavik Sachdev
@ 2025-10-02 12:48 ` Bhavik Sachdev
  2025-10-02 12:48 ` [PATCH 4/4] fs/stat: export mnt_ns_id through statx Bhavik Sachdev
  3 siblings, 0 replies; 9+ messages in thread
From: Bhavik Sachdev @ 2025-10-02 12:48 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner
  Cc: linux-fsdevel, linux-kernel, Aleksa Sarai, Bhavik Sachdev,
	Pavel Tikhomirov, Jan Kara, John Garry, Arnaldo Carvalho de Melo,
	Darrick J . Wong, Namhyung Kim, Ingo Molnar, Andrei Vagin,
	Alexander Mikhalitsyn

With "unmounted" mounts getting added to a separate umount_mnt_ns, we
need special handling in statmount in order for it to work on
"unmounted" mounts.

unmount_mnt_ns has no root mount (it doesn't really make sense for it to
have one) and "unmounted" mounts have no mountpoint. We handle both
these things in statmount and output the mountpoint as "[detached]" in
case of an "unmounted" mount.

Signed-off-by: Bhavik Sachdev <b.sachdev1904@gmail.com>
---
 fs/namespace.c | 25 +++++++++++++++++++------
 1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 0b4be12c02de..29d0e692b365 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -5365,6 +5365,12 @@ static int statmount_mnt_root(struct kstatmount *s, struct seq_file *seq)
 	return 0;
 }
 
+static int statmount_mnt_point_detached(struct kstatmount *s, struct seq_file *seq)
+{
+	seq_puts(seq, "[detached]");
+	return 0;
+}
+
 static int statmount_mnt_point(struct kstatmount *s, struct seq_file *seq)
 {
 	struct vfsmount *mnt = s->mnt;
@@ -5589,7 +5595,11 @@ static int statmount_string(struct kstatmount *s, u64 flag)
 		break;
 	case STATMOUNT_MNT_POINT:
 		offp = &sm->mnt_point;
-		ret = statmount_mnt_point(s, seq);
+		if (!s->root.mnt && !s->root.dentry)
+			/* detached mount case */
+			ret = statmount_mnt_point_detached(s, seq);
+		else
+			ret = statmount_mnt_point(s, seq);
 		break;
 	case STATMOUNT_MNT_OPTS:
 		offp = &sm->mnt_opts;
@@ -5743,17 +5753,20 @@ static int do_statmount(struct kstatmount *s, u64 mnt_id, u64 mnt_ns_id,
 	if (!s->mnt)
 		return -ENOENT;
 
-	err = grab_requested_root(ns, &root);
-	if (err)
-		return err;
+	if (!is_umount_ns(ns)) {
+		err = grab_requested_root(ns, &root);
+		if (err)
+			return err;
+	}
 
 	/*
 	 * Don't trigger audit denials. We just want to determine what
 	 * mounts to show users.
 	 */
 	m = real_mount(s->mnt);
-	if (!is_path_reachable(m, m->mnt.mnt_root, &root) &&
-	    !ns_capable_noaudit(ns->user_ns, CAP_SYS_ADMIN))
+
+	if (!is_umount_ns(ns) && !is_path_reachable(m, m->mnt.mnt_root, &root) &&
+		!ns_capable_noaudit(ns->user_ns, CAP_SYS_ADMIN))
 		return -EPERM;
 
 	err = security_sb_statfs(s->mnt->mnt_root);
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 4/4] fs/stat: export mnt_ns_id through statx
  2025-10-02 12:48 [PATCH 0/4] export mount info for "unmounted" mounts Bhavik Sachdev
                   ` (2 preceding siblings ...)
  2025-10-02 12:48 ` [PATCH 3/4] statmount: allow for "unmounted" mounts Bhavik Sachdev
@ 2025-10-02 12:48 ` Bhavik Sachdev
  3 siblings, 0 replies; 9+ messages in thread
From: Bhavik Sachdev @ 2025-10-02 12:48 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner
  Cc: linux-fsdevel, linux-kernel, Aleksa Sarai, Bhavik Sachdev,
	Pavel Tikhomirov, Jan Kara, John Garry, Arnaldo Carvalho de Melo,
	Darrick J . Wong, Namhyung Kim, Ingo Molnar, Andrei Vagin,
	Alexander Mikhalitsyn

Currently, we have no good way to get mount info for the mount a given
file descriptor is on. statx only provides it's mnt_id but we would have to
call statmount with every single mnt_ns_id in order to get mount info
since we have no way of knowing the mount namespace this mnt_id belongs
to.

This patch modifies statx to also export mnt_ns_id, allowing userspace
to easily get mount info for the mount a file descriptor is on by using
a combination of statx + statmount.

```
statx(fd, “”, AT_EMPTY_PATH,
	STATX_MNT_ID_UNIQUE | STATX_MNT_NS_ID, &stat);
struct mnt_id_req req = {
	.mnt_id = stat.stx_mnt_id,
	.mnt_ns_id = stat.stx_mnt_ns_id
};
statmount(&req, &statmount_buf, buf_size, 0);
```

Signed-off-by: Bhavik Sachdev <b.sachdev1904@gmail.com>
---
 fs/stat.c                       | 15 +++++++++++++--
 include/linux/stat.h            |  1 +
 include/uapi/linux/stat.h       |  4 +++-
 tools/include/uapi/linux/stat.h |  4 +++-
 4 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/fs/stat.c b/fs/stat.c
index f95c1dc3eaa4..1cc29946c0c3 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -296,18 +296,28 @@ static int statx_lookup_flags(int flags)
 static int vfs_statx_path(struct path *path, int flags, struct kstat *stat,
 			  u32 request_mask)
 {
+	struct mount *real_mnt;
 	int error = vfs_getattr(path, stat, request_mask, flags);
 	if (error)
 		return error;
 
+	real_mnt = real_mount(path->mnt);
+
 	if (request_mask & STATX_MNT_ID_UNIQUE) {
-		stat->mnt_id = real_mount(path->mnt)->mnt_id_unique;
+		stat->mnt_id = real_mnt->mnt_id_unique;
 		stat->result_mask |= STATX_MNT_ID_UNIQUE;
 	} else {
-		stat->mnt_id = real_mount(path->mnt)->mnt_id;
+		stat->mnt_id = real_mnt->mnt_id;
 		stat->result_mask |= STATX_MNT_ID;
 	}
 
+	if (request_mask & STATX_MNT_NS_ID) {
+		if (!real_mnt->mnt_ns)
+			/* returning EINVAL for now */
+			return -EINVAL;
+		stat->mnt_ns_id = real_mnt->mnt_ns->seq;
+	}
+
 	if (path_mounted(path))
 		stat->attributes |= STATX_ATTR_MOUNT_ROOT;
 	stat->attributes_mask |= STATX_ATTR_MOUNT_ROOT;
@@ -745,6 +755,7 @@ cp_statx(const struct kstat *stat, struct statx __user *buffer)
 	tmp.stx_atomic_write_unit_max = stat->atomic_write_unit_max;
 	tmp.stx_atomic_write_segments_max = stat->atomic_write_segments_max;
 	tmp.stx_atomic_write_unit_max_opt = stat->atomic_write_unit_max_opt;
+	tmp.stx_mnt_ns_id = stat->mnt_ns_id;
 
 	return copy_to_user(buffer, &tmp, sizeof(tmp)) ? -EFAULT : 0;
 }
diff --git a/include/linux/stat.h b/include/linux/stat.h
index e3d00e7bb26d..c62b70ce30d9 100644
--- a/include/linux/stat.h
+++ b/include/linux/stat.h
@@ -59,6 +59,7 @@ struct kstat {
 	u32		atomic_write_unit_max;
 	u32		atomic_write_unit_max_opt;
 	u32		atomic_write_segments_max;
+	u64		mnt_ns_id;
 };
 
 /* These definitions are internal to the kernel for now. Mainly used by nfsd. */
diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
index 1686861aae20..1071f5097d26 100644
--- a/include/uapi/linux/stat.h
+++ b/include/uapi/linux/stat.h
@@ -187,7 +187,8 @@ struct statx {
 	__u32	__spare2[1];
 
 	/* 0xc0 */
-	__u64	__spare3[8];	/* Spare space for future expansion */
+	__u64   stx_mnt_ns_id;
+	__u64	__spare3[7];	/* Spare space for future expansion */
 
 	/* 0x100 */
 };
@@ -219,6 +220,7 @@ struct statx {
 #define STATX_SUBVOL		0x00008000U	/* Want/got stx_subvol */
 #define STATX_WRITE_ATOMIC	0x00010000U	/* Want/got atomic_write_* fields */
 #define STATX_DIO_READ_ALIGN	0x00020000U	/* Want/got dio read alignment info */
+#define STATX_MNT_NS_ID		0x00040000U	/* Want/got stx_mnt_ns_id */
 
 #define STATX__RESERVED		0x80000000U	/* Reserved for future struct statx expansion */
 
diff --git a/tools/include/uapi/linux/stat.h b/tools/include/uapi/linux/stat.h
index 1686861aae20..1071f5097d26 100644
--- a/tools/include/uapi/linux/stat.h
+++ b/tools/include/uapi/linux/stat.h
@@ -187,7 +187,8 @@ struct statx {
 	__u32	__spare2[1];
 
 	/* 0xc0 */
-	__u64	__spare3[8];	/* Spare space for future expansion */
+	__u64   stx_mnt_ns_id;
+	__u64	__spare3[7];	/* Spare space for future expansion */
 
 	/* 0x100 */
 };
@@ -219,6 +220,7 @@ struct statx {
 #define STATX_SUBVOL		0x00008000U	/* Want/got stx_subvol */
 #define STATX_WRITE_ATOMIC	0x00010000U	/* Want/got atomic_write_* fields */
 #define STATX_DIO_READ_ALIGN	0x00020000U	/* Want/got dio read alignment info */
+#define STATX_MNT_NS_ID		0x00040000U	/* Want/got stx_mnt_ns_id */
 
 #define STATX__RESERVED		0x80000000U	/* Reserved for future struct statx expansion */
 
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/4] fs/namespace: add umounted mounts to umount_mnt_ns
  2025-10-02 12:48 ` [PATCH 2/4] fs/namespace: add umounted mounts to umount_mnt_ns Bhavik Sachdev
@ 2025-10-02 16:34   ` Al Viro
  2025-10-03  5:03     ` Pavel Tikhomirov
  0 siblings, 1 reply; 9+ messages in thread
From: Al Viro @ 2025-10-02 16:34 UTC (permalink / raw)
  To: Bhavik Sachdev
  Cc: Christian Brauner, linux-fsdevel, linux-kernel, Aleksa Sarai,
	Pavel Tikhomirov, Jan Kara, John Garry, Arnaldo Carvalho de Melo,
	Darrick J . Wong, Namhyung Kim, Ingo Molnar, Andrei Vagin,
	Alexander Mikhalitsyn

On Thu, Oct 02, 2025 at 06:18:38PM +0530, Bhavik Sachdev wrote:

> @@ -1438,6 +1440,18 @@ static void mntput_no_expire(struct mount *mnt)
>  	mnt->mnt.mnt_flags |= MNT_DOOMED;
>  	rcu_read_unlock();
>  
> +	if (mnt_ns_attached(mnt)) {
> +		struct mnt_namespace *ns;
> +
> +		move_from_ns(mnt);
> +		ns = mnt->mnt_ns;
> +		if (ns) {
> +			ns->nr_mounts--;
> +			__touch_mnt_namespace(ns);
> +		}
> +		mnt->mnt_ns = NULL;
> +	}

Sorry, no.  You are introducing very special locking for one namespace's rbtree.
Not gonna fly.

NAKed-by: Al Viro <viro@zeniv.linux.org.uk>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/4] fs/namespace: add umounted mounts to umount_mnt_ns
  2025-10-02 16:34   ` Al Viro
@ 2025-10-03  5:03     ` Pavel Tikhomirov
  2025-10-06 13:45       ` Christian Brauner
  0 siblings, 1 reply; 9+ messages in thread
From: Pavel Tikhomirov @ 2025-10-03  5:03 UTC (permalink / raw)
  To: Al Viro, Bhavik Sachdev
  Cc: Christian Brauner, linux-fsdevel, linux-kernel, Aleksa Sarai,
	Jan Kara, John Garry, Arnaldo Carvalho de Melo, Darrick J . Wong,
	Namhyung Kim, Ingo Molnar, Andrei Vagin, Alexander Mikhalitsyn



On 10/3/25 00:34, Al Viro wrote:
> On Thu, Oct 02, 2025 at 06:18:38PM +0530, Bhavik Sachdev wrote:
> 
>> @@ -1438,6 +1440,18 @@ static void mntput_no_expire(struct mount *mnt)
>>   	mnt->mnt.mnt_flags |= MNT_DOOMED;
>>   	rcu_read_unlock();
>>   
>> +	if (mnt_ns_attached(mnt)) {
>> +		struct mnt_namespace *ns;
>> +
>> +		move_from_ns(mnt);
>> +		ns = mnt->mnt_ns;
>> +		if (ns) {
>> +			ns->nr_mounts--;
>> +			__touch_mnt_namespace(ns);
>> +		}
>> +		mnt->mnt_ns = NULL;
>> +	}
> 
> Sorry, no.  You are introducing very special locking for one namespace's rbtree.
> Not gonna fly.
> 
> NAKed-by: Al Viro <viro@zeniv.linux.org.uk>

Thank you for looking into it.

Sorry, we didn't have any intent to break locking, we would like to 
improve/rework the patch if we can.

1) I see that I missed that __touch_mnt_namespace requires vfsmount lock 
(according to it's comment) and we don't have it in this code path, so 
that is one problem.

I also see that it sends wake up for mounts_poll() waiters, but since 
no-one can join umount_mnt_ns, there is no point in sending wakeup as 
no-one can poll on /proc/mounts for this namespace. So we can remove the 
use of __touch_mnt_namespace seemingly safely and remove this incorrect 
locking case.

2) Another thing is, previously when we were at this point in code we 
were already out of namespace rbtree strictly before the reference from 
namespace was put (namespace_unlock()->mntput()). So no-one could've 
lookup-ed us, but after this change one can lookup us from umount_mnt_ns 
rbtree while we are in mntput_no_expire().

This one is a hard one, in this implementation at the minimum we can end 
up using the mount after it was freed due to this.

Previously mount lookup was protected by namespace_sem, and now when I 
use move_from_ns out of namespace_sem this protection is broken.

One stupid idea to fix it is to leave one reference to mount from 
detatched mntns, and have an asynchronous mechanism which detects last 
reference (in mntput_no_expire) and (under namespace_sem) first 
disconnects mount from umount_mnt_ns and only then calls final mntput.

We will think more on this one, maybe we will come up with something 
smarter.

3) We had an alternative approach to make unmounted mounts mountinfo 
visible for the user, by allowing fd-based statmount() 
https://github.com/bsach64/linux/commit/ac0c03d44fb1e6f0745aec81079fca075e75b354

But we also recognize a problem with it that it would require getting 
mountinfo from fd which is not root dentry of the mount, but any dentry 
(as we (CRIU) don't really have an option to choose which fd will be 
given to us).

I share this in case, maybe, you like it more than adding umount_mnt_ns.

-- 
Best regards, Pavel Tikhomirov
Senior Software Developer, Virtuozzo.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/4] fs/namespace: add umounted mounts to umount_mnt_ns
  2025-10-03  5:03     ` Pavel Tikhomirov
@ 2025-10-06 13:45       ` Christian Brauner
  2025-10-07 18:40         ` Aleksa Sarai
  0 siblings, 1 reply; 9+ messages in thread
From: Christian Brauner @ 2025-10-06 13:45 UTC (permalink / raw)
  To: Pavel Tikhomirov
  Cc: Al Viro, Bhavik Sachdev, linux-fsdevel, linux-kernel,
	Aleksa Sarai, Jan Kara, John Garry, Arnaldo Carvalho de Melo,
	Darrick J . Wong, Namhyung Kim, Ingo Molnar, Andrei Vagin,
	Alexander Mikhalitsyn

On Fri, Oct 03, 2025 at 01:03:46PM +0800, Pavel Tikhomirov wrote:
> 
> 
> On 10/3/25 00:34, Al Viro wrote:
> > On Thu, Oct 02, 2025 at 06:18:38PM +0530, Bhavik Sachdev wrote:
> > 
> > > @@ -1438,6 +1440,18 @@ static void mntput_no_expire(struct mount *mnt)
> > >   	mnt->mnt.mnt_flags |= MNT_DOOMED;
> > >   	rcu_read_unlock();
> > > +	if (mnt_ns_attached(mnt)) {
> > > +		struct mnt_namespace *ns;
> > > +
> > > +		move_from_ns(mnt);
> > > +		ns = mnt->mnt_ns;
> > > +		if (ns) {
> > > +			ns->nr_mounts--;
> > > +			__touch_mnt_namespace(ns);
> > > +		}
> > > +		mnt->mnt_ns = NULL;
> > > +	}
> > 
> > Sorry, no.  You are introducing very special locking for one namespace's rbtree.
> > Not gonna fly.
> > 
> > NAKed-by: Al Viro <viro@zeniv.linux.org.uk>
> 
> Thank you for looking into it.
> 
> Sorry, we didn't have any intent to break locking, we would like to
> improve/rework the patch if we can.
> 
> 1) I see that I missed that __touch_mnt_namespace requires vfsmount lock
> (according to it's comment) and we don't have it in this code path, so that
> is one problem.
> 
> I also see that it sends wake up for mounts_poll() waiters, but since no-one
> can join umount_mnt_ns, there is no point in sending wakeup as no-one can
> poll on /proc/mounts for this namespace. So we can remove the use of
> __touch_mnt_namespace seemingly safely and remove this incorrect locking
> case.
> 
> 2) Another thing is, previously when we were at this point in code we were
> already out of namespace rbtree strictly before the reference from namespace
> was put (namespace_unlock()->mntput()). So no-one could've lookup-ed us, but
> after this change one can lookup us from umount_mnt_ns rbtree while we are
> in mntput_no_expire().
> 
> This one is a hard one, in this implementation at the minimum we can end up
> using the mount after it was freed due to this.
> 
> Previously mount lookup was protected by namespace_sem, and now when I use
> move_from_ns out of namespace_sem this protection is broken.
> 
> One stupid idea to fix it is to leave one reference to mount from detatched
> mntns, and have an asynchronous mechanism which detects last reference (in
> mntput_no_expire) and (under namespace_sem) first disconnects mount from
> umount_mnt_ns and only then calls final mntput.
> 
> We will think more on this one, maybe we will come up with something
> smarter.
> 
> 3) We had an alternative approach to make unmounted mounts mountinfo visible
> for the user, by allowing fd-based statmount() https://github.com/bsach64/linux/commit/ac0c03d44fb1e6f0745aec81079fca075e75b354
> 
> But we also recognize a problem with it that it would require getting
> mountinfo from fd which is not root dentry of the mount, but any dentry (as
> we (CRIU) don't really have an option to choose which fd will be given to
> us).

The part about this just using an fd is - supresses gag reflex - fine.
We do that with the mount namespaces for listmount() already via
mnt_req->spare.

The part that I dislike is exactly the one you pointed out: using an
arbitrary fd to retrieve information about the mount but it's probably
something we can live with since the alternative is complicating the
lifetime rules of the mount and namespace interaction.

I had thought about a way to tie the _internal_ lifetime of the
namespace to the lifetime of unmounted mounts through the passive
reference count by moving them to a separate rb_root unmounted in the
namespace instance.

This would mean we'd have the owning namespace information around and
we'd also don't have to have any separate namespace around. The ->mnt_ns
field could work exactly the same. But alongside the ->mnt_ns pointer
we'd also have the ->mnt_ns_id of the container mount namespace stored.

On umount we'd move the mount to the unmounted mount tree in the same
namespace instance and take a passive reference to it. IOW, the
unmounted mounts keep the namespace alive but only internally.

So basically - handwaving - like this:

diff --git a/fs/mount.h b/fs/mount.h
index 97737051a8b9..4d3db03c8a82 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -14,6 +14,7 @@ struct mnt_namespace {
                struct rb_root  mounts;          /* Protected by namespace_sem */
                struct rb_node  *mnt_last_node;  /* last (rightmost) mount in the rbtree */
                struct rb_node  *mnt_first_node; /* first (leftmost) mount in the rbtree */
+               struct rb_root  unmounted;       /* unmounted mounts that are still active */
        };
        struct user_namespace   *user_ns;
        struct ucounts          *ucounts;
@@ -72,7 +73,10 @@ struct mount {
        struct hlist_head mnt_slave_list;/* list of slave mounts */
        struct hlist_node mnt_slave;    /* slave list entry */
        struct mount *mnt_master;       /* slave is on master->mnt_slave_list */
-       struct mnt_namespace *mnt_ns;   /* containing namespace */
+       struct {
+               struct mnt_namespace *mnt_ns;   /* containing namespace */
+               u64 mnt_ns_id;                  /* id of the containing mount namespace */
+       }
        struct mountpoint *mnt_mp;      /* where is it mounted */
        union {
                struct hlist_node mnt_mp_list;  /* list mounts with the same mountpoint */

->mnt_ns NULL still means as before that this is unmounted. The
containing namespace is alive (internally) and can still be looked up
via mnt->mnt_ns_id in the namespace tree. I stopped thinking here
because this has severe drawbacks:

* The scope of the namespace semaphore has to be extended to cover these
  mounts as well possibly even having to take it in cleanup_mnt() which
  is ugly and probably performance sensitive as it increases the
  codepaths that hammer on the semaphore.

  Alternative is a separate locking scheme. And if it's one thing that
  we don't need is another complex locking scheme in this code.

* The passive lifetime of the namespace would have to cover unmounted
  mounts which betrays the intent of the passive reference count. That's
  not supposed to regulate lifetimes beyond the namespace struct itself
  nor be bound to other objects in complex ways.

  Possibly we'd also have to tie the lifetime of the owning userns to
  the passive count so permission checking just works out of the box
  (One could also put that on the plus side of things but meh.).

* It's very weird to be able to statmount() unmounted mounts via the
  mount id if the actual mount namespace is indeed already dead.

  To put it another way: listmount() wouldn't surface the mount anymore
  - and rightly so - because it's not tied to any mount namespace
  anymore. But somehow we magically synthesize it into existence via
  statmount().

* While a mount is attached to a mount namespace said namespace is the
  rightful owner of the mount. Once the mount is unmounted that
  ownership is moved from the mount namespace and becomes inherently
  bound to processes that hold an active reference to that mount via
  file descriptors.

  But statmount() and listmount() are inherently about the mount
  namespace as the main container of the mount imho.

  Allowing an fd is probably be fine though but it dilutes the concept a
  bit which I'm not too fond of. Oh well.

The list continues...

So if you can make the fd-based statmount() work for your use-case then
this is probably the thing we should do.

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/4] fs/namespace: add umounted mounts to umount_mnt_ns
  2025-10-06 13:45       ` Christian Brauner
@ 2025-10-07 18:40         ` Aleksa Sarai
  0 siblings, 0 replies; 9+ messages in thread
From: Aleksa Sarai @ 2025-10-07 18:40 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Pavel Tikhomirov, Al Viro, Bhavik Sachdev, linux-fsdevel,
	linux-kernel, Jan Kara, John Garry, Arnaldo Carvalho de Melo,
	Darrick J . Wong, Namhyung Kim, Ingo Molnar, Andrei Vagin,
	Alexander Mikhalitsyn

[-- Attachment #1: Type: text/plain, Size: 1359 bytes --]

On 2025-10-06, Christian Brauner <brauner@kernel.org> wrote:
> On Fri, Oct 03, 2025 at 01:03:46PM +0800, Pavel Tikhomirov wrote:
> > 3) We had an alternative approach to make unmounted mounts mountinfo visible
> > for the user, by allowing fd-based statmount() https://github.com/bsach64/linux/commit/ac0c03d44fb1e6f0745aec81079fca075e75b354
> > 
> > But we also recognize a problem with it that it would require getting
> > mountinfo from fd which is not root dentry of the mount, but any dentry (as
> > we (CRIU) don't really have an option to choose which fd will be given to
> > us).
> 
> The part about this just using an fd is - supresses gag reflex - fine.
> We do that with the mount namespaces for listmount() already via
> mnt_req->spare.
> 
> The part that I dislike is exactly the one you pointed out: using an
> arbitrary fd to retrieve information about the mount but it's probably
> something we can live with since the alternative is complicating the
> lifetime rules of the mount and namespace interaction.

Well, to be fair this is basically the fstatfs(2) API so while it is
kind of funky it's not without precedence. open_by_handle_at(2) has
something similar if you don't mind incredibly funky API comparisons. ;)

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 265 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2025-10-07 18:41 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-10-02 12:48 [PATCH 0/4] export mount info for "unmounted" mounts Bhavik Sachdev
2025-10-02 12:48 ` [PATCH 1/4] fs/namespace: add umount_mnt_ns mount namespace for unmounted mounts Bhavik Sachdev
2025-10-02 12:48 ` [PATCH 2/4] fs/namespace: add umounted mounts to umount_mnt_ns Bhavik Sachdev
2025-10-02 16:34   ` Al Viro
2025-10-03  5:03     ` Pavel Tikhomirov
2025-10-06 13:45       ` Christian Brauner
2025-10-07 18:40         ` Aleksa Sarai
2025-10-02 12:48 ` [PATCH 3/4] statmount: allow for "unmounted" mounts Bhavik Sachdev
2025-10-02 12:48 ` [PATCH 4/4] fs/stat: export mnt_ns_id through statx Bhavik Sachdev

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).