[PATCH v3 00/10] fs: lockless mntns lookup

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v3 00/10] fs: lockless mntns lookup
@ 2024-12-12 23:03 Christian Brauner
  2024-12-12 23:03 ` [PATCH v3 01/10] mount: remove inlude/nospec.h include Christian Brauner
                   ` (10 more replies)
  0 siblings, 11 replies; 20+ messages in thread
From: Christian Brauner @ 2024-12-12 23:03 UTC (permalink / raw)
  To: Josef Bacik, Jeff Layton
  Cc: Paul E. McKenney, Peter Ziljstra, linux-fsdevel,
	Christian Brauner

Hey,

This now also includes selftests for iterating mount namespaces both
backwards and forwards.

Currently we take the read lock when looking for a mount namespace to
list mounts in. We can make this lockless. The simple search case can
just use a sequence counter to detect concurrent changes to the rbtree.

For walking the list of mount namespaces sequentially via nsfs we keep a
separate rcu list as rb_prev() and rb_next() aren't usable safely with
rcu.

Since creating mount namespaces is a relatively rare event compared with
querying mounts in a foreign mount namespace this is worth it. Once
libmount and systemd pick up this mechanism to list mounts in foreign
mount namespaces this will be used very frequently.

Thanks!
Christian

---
Changes in v3:
- Add selftests.
- Put list_head into a union with the wait_queue_head_t for poll instead
  of the mnt_ns_tree_node which would've risked breaking rbtree
  traversal.
- Handle insertion into the mount namespace list correctly by making use
  of the rbtree position information after the mount namespace has been
  added to it.
- Improve the documentation for the new list_bidir_{del,prev}_rcu().
- Link to v2: https://lore.kernel.org/r/20241212-work-mount-rbtree-lockless-v2-0-4fe6cef02534@kernel.org

Changes in v2:
- Remove mnt_ns_find_it_at() by switching to rb_find_rcu().
- Add separate list to lookup sequential mount namespaces.
- Link to v1: https://lore.kernel.org/r/20241210-work-mount-rbtree-lockless-v1-0-338366b9bbe4@kernel.org

---
Christian Brauner (10):
      mount: remove inlude/nospec.h include
      fs: add mount namespace to rbtree late
      fs: lockless mntns rbtree lookup
      rculist: add list_bidir_{del,prev}_rcu()
      fs: lockless mntns lookup for nsfs
      fs: simplify rwlock to spinlock
      seltests: move nsfs into filesystems subfolder
      selftests: add tests for mntns iteration
      selftests: remove unneeded include
      samples: add test-list-all-mounts

 fs/mount.h                                         |  18 +-
 fs/namespace.c                                     | 163 ++++++++------
 fs/nsfs.c                                          |   5 +-
 include/linux/rculist.h                            |  47 +++++
 samples/vfs/.gitignore                             |   1 +
 samples/vfs/Makefile                               |   2 +-
 samples/vfs/test-list-all-mounts.c                 | 235 +++++++++++++++++++++
 .../selftests/{ => filesystems}/nsfs/.gitignore    |   1 +
 .../selftests/{ => filesystems}/nsfs/Makefile      |   4 +-
 .../selftests/{ => filesystems}/nsfs/config        |   0
 .../selftests/filesystems/nsfs/iterate_mntns.c     | 149 +++++++++++++
 .../selftests/{ => filesystems}/nsfs/owner.c       |   0
 .../selftests/{ => filesystems}/nsfs/pidns.c       |   0
 tools/testing/selftests/pidfd/pidfd.h              |   1 -
 14 files changed, 546 insertions(+), 80 deletions(-)
---
base-commit: 40384c840ea1944d7c5a392e8975ed088ecf0b37
change-id: 20241207-work-mount-rbtree-lockless-7d4071b74f18


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v3 01/10] mount: remove inlude/nospec.h include
  2024-12-12 23:03 [PATCH v3 00/10] fs: lockless mntns lookup Christian Brauner
@ 2024-12-12 23:03 ` Christian Brauner
  2024-12-12 23:03 ` [PATCH v3 02/10] fs: add mount namespace to rbtree late Christian Brauner
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 20+ messages in thread
From: Christian Brauner @ 2024-12-12 23:03 UTC (permalink / raw)
  To: Josef Bacik, Jeff Layton
  Cc: Paul E. McKenney, Peter Ziljstra, linux-fsdevel,
	Christian Brauner

It's not needed, so remove it.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/namespace.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 23e81c2a1e3fee7d97df2a84a69438a677933654..c3dbe6a7ab6b1c77c2693cc75941da89fa921048 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -32,7 +32,6 @@
 #include <linux/fs_context.h>
 #include <linux/shmem_fs.h>
 #include <linux/mnt_idmapping.h>
-#include <linux/nospec.h>
 
 #include "pnode.h"
 #include "internal.h"

-- 
2.45.2


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v3 02/10] fs: add mount namespace to rbtree late
  2024-12-12 23:03 [PATCH v3 00/10] fs: lockless mntns lookup Christian Brauner
  2024-12-12 23:03 ` [PATCH v3 01/10] mount: remove inlude/nospec.h include Christian Brauner
@ 2024-12-12 23:03 ` Christian Brauner
  2024-12-12 23:03 ` [PATCH v3 03/10] fs: lockless mntns rbtree lookup Christian Brauner
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 20+ messages in thread
From: Christian Brauner @ 2024-12-12 23:03 UTC (permalink / raw)
  To: Josef Bacik, Jeff Layton
  Cc: Paul E. McKenney, Peter Ziljstra, linux-fsdevel,
	Christian Brauner

There's no point doing that under the namespace semaphore it just gives
the false impression that it protects the mount namespace rbtree and it
simply doesn't.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/namespace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index c3dbe6a7ab6b1c77c2693cc75941da89fa921048..10fa18dd66018fadfdc9d18c59a851eed7bd55ad 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3983,7 +3983,6 @@ struct mnt_namespace *copy_mnt_ns(unsigned long flags, struct mnt_namespace *ns,
 		while (p->mnt.mnt_root != q->mnt.mnt_root)
 			p = next_mnt(skip_mnt_tree(p), old);
 	}
-	mnt_ns_tree_add(new_ns);
 	namespace_unlock();
 
 	if (rootmnt)
@@ -3991,6 +3990,7 @@ struct mnt_namespace *copy_mnt_ns(unsigned long flags, struct mnt_namespace *ns,
 	if (pwdmnt)
 		mntput(pwdmnt);
 
+	mnt_ns_tree_add(new_ns);
 	return new_ns;
 }
 

-- 
2.45.2


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v3 03/10] fs: lockless mntns rbtree lookup
  2024-12-12 23:03 [PATCH v3 00/10] fs: lockless mntns lookup Christian Brauner
  2024-12-12 23:03 ` [PATCH v3 01/10] mount: remove inlude/nospec.h include Christian Brauner
  2024-12-12 23:03 ` [PATCH v3 02/10] fs: add mount namespace to rbtree late Christian Brauner
@ 2024-12-12 23:03 ` Christian Brauner
  2024-12-13  8:50   ` Peter Zijlstra
                     ` (2 more replies)
  2024-12-12 23:03 ` [PATCH v3 04/10] rculist: add list_bidir_{del,prev}_rcu() Christian Brauner
                   ` (7 subsequent siblings)
  10 siblings, 3 replies; 20+ messages in thread
From: Christian Brauner @ 2024-12-12 23:03 UTC (permalink / raw)
  To: Josef Bacik, Jeff Layton
  Cc: Paul E. McKenney, Peter Ziljstra, linux-fsdevel,
	Christian Brauner

Currently we use a read-write lock but for the simple search case we can
make this lockless. Creating a new mount namespace is a rather rare
event compared with querying mounts in a foreign mount namespace. Once
this is picked up by e.g., systemd to list mounts in another mount in
it's isolated services or in containers this will be used a lot so this
seems worthwhile doing.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/mount.h     |   5 ++-
 fs/namespace.c | 119 +++++++++++++++++++++++++++++++++++----------------------
 2 files changed, 77 insertions(+), 47 deletions(-)

diff --git a/fs/mount.h b/fs/mount.h
index 185fc56afc13338f8185fe818051444d540cbd5b..36ead0e45e8aa7614c00001102563a711d9dae6e 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -12,7 +12,10 @@ struct mnt_namespace {
 	struct user_namespace	*user_ns;
 	struct ucounts		*ucounts;
 	u64			seq;	/* Sequence number to prevent loops */
-	wait_queue_head_t poll;
+	union {
+		wait_queue_head_t	poll;
+		struct rcu_head		mnt_ns_rcu;
+	};
 	u64 event;
 	unsigned int		nr_mounts; /* # of mounts in the namespace */
 	unsigned int		pending_mounts;
diff --git a/fs/namespace.c b/fs/namespace.c
index 10fa18dd66018fadfdc9d18c59a851eed7bd55ad..52adee787eb1b6ee8831705b2b121854c3370fb3 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -79,6 +79,8 @@ static DECLARE_RWSEM(namespace_sem);
 static HLIST_HEAD(unmounted);	/* protected by namespace_sem */
 static LIST_HEAD(ex_mountpoints); /* protected by namespace_sem */
 static DEFINE_RWLOCK(mnt_ns_tree_lock);
+static seqcount_rwlock_t mnt_ns_tree_seqcount = SEQCNT_RWLOCK_ZERO(mnt_ns_tree_seqcount, &mnt_ns_tree_lock);
+
 static struct rb_root mnt_ns_tree = RB_ROOT; /* protected by mnt_ns_tree_lock */
 
 struct mount_kattr {
@@ -105,17 +107,6 @@ EXPORT_SYMBOL_GPL(fs_kobj);
  */
 __cacheline_aligned_in_smp DEFINE_SEQLOCK(mount_lock);
 
-static int mnt_ns_cmp(u64 seq, const struct mnt_namespace *ns)
-{
-	u64 seq_b = ns->seq;
-
-	if (seq < seq_b)
-		return -1;
-	if (seq > seq_b)
-		return 1;
-	return 0;
-}
-
 static inline struct mnt_namespace *node_to_mnt_ns(const struct rb_node *node)
 {
 	if (!node)
@@ -123,19 +114,41 @@ static inline struct mnt_namespace *node_to_mnt_ns(const struct rb_node *node)
 	return rb_entry(node, struct mnt_namespace, mnt_ns_tree_node);
 }
 
-static bool mnt_ns_less(struct rb_node *a, const struct rb_node *b)
+static int mnt_ns_cmp(struct rb_node *a, const struct rb_node *b)
 {
 	struct mnt_namespace *ns_a = node_to_mnt_ns(a);
 	struct mnt_namespace *ns_b = node_to_mnt_ns(b);
 	u64 seq_a = ns_a->seq;
+	u64 seq_b = ns_b->seq;
+
+	if (seq_a < seq_b)
+		return -1;
+	if (seq_a > seq_b)
+		return 1;
+	return 0;
+}
 
-	return mnt_ns_cmp(seq_a, ns_b) < 0;
+static inline void mnt_ns_tree_write_lock(void)
+{
+	write_lock(&mnt_ns_tree_lock);
+	write_seqcount_begin(&mnt_ns_tree_seqcount);
+}
+
+static inline void mnt_ns_tree_write_unlock(void)
+{
+	write_seqcount_end(&mnt_ns_tree_seqcount);
+	write_unlock(&mnt_ns_tree_lock);
 }
 
 static void mnt_ns_tree_add(struct mnt_namespace *ns)
 {
-	guard(write_lock)(&mnt_ns_tree_lock);
-	rb_add(&ns->mnt_ns_tree_node, &mnt_ns_tree, mnt_ns_less);
+	struct rb_node *node;
+
+	mnt_ns_tree_write_lock();
+	node = rb_find_add_rcu(&ns->mnt_ns_tree_node, &mnt_ns_tree, mnt_ns_cmp);
+	mnt_ns_tree_write_unlock();
+
+	WARN_ON_ONCE(node);
 }
 
 static void mnt_ns_release(struct mnt_namespace *ns)
@@ -150,41 +163,36 @@ static void mnt_ns_release(struct mnt_namespace *ns)
 }
 DEFINE_FREE(mnt_ns_release, struct mnt_namespace *, if (_T) mnt_ns_release(_T))
 
+static void mnt_ns_release_rcu(struct rcu_head *rcu)
+{
+	struct mnt_namespace *mnt_ns;
+
+	mnt_ns = container_of(rcu, struct mnt_namespace, mnt_ns_rcu);
+	mnt_ns_release(mnt_ns);
+}
+
 static void mnt_ns_tree_remove(struct mnt_namespace *ns)
 {
 	/* remove from global mount namespace list */
 	if (!is_anon_ns(ns)) {
-		guard(write_lock)(&mnt_ns_tree_lock);
+		mnt_ns_tree_write_lock();
 		rb_erase(&ns->mnt_ns_tree_node, &mnt_ns_tree);
+		mnt_ns_tree_write_unlock();
 	}
 
-	mnt_ns_release(ns);
+	call_rcu(&ns->mnt_ns_rcu, mnt_ns_release_rcu);
 }
 
-/*
- * Returns the mount namespace which either has the specified id, or has the
- * next smallest id afer the specified one.
- */
-static struct mnt_namespace *mnt_ns_find_id_at(u64 mnt_ns_id)
+static int mnt_ns_find(const void *key, const struct rb_node *node)
 {
-	struct rb_node *node = mnt_ns_tree.rb_node;
-	struct mnt_namespace *ret = NULL;
-
-	lockdep_assert_held(&mnt_ns_tree_lock);
-
-	while (node) {
-		struct mnt_namespace *n = node_to_mnt_ns(node);
+	const u64 mnt_ns_id = *(u64 *)key;
+	const struct mnt_namespace *ns = node_to_mnt_ns(node);
 
-		if (mnt_ns_id <= n->seq) {
-			ret = node_to_mnt_ns(node);
-			if (mnt_ns_id == n->seq)
-				break;
-			node = node->rb_left;
-		} else {
-			node = node->rb_right;
-		}
-	}
-	return ret;
+	if (mnt_ns_id < ns->seq)
+		return -1;
+	if (mnt_ns_id > ns->seq)
+		return 1;
+	return 0;
 }
 
 /*
@@ -194,18 +202,37 @@ static struct mnt_namespace *mnt_ns_find_id_at(u64 mnt_ns_id)
  * namespace the @namespace_sem must first be acquired. If the namespace has
  * already shut down before acquiring @namespace_sem, {list,stat}mount() will
  * see that the mount rbtree of the namespace is empty.
+ *
+ * Note the lookup is lockless protected by a sequence counter. We only
+ * need to guard against false negatives as false positives aren't
+ * possible. So if we didn't find a mount namespace and the sequence
+ * counter has changed we need to retry. If the sequence counter is
+ * still the same we know the search actually failed.
  */
 static struct mnt_namespace *lookup_mnt_ns(u64 mnt_ns_id)
 {
-       struct mnt_namespace *ns;
+	struct mnt_namespace *ns;
+	struct rb_node *node;
+	unsigned int seq;
+
+	guard(rcu)();
+	do {
+		seq = read_seqcount_begin(&mnt_ns_tree_seqcount);
+		node = rb_find_rcu(&mnt_ns_id, &mnt_ns_tree, mnt_ns_find);
+		if (node)
+			break;
+	} while (read_seqcount_retry(&mnt_ns_tree_seqcount, seq));
 
-       guard(read_lock)(&mnt_ns_tree_lock);
-       ns = mnt_ns_find_id_at(mnt_ns_id);
-       if (!ns || ns->seq != mnt_ns_id)
-               return NULL;
+	if (!node)
+		return NULL;
 
-       refcount_inc(&ns->passive);
-       return ns;
+	/*
+	 * The last reference count is put with RCU delay so we can
+	 * unconditonally acquire a reference here.
+	 */
+	ns = node_to_mnt_ns(node);
+	refcount_inc(&ns->passive);
+	return ns;
 }
 
 static inline void lock_mount_hash(void)

-- 
2.45.2


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 03/10] fs: lockless mntns rbtree lookup
  2024-12-12 23:03 ` [PATCH v3 03/10] fs: lockless mntns rbtree lookup Christian Brauner
@ 2024-12-13  8:50   ` Peter Zijlstra
  2024-12-13 14:11   ` Jeff Layton
  2024-12-19  9:20   ` Lai, Yi
  2 siblings, 0 replies; 20+ messages in thread
From: Peter Zijlstra @ 2024-12-13  8:50 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Josef Bacik, Jeff Layton, Paul E. McKenney, linux-fsdevel

On Fri, Dec 13, 2024 at 12:03:42AM +0100, Christian Brauner wrote:
> +static inline void mnt_ns_tree_write_lock(void)
> +{
> +	write_lock(&mnt_ns_tree_lock);
> +	write_seqcount_begin(&mnt_ns_tree_seqcount);
> +}
> +
> +static inline void mnt_ns_tree_write_unlock(void)
> +{
> +	write_seqcount_end(&mnt_ns_tree_seqcount);
> +	write_unlock(&mnt_ns_tree_lock);
>  }

>  static void mnt_ns_tree_add(struct mnt_namespace *ns)
>  {
> -	guard(write_lock)(&mnt_ns_tree_lock);
> -	rb_add(&ns->mnt_ns_tree_node, &mnt_ns_tree, mnt_ns_less);
> +	struct rb_node *node;
> +
> +	mnt_ns_tree_write_lock();
> +	node = rb_find_add_rcu(&ns->mnt_ns_tree_node, &mnt_ns_tree, mnt_ns_cmp);
> +	mnt_ns_tree_write_unlock();
> +
> +	WARN_ON_ONCE(node);
>  }

>  static void mnt_ns_tree_remove(struct mnt_namespace *ns)
>  {
>  	/* remove from global mount namespace list */
>  	if (!is_anon_ns(ns)) {
> -		guard(write_lock)(&mnt_ns_tree_lock);
> +		mnt_ns_tree_write_lock();
>  		rb_erase(&ns->mnt_ns_tree_node, &mnt_ns_tree);
> +		mnt_ns_tree_write_unlock();
>  	}
>  
> -	mnt_ns_release(ns);
> +	call_rcu(&ns->mnt_ns_rcu, mnt_ns_release_rcu);
>  }

Its probably not worth the effort, but I figured I'd mention it anyway,
if you do:

DEFINE_LOCK_GUARD_0(mnt_ns_tree_lock, mnt_ns_tree_lock(), mnt_ns_tree_unlock())

You can use: guard(mnt_ns_tree_lock)();

But like said, probably not worth it, given the above are the only two
sites and they're utterly trivial.


Anyway, rest of the patches look good now.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 03/10] fs: lockless mntns rbtree lookup
  2024-12-12 23:03 ` [PATCH v3 03/10] fs: lockless mntns rbtree lookup Christian Brauner
  2024-12-13  8:50   ` Peter Zijlstra
@ 2024-12-13 14:11   ` Jeff Layton
  2024-12-13 18:44     ` Christian Brauner
  2024-12-19  9:20   ` Lai, Yi
  2 siblings, 1 reply; 20+ messages in thread
From: Jeff Layton @ 2024-12-13 14:11 UTC (permalink / raw)
  To: Christian Brauner, Josef Bacik
  Cc: Paul E. McKenney, Peter Ziljstra, linux-fsdevel

On Fri, 2024-12-13 at 00:03 +0100, Christian Brauner wrote:
> Currently we use a read-write lock but for the simple search case we can
> make this lockless. Creating a new mount namespace is a rather rare
> event compared with querying mounts in a foreign mount namespace. Once
> this is picked up by e.g., systemd to list mounts in another mount in
> it's isolated services or in containers this will be used a lot so this
> seems worthwhile doing.
> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
>  fs/mount.h     |   5 ++-
>  fs/namespace.c | 119 +++++++++++++++++++++++++++++++++++----------------------
>  2 files changed, 77 insertions(+), 47 deletions(-)
> 
> diff --git a/fs/mount.h b/fs/mount.h
> index 185fc56afc13338f8185fe818051444d540cbd5b..36ead0e45e8aa7614c00001102563a711d9dae6e 100644
> --- a/fs/mount.h
> +++ b/fs/mount.h
> @@ -12,7 +12,10 @@ struct mnt_namespace {
>  	struct user_namespace	*user_ns;
>  	struct ucounts		*ucounts;
>  	u64			seq;	/* Sequence number to prevent loops */
> -	wait_queue_head_t poll;
> +	union {
> +		wait_queue_head_t	poll;
> +		struct rcu_head		mnt_ns_rcu;
> +	};
>  	u64 event;
>  	unsigned int		nr_mounts; /* # of mounts in the namespace */
>  	unsigned int		pending_mounts;
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 10fa18dd66018fadfdc9d18c59a851eed7bd55ad..52adee787eb1b6ee8831705b2b121854c3370fb3 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -79,6 +79,8 @@ static DECLARE_RWSEM(namespace_sem);
>  static HLIST_HEAD(unmounted);	/* protected by namespace_sem */
>  static LIST_HEAD(ex_mountpoints); /* protected by namespace_sem */
>  static DEFINE_RWLOCK(mnt_ns_tree_lock);
> +static seqcount_rwlock_t mnt_ns_tree_seqcount = SEQCNT_RWLOCK_ZERO(mnt_ns_tree_seqcount, &mnt_ns_tree_lock);
> +
>  static struct rb_root mnt_ns_tree = RB_ROOT; /* protected by mnt_ns_tree_lock */
>  
>  struct mount_kattr {
> @@ -105,17 +107,6 @@ EXPORT_SYMBOL_GPL(fs_kobj);
>   */
>  __cacheline_aligned_in_smp DEFINE_SEQLOCK(mount_lock);
>  
> -static int mnt_ns_cmp(u64 seq, const struct mnt_namespace *ns)
> -{
> -	u64 seq_b = ns->seq;
> -
> -	if (seq < seq_b)
> -		return -1;
> -	if (seq > seq_b)
> -		return 1;
> -	return 0;
> -}
> -
>  static inline struct mnt_namespace *node_to_mnt_ns(const struct rb_node *node)
>  {
>  	if (!node)
> @@ -123,19 +114,41 @@ static inline struct mnt_namespace *node_to_mnt_ns(const struct rb_node *node)
>  	return rb_entry(node, struct mnt_namespace, mnt_ns_tree_node);
>  }
>  
> -static bool mnt_ns_less(struct rb_node *a, const struct rb_node *b)
> +static int mnt_ns_cmp(struct rb_node *a, const struct rb_node *b)
>  {
>  	struct mnt_namespace *ns_a = node_to_mnt_ns(a);
>  	struct mnt_namespace *ns_b = node_to_mnt_ns(b);
>  	u64 seq_a = ns_a->seq;
> +	u64 seq_b = ns_b->seq;
> +
> +	if (seq_a < seq_b)
> +		return -1;
> +	if (seq_a > seq_b)
> +		return 1;
> +	return 0;
> +}
>  
> -	return mnt_ns_cmp(seq_a, ns_b) < 0;
> +static inline void mnt_ns_tree_write_lock(void)
> +{
> +	write_lock(&mnt_ns_tree_lock);
> +	write_seqcount_begin(&mnt_ns_tree_seqcount);
> +}
> +
> +static inline void mnt_ns_tree_write_unlock(void)
> +{
> +	write_seqcount_end(&mnt_ns_tree_seqcount);
> +	write_unlock(&mnt_ns_tree_lock);
>  }
>  
>  static void mnt_ns_tree_add(struct mnt_namespace *ns)
>  {
> -	guard(write_lock)(&mnt_ns_tree_lock);
> -	rb_add(&ns->mnt_ns_tree_node, &mnt_ns_tree, mnt_ns_less);
> +	struct rb_node *node;
> +
> +	mnt_ns_tree_write_lock();
> +	node = rb_find_add_rcu(&ns->mnt_ns_tree_node, &mnt_ns_tree, mnt_ns_cmp);
> +	mnt_ns_tree_write_unlock();
> +
> +	WARN_ON_ONCE(node);
>  }
>  
>  static void mnt_ns_release(struct mnt_namespace *ns)
> @@ -150,41 +163,36 @@ static void mnt_ns_release(struct mnt_namespace *ns)
>  }
>  DEFINE_FREE(mnt_ns_release, struct mnt_namespace *, if (_T) mnt_ns_release(_T))
>  
> +static void mnt_ns_release_rcu(struct rcu_head *rcu)
> +{
> +	struct mnt_namespace *mnt_ns;
> +
> +	mnt_ns = container_of(rcu, struct mnt_namespace, mnt_ns_rcu);
> +	mnt_ns_release(mnt_ns);
> +}
> +
>  static void mnt_ns_tree_remove(struct mnt_namespace *ns)
>  {
>  	/* remove from global mount namespace list */
>  	if (!is_anon_ns(ns)) {
> -		guard(write_lock)(&mnt_ns_tree_lock);
> +		mnt_ns_tree_write_lock();
>  		rb_erase(&ns->mnt_ns_tree_node, &mnt_ns_tree);
> +		mnt_ns_tree_write_unlock();
>  	}
>  
> -	mnt_ns_release(ns);
> +	call_rcu(&ns->mnt_ns_rcu, mnt_ns_release_rcu);
>  }
>  
> -/*
> - * Returns the mount namespace which either has the specified id, or has the
> - * next smallest id afer the specified one.
> - */
> -static struct mnt_namespace *mnt_ns_find_id_at(u64 mnt_ns_id)
> +static int mnt_ns_find(const void *key, const struct rb_node *node)
>  {
> -	struct rb_node *node = mnt_ns_tree.rb_node;
> -	struct mnt_namespace *ret = NULL;
> -
> -	lockdep_assert_held(&mnt_ns_tree_lock);
> -
> -	while (node) {
> -		struct mnt_namespace *n = node_to_mnt_ns(node);
> +	const u64 mnt_ns_id = *(u64 *)key;
> +	const struct mnt_namespace *ns = node_to_mnt_ns(node);
>  
> -		if (mnt_ns_id <= n->seq) {
> -			ret = node_to_mnt_ns(node);
> -			if (mnt_ns_id == n->seq)
> -				break;
> -			node = node->rb_left;
> -		} else {
> -			node = node->rb_right;
> -		}
> -	}
> -	return ret;
> +	if (mnt_ns_id < ns->seq)
> +		return -1;
> +	if (mnt_ns_id > ns->seq)
> +		return 1;
> +	return 0;
>  }
>  
>  /*
> @@ -194,18 +202,37 @@ static struct mnt_namespace *mnt_ns_find_id_at(u64 mnt_ns_id)
>   * namespace the @namespace_sem must first be acquired. If the namespace has
>   * already shut down before acquiring @namespace_sem, {list,stat}mount() will
>   * see that the mount rbtree of the namespace is empty.
> + *
> + * Note the lookup is lockless protected by a sequence counter. We only
> + * need to guard against false negatives as false positives aren't
> + * possible. So if we didn't find a mount namespace and the sequence
> + * counter has changed we need to retry. If the sequence counter is
> + * still the same we know the search actually failed.
>   */
>  static struct mnt_namespace *lookup_mnt_ns(u64 mnt_ns_id)
>  {
> -       struct mnt_namespace *ns;
> +	struct mnt_namespace *ns;
> +	struct rb_node *node;
> +	unsigned int seq;
> +
> +	guard(rcu)();
> +	do {
> +		seq = read_seqcount_begin(&mnt_ns_tree_seqcount);
> +		node = rb_find_rcu(&mnt_ns_id, &mnt_ns_tree, mnt_ns_find);
> +		if (node)
> +			break;
> +	} while (read_seqcount_retry(&mnt_ns_tree_seqcount, seq));
>  
> -       guard(read_lock)(&mnt_ns_tree_lock);
> -       ns = mnt_ns_find_id_at(mnt_ns_id);
> -       if (!ns || ns->seq != mnt_ns_id)
> -               return NULL;
> +	if (!node)
> +		return NULL;
>  
> -       refcount_inc(&ns->passive);
> -       return ns;
> +	/*
> +	 * The last reference count is put with RCU delay so we can
> +	 * unconditonally acquire a reference here.
> +	 */
> +	ns = node_to_mnt_ns(node);
> +	refcount_inc(&ns->passive);

I'm a little uneasy with the unconditional refcount_inc() here. It
seems quite possible that this could to a 0->1 transition here. You may
be right that that technically won't cause a problem with the rcu lock
held, but at the very least, that will cause a refcount warning to pop.

Maybe this should be a refcount_inc_not_zero() and then you just return
NULL if the increment doesn't occur?

> +	return ns;
>  }
>  
>  static inline void lock_mount_hash(void)
> 

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 03/10] fs: lockless mntns rbtree lookup
  2024-12-13 14:11   ` Jeff Layton
@ 2024-12-13 18:44     ` Christian Brauner
  2024-12-13 19:02       ` Jeff Layton
  0 siblings, 1 reply; 20+ messages in thread
From: Christian Brauner @ 2024-12-13 18:44 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Josef Bacik, Paul E. McKenney, Peter Ziljstra, linux-fsdevel

On Fri, Dec 13, 2024 at 09:11:41AM -0500, Jeff Layton wrote:
> On Fri, 2024-12-13 at 00:03 +0100, Christian Brauner wrote:
> > Currently we use a read-write lock but for the simple search case we can
> > make this lockless. Creating a new mount namespace is a rather rare
> > event compared with querying mounts in a foreign mount namespace. Once
> > this is picked up by e.g., systemd to list mounts in another mount in
> > it's isolated services or in containers this will be used a lot so this
> > seems worthwhile doing.
> > 
> > Signed-off-by: Christian Brauner <brauner@kernel.org>
> > ---
> >  fs/mount.h     |   5 ++-
> >  fs/namespace.c | 119 +++++++++++++++++++++++++++++++++++----------------------
> >  2 files changed, 77 insertions(+), 47 deletions(-)
> > 
> > diff --git a/fs/mount.h b/fs/mount.h
> > index 185fc56afc13338f8185fe818051444d540cbd5b..36ead0e45e8aa7614c00001102563a711d9dae6e 100644
> > --- a/fs/mount.h
> > +++ b/fs/mount.h
> > @@ -12,7 +12,10 @@ struct mnt_namespace {
> >  	struct user_namespace	*user_ns;
> >  	struct ucounts		*ucounts;
> >  	u64			seq;	/* Sequence number to prevent loops */
> > -	wait_queue_head_t poll;
> > +	union {
> > +		wait_queue_head_t	poll;
> > +		struct rcu_head		mnt_ns_rcu;
> > +	};
> >  	u64 event;
> >  	unsigned int		nr_mounts; /* # of mounts in the namespace */
> >  	unsigned int		pending_mounts;
> > diff --git a/fs/namespace.c b/fs/namespace.c
> > index 10fa18dd66018fadfdc9d18c59a851eed7bd55ad..52adee787eb1b6ee8831705b2b121854c3370fb3 100644
> > --- a/fs/namespace.c
> > +++ b/fs/namespace.c
> > @@ -79,6 +79,8 @@ static DECLARE_RWSEM(namespace_sem);
> >  static HLIST_HEAD(unmounted);	/* protected by namespace_sem */
> >  static LIST_HEAD(ex_mountpoints); /* protected by namespace_sem */
> >  static DEFINE_RWLOCK(mnt_ns_tree_lock);
> > +static seqcount_rwlock_t mnt_ns_tree_seqcount = SEQCNT_RWLOCK_ZERO(mnt_ns_tree_seqcount, &mnt_ns_tree_lock);
> > +
> >  static struct rb_root mnt_ns_tree = RB_ROOT; /* protected by mnt_ns_tree_lock */
> >  
> >  struct mount_kattr {
> > @@ -105,17 +107,6 @@ EXPORT_SYMBOL_GPL(fs_kobj);
> >   */
> >  __cacheline_aligned_in_smp DEFINE_SEQLOCK(mount_lock);
> >  
> > -static int mnt_ns_cmp(u64 seq, const struct mnt_namespace *ns)
> > -{
> > -	u64 seq_b = ns->seq;
> > -
> > -	if (seq < seq_b)
> > -		return -1;
> > -	if (seq > seq_b)
> > -		return 1;
> > -	return 0;
> > -}
> > -
> >  static inline struct mnt_namespace *node_to_mnt_ns(const struct rb_node *node)
> >  {
> >  	if (!node)
> > @@ -123,19 +114,41 @@ static inline struct mnt_namespace *node_to_mnt_ns(const struct rb_node *node)
> >  	return rb_entry(node, struct mnt_namespace, mnt_ns_tree_node);
> >  }
> >  
> > -static bool mnt_ns_less(struct rb_node *a, const struct rb_node *b)
> > +static int mnt_ns_cmp(struct rb_node *a, const struct rb_node *b)
> >  {
> >  	struct mnt_namespace *ns_a = node_to_mnt_ns(a);
> >  	struct mnt_namespace *ns_b = node_to_mnt_ns(b);
> >  	u64 seq_a = ns_a->seq;
> > +	u64 seq_b = ns_b->seq;
> > +
> > +	if (seq_a < seq_b)
> > +		return -1;
> > +	if (seq_a > seq_b)
> > +		return 1;
> > +	return 0;
> > +}
> >  
> > -	return mnt_ns_cmp(seq_a, ns_b) < 0;
> > +static inline void mnt_ns_tree_write_lock(void)
> > +{
> > +	write_lock(&mnt_ns_tree_lock);
> > +	write_seqcount_begin(&mnt_ns_tree_seqcount);
> > +}
> > +
> > +static inline void mnt_ns_tree_write_unlock(void)
> > +{
> > +	write_seqcount_end(&mnt_ns_tree_seqcount);
> > +	write_unlock(&mnt_ns_tree_lock);
> >  }
> >  
> >  static void mnt_ns_tree_add(struct mnt_namespace *ns)
> >  {
> > -	guard(write_lock)(&mnt_ns_tree_lock);
> > -	rb_add(&ns->mnt_ns_tree_node, &mnt_ns_tree, mnt_ns_less);
> > +	struct rb_node *node;
> > +
> > +	mnt_ns_tree_write_lock();
> > +	node = rb_find_add_rcu(&ns->mnt_ns_tree_node, &mnt_ns_tree, mnt_ns_cmp);
> > +	mnt_ns_tree_write_unlock();
> > +
> > +	WARN_ON_ONCE(node);
> >  }
> >  
> >  static void mnt_ns_release(struct mnt_namespace *ns)
> > @@ -150,41 +163,36 @@ static void mnt_ns_release(struct mnt_namespace *ns)
> >  }
> >  DEFINE_FREE(mnt_ns_release, struct mnt_namespace *, if (_T) mnt_ns_release(_T))
> >  
> > +static void mnt_ns_release_rcu(struct rcu_head *rcu)
> > +{
> > +	struct mnt_namespace *mnt_ns;
> > +
> > +	mnt_ns = container_of(rcu, struct mnt_namespace, mnt_ns_rcu);
> > +	mnt_ns_release(mnt_ns);
> > +}
> > +
> >  static void mnt_ns_tree_remove(struct mnt_namespace *ns)
> >  {
> >  	/* remove from global mount namespace list */
> >  	if (!is_anon_ns(ns)) {
> > -		guard(write_lock)(&mnt_ns_tree_lock);
> > +		mnt_ns_tree_write_lock();
> >  		rb_erase(&ns->mnt_ns_tree_node, &mnt_ns_tree);
> > +		mnt_ns_tree_write_unlock();
> >  	}
> >  
> > -	mnt_ns_release(ns);
> > +	call_rcu(&ns->mnt_ns_rcu, mnt_ns_release_rcu);
> >  }
> >  
> > -/*
> > - * Returns the mount namespace which either has the specified id, or has the
> > - * next smallest id afer the specified one.
> > - */
> > -static struct mnt_namespace *mnt_ns_find_id_at(u64 mnt_ns_id)
> > +static int mnt_ns_find(const void *key, const struct rb_node *node)
> >  {
> > -	struct rb_node *node = mnt_ns_tree.rb_node;
> > -	struct mnt_namespace *ret = NULL;
> > -
> > -	lockdep_assert_held(&mnt_ns_tree_lock);
> > -
> > -	while (node) {
> > -		struct mnt_namespace *n = node_to_mnt_ns(node);
> > +	const u64 mnt_ns_id = *(u64 *)key;
> > +	const struct mnt_namespace *ns = node_to_mnt_ns(node);
> >  
> > -		if (mnt_ns_id <= n->seq) {
> > -			ret = node_to_mnt_ns(node);
> > -			if (mnt_ns_id == n->seq)
> > -				break;
> > -			node = node->rb_left;
> > -		} else {
> > -			node = node->rb_right;
> > -		}
> > -	}
> > -	return ret;
> > +	if (mnt_ns_id < ns->seq)
> > +		return -1;
> > +	if (mnt_ns_id > ns->seq)
> > +		return 1;
> > +	return 0;
> >  }
> >  
> >  /*
> > @@ -194,18 +202,37 @@ static struct mnt_namespace *mnt_ns_find_id_at(u64 mnt_ns_id)
> >   * namespace the @namespace_sem must first be acquired. If the namespace has
> >   * already shut down before acquiring @namespace_sem, {list,stat}mount() will
> >   * see that the mount rbtree of the namespace is empty.
> > + *
> > + * Note the lookup is lockless protected by a sequence counter. We only
> > + * need to guard against false negatives as false positives aren't
> > + * possible. So if we didn't find a mount namespace and the sequence
> > + * counter has changed we need to retry. If the sequence counter is
> > + * still the same we know the search actually failed.
> >   */
> >  static struct mnt_namespace *lookup_mnt_ns(u64 mnt_ns_id)
> >  {
> > -       struct mnt_namespace *ns;
> > +	struct mnt_namespace *ns;
> > +	struct rb_node *node;
> > +	unsigned int seq;
> > +
> > +	guard(rcu)();
> > +	do {
> > +		seq = read_seqcount_begin(&mnt_ns_tree_seqcount);
> > +		node = rb_find_rcu(&mnt_ns_id, &mnt_ns_tree, mnt_ns_find);
> > +		if (node)
> > +			break;
> > +	} while (read_seqcount_retry(&mnt_ns_tree_seqcount, seq));
> >  
> > -       guard(read_lock)(&mnt_ns_tree_lock);
> > -       ns = mnt_ns_find_id_at(mnt_ns_id);
> > -       if (!ns || ns->seq != mnt_ns_id)
> > -               return NULL;
> > +	if (!node)
> > +		return NULL;
> >  
> > -       refcount_inc(&ns->passive);
> > -       return ns;
> > +	/*
> > +	 * The last reference count is put with RCU delay so we can
> > +	 * unconditonally acquire a reference here.
> > +	 */
> > +	ns = node_to_mnt_ns(node);
> > +	refcount_inc(&ns->passive);
> 
> I'm a little uneasy with the unconditional refcount_inc() here. It
> seems quite possible that this could to a 0->1 transition here. You may
> be right that that technically won't cause a problem with the rcu lock
> held, but at the very least, that will cause a refcount warning to pop.
> 
> Maybe this should be a refcount_inc_not_zero() and then you just return
> NULL if the increment doesn't occur?

So this shouldn't be possible (and Paul is on the thread and can tell me
if I'm wrong) because:

call_rcu(&ns->mnt_ns_rcu, mnt_ns_release_rcu);
-> mnt_ns_release_rcu()
   -> mnt_ns_release()
      -> refcount_dec_and_test()

Which means that decrements are RCU delayed. Any reader must walk the
list holding the RCU lock. If they find the mount namespace still on the
list then mnt_ns_release() will be deferred until they are done.

In order for what you describe to happen a reader must find that mount
namespace still in the list or rbtree and mnt_ns_release() is called
directly.

But afaict that doesn't happen. mnt_ns_release() is only called directly
when the mount namespace has never been on any of the lists.

refcount_inc() will already WARN() if the previous value was 0 and
splat. If we use refcount_inc_not_zero() we're adding additional memory
barriers when we simply don't need them.

But if that's important to you though than I'd rather switch the passive
count to atomic_t and use atomic_dec_and_test() in mnt_ns_release() and
then use similar logic as I used for file_ref_inc(): 

unsigned long prior = atomic_fetch_inc_relaxed(&ns->passive);
WARN_ONCE(!prior, "increment from zero UAF condition present");

This would give us the opportunity to catch a "shouldn't happen
condition". But using refcount_inc_not_zero() would e.g., confuse me a
few months down the line (Yes, we could probably add a comment.).

> 
> > +	return ns;
> >  }
> >  
> >  static inline void lock_mount_hash(void)
> > 
> 
> -- 
> Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 03/10] fs: lockless mntns rbtree lookup
  2024-12-13 18:44     ` Christian Brauner
@ 2024-12-13 19:02       ` Jeff Layton
  0 siblings, 0 replies; 20+ messages in thread
From: Jeff Layton @ 2024-12-13 19:02 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Josef Bacik, Paul E. McKenney, Peter Ziljstra, linux-fsdevel

On Fri, 2024-12-13 at 19:44 +0100, Christian Brauner wrote:
> On Fri, Dec 13, 2024 at 09:11:41AM -0500, Jeff Layton wrote:
> > On Fri, 2024-12-13 at 00:03 +0100, Christian Brauner wrote:
> > > Currently we use a read-write lock but for the simple search case we can
> > > make this lockless. Creating a new mount namespace is a rather rare
> > > event compared with querying mounts in a foreign mount namespace. Once
> > > this is picked up by e.g., systemd to list mounts in another mount in
> > > it's isolated services or in containers this will be used a lot so this
> > > seems worthwhile doing.
> > > 
> > > Signed-off-by: Christian Brauner <brauner@kernel.org>
> > > ---
> > >  fs/mount.h     |   5 ++-
> > >  fs/namespace.c | 119 +++++++++++++++++++++++++++++++++++----------------------
> > >  2 files changed, 77 insertions(+), 47 deletions(-)
> > > 
> > > diff --git a/fs/mount.h b/fs/mount.h
> > > index 185fc56afc13338f8185fe818051444d540cbd5b..36ead0e45e8aa7614c00001102563a711d9dae6e 100644
> > > --- a/fs/mount.h
> > > +++ b/fs/mount.h
> > > @@ -12,7 +12,10 @@ struct mnt_namespace {
> > >  	struct user_namespace	*user_ns;
> > >  	struct ucounts		*ucounts;
> > >  	u64			seq;	/* Sequence number to prevent loops */
> > > -	wait_queue_head_t poll;
> > > +	union {
> > > +		wait_queue_head_t	poll;
> > > +		struct rcu_head		mnt_ns_rcu;
> > > +	};
> > >  	u64 event;
> > >  	unsigned int		nr_mounts; /* # of mounts in the namespace */
> > >  	unsigned int		pending_mounts;
> > > diff --git a/fs/namespace.c b/fs/namespace.c
> > > index 10fa18dd66018fadfdc9d18c59a851eed7bd55ad..52adee787eb1b6ee8831705b2b121854c3370fb3 100644
> > > --- a/fs/namespace.c
> > > +++ b/fs/namespace.c
> > > @@ -79,6 +79,8 @@ static DECLARE_RWSEM(namespace_sem);
> > >  static HLIST_HEAD(unmounted);	/* protected by namespace_sem */
> > >  static LIST_HEAD(ex_mountpoints); /* protected by namespace_sem */
> > >  static DEFINE_RWLOCK(mnt_ns_tree_lock);
> > > +static seqcount_rwlock_t mnt_ns_tree_seqcount = SEQCNT_RWLOCK_ZERO(mnt_ns_tree_seqcount, &mnt_ns_tree_lock);
> > > +
> > >  static struct rb_root mnt_ns_tree = RB_ROOT; /* protected by mnt_ns_tree_lock */
> > >  
> > >  struct mount_kattr {
> > > @@ -105,17 +107,6 @@ EXPORT_SYMBOL_GPL(fs_kobj);
> > >   */
> > >  __cacheline_aligned_in_smp DEFINE_SEQLOCK(mount_lock);
> > >  
> > > -static int mnt_ns_cmp(u64 seq, const struct mnt_namespace *ns)
> > > -{
> > > -	u64 seq_b = ns->seq;
> > > -
> > > -	if (seq < seq_b)
> > > -		return -1;
> > > -	if (seq > seq_b)
> > > -		return 1;
> > > -	return 0;
> > > -}
> > > -
> > >  static inline struct mnt_namespace *node_to_mnt_ns(const struct rb_node *node)
> > >  {
> > >  	if (!node)
> > > @@ -123,19 +114,41 @@ static inline struct mnt_namespace *node_to_mnt_ns(const struct rb_node *node)
> > >  	return rb_entry(node, struct mnt_namespace, mnt_ns_tree_node);
> > >  }
> > >  
> > > -static bool mnt_ns_less(struct rb_node *a, const struct rb_node *b)
> > > +static int mnt_ns_cmp(struct rb_node *a, const struct rb_node *b)
> > >  {
> > >  	struct mnt_namespace *ns_a = node_to_mnt_ns(a);
> > >  	struct mnt_namespace *ns_b = node_to_mnt_ns(b);
> > >  	u64 seq_a = ns_a->seq;
> > > +	u64 seq_b = ns_b->seq;
> > > +
> > > +	if (seq_a < seq_b)
> > > +		return -1;
> > > +	if (seq_a > seq_b)
> > > +		return 1;
> > > +	return 0;
> > > +}
> > >  
> > > -	return mnt_ns_cmp(seq_a, ns_b) < 0;
> > > +static inline void mnt_ns_tree_write_lock(void)
> > > +{
> > > +	write_lock(&mnt_ns_tree_lock);
> > > +	write_seqcount_begin(&mnt_ns_tree_seqcount);
> > > +}
> > > +
> > > +static inline void mnt_ns_tree_write_unlock(void)
> > > +{
> > > +	write_seqcount_end(&mnt_ns_tree_seqcount);
> > > +	write_unlock(&mnt_ns_tree_lock);
> > >  }
> > >  
> > >  static void mnt_ns_tree_add(struct mnt_namespace *ns)
> > >  {
> > > -	guard(write_lock)(&mnt_ns_tree_lock);
> > > -	rb_add(&ns->mnt_ns_tree_node, &mnt_ns_tree, mnt_ns_less);
> > > +	struct rb_node *node;
> > > +
> > > +	mnt_ns_tree_write_lock();
> > > +	node = rb_find_add_rcu(&ns->mnt_ns_tree_node, &mnt_ns_tree, mnt_ns_cmp);
> > > +	mnt_ns_tree_write_unlock();
> > > +
> > > +	WARN_ON_ONCE(node);
> > >  }
> > >  
> > >  static void mnt_ns_release(struct mnt_namespace *ns)
> > > @@ -150,41 +163,36 @@ static void mnt_ns_release(struct mnt_namespace *ns)
> > >  }
> > >  DEFINE_FREE(mnt_ns_release, struct mnt_namespace *, if (_T) mnt_ns_release(_T))
> > >  
> > > +static void mnt_ns_release_rcu(struct rcu_head *rcu)
> > > +{
> > > +	struct mnt_namespace *mnt_ns;
> > > +
> > > +	mnt_ns = container_of(rcu, struct mnt_namespace, mnt_ns_rcu);
> > > +	mnt_ns_release(mnt_ns);
> > > +}
> > > +
> > >  static void mnt_ns_tree_remove(struct mnt_namespace *ns)
> > >  {
> > >  	/* remove from global mount namespace list */
> > >  	if (!is_anon_ns(ns)) {
> > > -		guard(write_lock)(&mnt_ns_tree_lock);
> > > +		mnt_ns_tree_write_lock();
> > >  		rb_erase(&ns->mnt_ns_tree_node, &mnt_ns_tree);
> > > +		mnt_ns_tree_write_unlock();
> > >  	}
> > >  
> > > -	mnt_ns_release(ns);
> > > +	call_rcu(&ns->mnt_ns_rcu, mnt_ns_release_rcu);
> > >  }
> > >  
> > > -/*
> > > - * Returns the mount namespace which either has the specified id, or has the
> > > - * next smallest id afer the specified one.
> > > - */
> > > -static struct mnt_namespace *mnt_ns_find_id_at(u64 mnt_ns_id)
> > > +static int mnt_ns_find(const void *key, const struct rb_node *node)
> > >  {
> > > -	struct rb_node *node = mnt_ns_tree.rb_node;
> > > -	struct mnt_namespace *ret = NULL;
> > > -
> > > -	lockdep_assert_held(&mnt_ns_tree_lock);
> > > -
> > > -	while (node) {
> > > -		struct mnt_namespace *n = node_to_mnt_ns(node);
> > > +	const u64 mnt_ns_id = *(u64 *)key;
> > > +	const struct mnt_namespace *ns = node_to_mnt_ns(node);
> > >  
> > > -		if (mnt_ns_id <= n->seq) {
> > > -			ret = node_to_mnt_ns(node);
> > > -			if (mnt_ns_id == n->seq)
> > > -				break;
> > > -			node = node->rb_left;
> > > -		} else {
> > > -			node = node->rb_right;
> > > -		}
> > > -	}
> > > -	return ret;
> > > +	if (mnt_ns_id < ns->seq)
> > > +		return -1;
> > > +	if (mnt_ns_id > ns->seq)
> > > +		return 1;
> > > +	return 0;
> > >  }
> > >  
> > >  /*
> > > @@ -194,18 +202,37 @@ static struct mnt_namespace *mnt_ns_find_id_at(u64 mnt_ns_id)
> > >   * namespace the @namespace_sem must first be acquired. If the namespace has
> > >   * already shut down before acquiring @namespace_sem, {list,stat}mount() will
> > >   * see that the mount rbtree of the namespace is empty.
> > > + *
> > > + * Note the lookup is lockless protected by a sequence counter. We only
> > > + * need to guard against false negatives as false positives aren't
> > > + * possible. So if we didn't find a mount namespace and the sequence
> > > + * counter has changed we need to retry. If the sequence counter is
> > > + * still the same we know the search actually failed.
> > >   */
> > >  static struct mnt_namespace *lookup_mnt_ns(u64 mnt_ns_id)
> > >  {
> > > -       struct mnt_namespace *ns;
> > > +	struct mnt_namespace *ns;
> > > +	struct rb_node *node;
> > > +	unsigned int seq;
> > > +
> > > +	guard(rcu)();
> > > +	do {
> > > +		seq = read_seqcount_begin(&mnt_ns_tree_seqcount);
> > > +		node = rb_find_rcu(&mnt_ns_id, &mnt_ns_tree, mnt_ns_find);
> > > +		if (node)
> > > +			break;
> > > +	} while (read_seqcount_retry(&mnt_ns_tree_seqcount, seq));
> > >  
> > > -       guard(read_lock)(&mnt_ns_tree_lock);
> > > -       ns = mnt_ns_find_id_at(mnt_ns_id);
> > > -       if (!ns || ns->seq != mnt_ns_id)
> > > -               return NULL;
> > > +	if (!node)
> > > +		return NULL;
> > >  
> > > -       refcount_inc(&ns->passive);
> > > -       return ns;
> > > +	/*
> > > +	 * The last reference count is put with RCU delay so we can
> > > +	 * unconditonally acquire a reference here.
> > > +	 */
> > > +	ns = node_to_mnt_ns(node);
> > > +	refcount_inc(&ns->passive);
> > 
> > I'm a little uneasy with the unconditional refcount_inc() here. It
> > seems quite possible that this could to a 0->1 transition here. You may
> > be right that that technically won't cause a problem with the rcu lock
> > held, but at the very least, that will cause a refcount warning to pop.
> > 
> > Maybe this should be a refcount_inc_not_zero() and then you just return
> > NULL if the increment doesn't occur?
> 
> So this shouldn't be possible (and Paul is on the thread and can tell me
> if I'm wrong) because:
> 
> call_rcu(&ns->mnt_ns_rcu, mnt_ns_release_rcu);
> -> mnt_ns_release_rcu()
>    -> mnt_ns_release()
>       -> refcount_dec_and_test()
> 
> Which means that decrements are RCU delayed. Any reader must walk the
> list holding the RCU lock. If they find the mount namespace still on the
> list then mnt_ns_release() will be deferred until they are done.
> 
>
> In order for what you describe to happen a reader must find that mount
> namespace still in the list or rbtree and mnt_ns_release() is called
> directly.
> 
> But afaict that doesn't happen. mnt_ns_release() is only called directly
> when the mount namespace has never been on any of the lists.
> 

> refcount_inc() will already WARN() if the previous value was 0 and
> splat. If we use refcount_inc_not_zero() we're adding additional memory
> barriers when we simply don't need them.

Ahh ok, I missed the detail that the refcount decrement was being done
in the RCU callback. With that, I guess that can't ever happen. Sorry
for the noise!

> But if that's important to you though than I'd rather switch the passive
> count to atomic_t and use atomic_dec_and_test() in mnt_ns_release() and
> then use similar logic as I used for file_ref_inc(): 
> 
> unsigned long prior = atomic_fetch_inc_relaxed(&ns->passive);
> WARN_ONCE(!prior, "increment from zero UAF condition present");
> 
> This would give us the opportunity to catch a "shouldn't happen
> condition". But using refcount_inc_not_zero() would e.g., confuse me a
> few months down the line (Yes, we could probably add a comment.).
> 

I don't feel strongly about it, but a comment explaining why you can
unconditionally bump it there would be helpful.

> > 
> > > +	return ns;
> > >  }
> > >  
> > >  static inline void lock_mount_hash(void)
> > > 



-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 03/10] fs: lockless mntns rbtree lookup
  2024-12-12 23:03 ` [PATCH v3 03/10] fs: lockless mntns rbtree lookup Christian Brauner
  2024-12-13  8:50   ` Peter Zijlstra
  2024-12-13 14:11   ` Jeff Layton
@ 2024-12-19  9:20   ` Lai, Yi
  2024-12-19 13:45     ` Christian Brauner
  2 siblings, 1 reply; 20+ messages in thread
From: Lai, Yi @ 2024-12-19  9:20 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Josef Bacik, Jeff Layton, Paul E. McKenney, Peter Ziljstra,
	linux-fsdevel, yi1.lai

On Fri, Dec 13, 2024 at 12:03:42AM +0100, Christian Brauner wrote:
> Currently we use a read-write lock but for the simple search case we can
> make this lockless. Creating a new mount namespace is a rather rare
> event compared with querying mounts in a foreign mount namespace. Once
> this is picked up by e.g., systemd to list mounts in another mount in
> it's isolated services or in containers this will be used a lot so this
> seems worthwhile doing.
> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
>  fs/mount.h     |   5 ++-
>  fs/namespace.c | 119 +++++++++++++++++++++++++++++++++++----------------------
>  2 files changed, 77 insertions(+), 47 deletions(-)
> 
> diff --git a/fs/mount.h b/fs/mount.h
> index 185fc56afc13338f8185fe818051444d540cbd5b..36ead0e45e8aa7614c00001102563a711d9dae6e 100644
> --- a/fs/mount.h
> +++ b/fs/mount.h
> @@ -12,7 +12,10 @@ struct mnt_namespace {
>  	struct user_namespace	*user_ns;
>  	struct ucounts		*ucounts;
>  	u64			seq;	/* Sequence number to prevent loops */
> -	wait_queue_head_t poll;
> +	union {
> +		wait_queue_head_t	poll;
> +		struct rcu_head		mnt_ns_rcu;
> +	};
>  	u64 event;
>  	unsigned int		nr_mounts; /* # of mounts in the namespace */
>  	unsigned int		pending_mounts;
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 10fa18dd66018fadfdc9d18c59a851eed7bd55ad..52adee787eb1b6ee8831705b2b121854c3370fb3 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -79,6 +79,8 @@ static DECLARE_RWSEM(namespace_sem);
>  static HLIST_HEAD(unmounted);	/* protected by namespace_sem */
>  static LIST_HEAD(ex_mountpoints); /* protected by namespace_sem */
>  static DEFINE_RWLOCK(mnt_ns_tree_lock);
> +static seqcount_rwlock_t mnt_ns_tree_seqcount = SEQCNT_RWLOCK_ZERO(mnt_ns_tree_seqcount, &mnt_ns_tree_lock);
> +
>  static struct rb_root mnt_ns_tree = RB_ROOT; /* protected by mnt_ns_tree_lock */
>  
>  struct mount_kattr {
> @@ -105,17 +107,6 @@ EXPORT_SYMBOL_GPL(fs_kobj);
>   */
>  __cacheline_aligned_in_smp DEFINE_SEQLOCK(mount_lock);
>  
> -static int mnt_ns_cmp(u64 seq, const struct mnt_namespace *ns)
> -{
> -	u64 seq_b = ns->seq;
> -
> -	if (seq < seq_b)
> -		return -1;
> -	if (seq > seq_b)
> -		return 1;
> -	return 0;
> -}
> -
>  static inline struct mnt_namespace *node_to_mnt_ns(const struct rb_node *node)
>  {
>  	if (!node)
> @@ -123,19 +114,41 @@ static inline struct mnt_namespace *node_to_mnt_ns(const struct rb_node *node)
>  	return rb_entry(node, struct mnt_namespace, mnt_ns_tree_node);
>  }
>  
> -static bool mnt_ns_less(struct rb_node *a, const struct rb_node *b)
> +static int mnt_ns_cmp(struct rb_node *a, const struct rb_node *b)
>  {
>  	struct mnt_namespace *ns_a = node_to_mnt_ns(a);
>  	struct mnt_namespace *ns_b = node_to_mnt_ns(b);
>  	u64 seq_a = ns_a->seq;
> +	u64 seq_b = ns_b->seq;
> +
> +	if (seq_a < seq_b)
> +		return -1;
> +	if (seq_a > seq_b)
> +		return 1;
> +	return 0;
> +}
>  
> -	return mnt_ns_cmp(seq_a, ns_b) < 0;
> +static inline void mnt_ns_tree_write_lock(void)
> +{
> +	write_lock(&mnt_ns_tree_lock);
> +	write_seqcount_begin(&mnt_ns_tree_seqcount);
> +}
> +
> +static inline void mnt_ns_tree_write_unlock(void)
> +{
> +	write_seqcount_end(&mnt_ns_tree_seqcount);
> +	write_unlock(&mnt_ns_tree_lock);
>  }
>  
>  static void mnt_ns_tree_add(struct mnt_namespace *ns)
>  {
> -	guard(write_lock)(&mnt_ns_tree_lock);
> -	rb_add(&ns->mnt_ns_tree_node, &mnt_ns_tree, mnt_ns_less);
> +	struct rb_node *node;
> +
> +	mnt_ns_tree_write_lock();
> +	node = rb_find_add_rcu(&ns->mnt_ns_tree_node, &mnt_ns_tree, mnt_ns_cmp);
> +	mnt_ns_tree_write_unlock();
> +
> +	WARN_ON_ONCE(node);
>  }
>  
>  static void mnt_ns_release(struct mnt_namespace *ns)
> @@ -150,41 +163,36 @@ static void mnt_ns_release(struct mnt_namespace *ns)
>  }
>  DEFINE_FREE(mnt_ns_release, struct mnt_namespace *, if (_T) mnt_ns_release(_T))
>  
> +static void mnt_ns_release_rcu(struct rcu_head *rcu)
> +{
> +	struct mnt_namespace *mnt_ns;
> +
> +	mnt_ns = container_of(rcu, struct mnt_namespace, mnt_ns_rcu);
> +	mnt_ns_release(mnt_ns);
> +}
> +
>  static void mnt_ns_tree_remove(struct mnt_namespace *ns)
>  {
>  	/* remove from global mount namespace list */
>  	if (!is_anon_ns(ns)) {
> -		guard(write_lock)(&mnt_ns_tree_lock);
> +		mnt_ns_tree_write_lock();
>  		rb_erase(&ns->mnt_ns_tree_node, &mnt_ns_tree);
> +		mnt_ns_tree_write_unlock();
>  	}
>  
> -	mnt_ns_release(ns);
> +	call_rcu(&ns->mnt_ns_rcu, mnt_ns_release_rcu);
>  }
>  
> -/*
> - * Returns the mount namespace which either has the specified id, or has the
> - * next smallest id afer the specified one.
> - */
> -static struct mnt_namespace *mnt_ns_find_id_at(u64 mnt_ns_id)
> +static int mnt_ns_find(const void *key, const struct rb_node *node)
>  {
> -	struct rb_node *node = mnt_ns_tree.rb_node;
> -	struct mnt_namespace *ret = NULL;
> -
> -	lockdep_assert_held(&mnt_ns_tree_lock);
> -
> -	while (node) {
> -		struct mnt_namespace *n = node_to_mnt_ns(node);
> +	const u64 mnt_ns_id = *(u64 *)key;
> +	const struct mnt_namespace *ns = node_to_mnt_ns(node);
>  
> -		if (mnt_ns_id <= n->seq) {
> -			ret = node_to_mnt_ns(node);
> -			if (mnt_ns_id == n->seq)
> -				break;
> -			node = node->rb_left;
> -		} else {
> -			node = node->rb_right;
> -		}
> -	}
> -	return ret;
> +	if (mnt_ns_id < ns->seq)
> +		return -1;
> +	if (mnt_ns_id > ns->seq)
> +		return 1;
> +	return 0;
>  }
>  
>  /*
> @@ -194,18 +202,37 @@ static struct mnt_namespace *mnt_ns_find_id_at(u64 mnt_ns_id)
>   * namespace the @namespace_sem must first be acquired. If the namespace has
>   * already shut down before acquiring @namespace_sem, {list,stat}mount() will
>   * see that the mount rbtree of the namespace is empty.
> + *
> + * Note the lookup is lockless protected by a sequence counter. We only
> + * need to guard against false negatives as false positives aren't
> + * possible. So if we didn't find a mount namespace and the sequence
> + * counter has changed we need to retry. If the sequence counter is
> + * still the same we know the search actually failed.
>   */
>  static struct mnt_namespace *lookup_mnt_ns(u64 mnt_ns_id)
>  {
> -       struct mnt_namespace *ns;
> +	struct mnt_namespace *ns;
> +	struct rb_node *node;
> +	unsigned int seq;
> +
> +	guard(rcu)();
> +	do {
> +		seq = read_seqcount_begin(&mnt_ns_tree_seqcount);
> +		node = rb_find_rcu(&mnt_ns_id, &mnt_ns_tree, mnt_ns_find);
> +		if (node)
> +			break;
> +	} while (read_seqcount_retry(&mnt_ns_tree_seqcount, seq));
>  
> -       guard(read_lock)(&mnt_ns_tree_lock);
> -       ns = mnt_ns_find_id_at(mnt_ns_id);
> -       if (!ns || ns->seq != mnt_ns_id)
> -               return NULL;
> +	if (!node)
> +		return NULL;
>  
> -       refcount_inc(&ns->passive);
> -       return ns;
> +	/*
> +	 * The last reference count is put with RCU delay so we can
> +	 * unconditonally acquire a reference here.
> +	 */
> +	ns = node_to_mnt_ns(node);
> +	refcount_inc(&ns->passive);
> +	return ns;
>  }
>  
>  static inline void lock_mount_hash(void)
> 
> -- 
> 2.45.2
>

Hi Christian Brauner ,

Greetings!

I used Syzkaller and found that there is WARNING in mnt_ns_release in linux v6.13-rc3.

After bisection and the first bad commit is:
"
5eda70f550d7 fs: lockless mntns rbtree lookup
"

All detailed into can be found at:
https://github.com/laifryiee/syzkaller_logs/tree/main/241219_115855_mnt_ns_release
Syzkaller repro code:
https://github.com/laifryiee/syzkaller_logs/tree/main/241219_115855_mnt_ns_release/repro.c
Syzkaller repro syscall steps:
https://github.com/laifryiee/syzkaller_logs/tree/main/241219_115855_mnt_ns_release/repro.prog
Syzkaller report:
https://github.com/laifryiee/syzkaller_logs/tree/main/241219_115855_mnt_ns_release/repro.report
Kconfig(make olddefconfig):
https://github.com/laifryiee/syzkaller_logs/tree/main/241219_115855_mnt_ns_release/kconfig_origin
Bisect info:
https://github.com/laifryiee/syzkaller_logs/tree/main/241219_115855_mnt_ns_release/bisect_info.log
bzImage:
https://github.com/laifryiee/syzkaller_logs/raw/refs/heads/main/241219_115855_mnt_ns_release/bzImage_fdb298fa865b0136f7be842e6c2e6310dede421a
Issue dmesg:
https://github.com/laifryiee/syzkaller_logs/blob/main/241219_115855_mnt_ns_release/fdb298fa865b0136f7be842e6c2e6310dede421a_dmesg.log

"
[  268.453608] ------------[ cut here ]------------
[  268.453889] WARNING: CPU: 0 PID: 10683 at fs/namespace.c:163 mnt_ns_release+0x18d/0x200
[  268.454274] Modules linked in:
[  268.454431] CPU: 0 UID: 0 PID: 10683 Comm: repro Not tainted 6.13.0-rc3-next-20241217-fdb298fa865b #1
[  268.454857] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qe4
[  268.455380] RIP: 0010:mnt_ns_release+0x18d/0x200
[  268.455604] Code: ff ff 48 c7 c7 30 67 29 87 e8 ff ea ac 03 bf 01 00 00 00 89 c3 89 c6 e8 91 81 7e ff 83 fb 019
[  268.456446] RSP: 0018:ffff88806c409df8 EFLAGS: 00010246
[  268.456693] RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffffffff820995df
[  268.457023] RDX: ffff888020b40000 RSI: ffffffff820995ed RDI: 0000000000000005
[  268.457359] RBP: ffff88806c409e18 R08: ffff888017a66e48 R09: fffffbfff15085a3
[  268.457861] R10: 0000000000000001 R11: 0000000000000001 R12: ffff888017a66e00
[  268.458188] R13: ffff88806c409eb8 R14: ffffffff816e536a R15: 0000000000000003
[  268.458523] FS:  00007f7f8ffbf600(0000) GS:ffff88806c400000(0000) knlGS:0000000000000000
[  268.458880] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  268.459150] CR2: 0000000000000000 CR3: 0000000017be6005 CR4: 0000000000770ef0
[  268.459486] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  268.459808] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
[  268.460136] PKRU: 55555554
[  268.460271] Call Trace:
[  268.460394]  <IRQ>
[  268.460500]  ? show_regs+0x6d/0x80
[  268.460671]  ? __warn+0xf3/0x390
[  268.460830]  ? report_bug+0x25e/0x4b0
[  268.461008]  ? mnt_ns_release+0x18d/0x200
[  268.461197]  ? report_bug+0x2cb/0x4b0
[  268.461377]  ? mnt_ns_release+0x18d/0x200
[  268.461587]  ? mnt_ns_release+0x18e/0x200
[  268.461781]  ? handle_bug+0xf1/0x190
[  268.461961]  ? exc_invalid_op+0x3c/0x80
[  268.462152]  ? asm_exc_invalid_op+0x1f/0x30
[  268.462357]  ? rcu_core+0x86a/0x1920
[  268.462539]  ? mnt_ns_release+0x17f/0x200
[  268.462733]  ? mnt_ns_release+0x18d/0x200
[  268.462929]  ? mnt_ns_release+0x18d/0x200
[  268.463118]  ? mnt_ns_release+0x18d/0x200
[  268.463313]  ? rcu_core+0x86a/0x1920
[  268.463491]  mnt_ns_release_rcu+0x1f/0x30
[  268.463688]  rcu_core+0x86c/0x1920
[  268.463863]  ? __pfx_rcu_core+0x10/0x10
[  268.464058]  rcu_core_si+0x12/0x20
[  268.464229]  handle_softirqs+0x1c5/0x860
[  268.464431]  __irq_exit_rcu+0x10e/0x170
[  268.464620]  irq_exit_rcu+0x12/0x30
[  268.464795]  sysvec_apic_timer_interrupt+0xb4/0xd0
[  268.465029]  </IRQ>
[  268.465139]  <TASK>
[  268.465251]  asm_sysvec_apic_timer_interrupt+0x1f/0x30
[  268.465517] RIP: 0010:mnt_ns_tree_add+0xbe/0x490
[  268.465745] Code: 0f 85 34 03 00 00 49 bd 00 00 00 00 00 fc ff df 4d 8b 64 24 38 e8 c2 7a 7e ff 48 8d 7b 98 48d
[  268.466596] RSP: 0018:ffff88801f1b7cc8 EFLAGS: 00000246
[  268.466845] RAX: 1ffff11001afa347 RBX: ffff88800d7d1aa0 RCX: ffffffff8209996c
[  268.467180] RDX: ffff888020b40000 RSI: ffffffff8209992e RDI: ffff88800d7d1a38
[  268.467517] RBP: ffff88801f1b7cf8 R08: 0000000000000000 R09: fffffbfff15085a2
[  268.467844] R10: 0000000000000008 R11: 0000000000000001 R12: 0000000000002725
[  268.468167] R13: dffffc0000000000 R14: ffff888014a71c00 R15: ffff88800d7d12a8
[  268.468501]  ? mnt_ns_tree_add+0xec/0x490
[  268.468697]  ? mnt_ns_tree_add+0xae/0x490
[  268.468891]  ? mnt_ns_tree_add+0xae/0x490
[  268.469088]  copy_mnt_ns+0x5ed/0xa90
[  268.469272]  create_new_namespaces+0xe2/0xb40
[  268.469514]  ? security_capable+0x19d/0x1e0
[  268.469727]  unshare_nsproxy_namespaces+0xca/0x200
[  268.469959]  ksys_unshare+0x482/0xae0
[  268.470138]  ? __pfx_ksys_unshare+0x10/0x10
[  268.470346]  ? __audit_syscall_entry+0x39c/0x500
[  268.470572]  __x64_sys_unshare+0x3a/0x50
[  268.470760]  x64_sys_call+0xd3e/0x2140
[  268.470942]  do_syscall_64+0x6d/0x140
[  268.471118]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  268.471362] RIP: 0033:0x7f7f8fc3ee5d
[  268.471535] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 898
[  268.472375] RSP: 002b:00007ffd32de1248 EFLAGS: 00000286 ORIG_RAX: 0000000000000110
[  268.472719] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f7f8fc3ee5d
[  268.473052] RDX: ffffffffffffff80 RSI: ffffffffffffff80 RDI: 0000000020060480
[  268.473386] RBP: 00007ffd32de1250 R08: 00007ffd32de1280 R09: 00007ffd32de1280
[  268.473720] R10: 00007ffd32de1280 R11: 0000000000000286 R12: 00007ffd32de13a8
[  268.474046] R13: 0000000000401713 R14: 0000000000403e08 R15: 00007f7f90006000
[  268.474387]  </TASK>
[  268.474499] irq event stamp: 1528
[  268.474665] hardirqs last  enabled at (1536): [<ffffffff81662885>] __up_console_sem+0x95/0xb0
[  268.475065] hardirqs last disabled at (1543): [<ffffffff8166286a>] __up_console_sem+0x7a/0xb0
[  268.475454] softirqs last  enabled at (0): [<ffffffff81463a8e>] copy_process+0x1d4e/0x6a40
[  268.475830] softirqs last disabled at (661): [<ffffffff8148a84e>] __irq_exit_rcu+0x10e/0x170
[  268.476213] ---[ end trace 0000000000000000 ]---
"

Regards,
Yi Lai

---

If you don't need the following environment to reproduce the problem or if you
already have one reproduced environment, please ignore the following information.

How to reproduce:
git clone https://gitlab.com/xupengfe/repro_vm_env.git
cd repro_vm_env
tar -xvf repro_vm_env.tar.gz
cd repro_vm_env; ./start3.sh  // it needs qemu-system-x86_64 and I used v7.1.0
  // start3.sh will load bzImage_2241ab53cbb5cdb08a6b2d4688feb13971058f65 v6.2-rc5 kernel
  // You could change the bzImage_xxx as you want
  // Maybe you need to remove line "-drive if=pflash,format=raw,readonly=on,file=./OVMF_CODE.fd \" for different qemu version
You could use below command to log in, there is no password for root.
ssh -p 10023 root@localhost

After login vm(virtual machine) successfully, you could transfer reproduced
binary to the vm by below way, and reproduce the problem in vm:
gcc -pthread -o repro repro.c
scp -P 10023 repro root@localhost:/root/

Get the bzImage for target kernel:
Please use target kconfig and copy it to kernel_src/.config
make olddefconfig
make -jx bzImage           //x should equal or less than cpu num your pc has

Fill the bzImage file into above start3.sh to load the target kernel in vm.


Tips:
If you already have qemu-system-x86_64, please ignore below info.
If you want to install qemu v7.1.0 version:
git clone https://github.com/qemu/qemu.git
cd qemu
git checkout -f v7.1.0
mkdir build
cd build
yum install -y ninja-build.x86_64
yum -y install libslirp-devel.x86_64
../configure --target-list=x86_64-softmmu --enable-kvm --enable-vnc --enable-gtk --enable-sdl --enable-usb-redir --enable-slirp
make
make install 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 03/10] fs: lockless mntns rbtree lookup
  2024-12-19  9:20   ` Lai, Yi
@ 2024-12-19 13:45     ` Christian Brauner
  0 siblings, 0 replies; 20+ messages in thread
From: Christian Brauner @ 2024-12-19 13:45 UTC (permalink / raw)
  To: Lai, Yi
  Cc: Josef Bacik, Jeff Layton, Paul E. McKenney, Peter Ziljstra,
	linux-fsdevel, yi1.lai

On Thu, Dec 19, 2024 at 05:20:15PM +0800, Lai, Yi wrote:
> On Fri, Dec 13, 2024 at 12:03:42AM +0100, Christian Brauner wrote:
> > Currently we use a read-write lock but for the simple search case we can
> > make this lockless. Creating a new mount namespace is a rather rare
> > event compared with querying mounts in a foreign mount namespace. Once
> > this is picked up by e.g., systemd to list mounts in another mount in
> > it's isolated services or in containers this will be used a lot so this
> > seems worthwhile doing.
> > 
> > Signed-off-by: Christian Brauner <brauner@kernel.org>
> > ---
> >  fs/mount.h     |   5 ++-
> >  fs/namespace.c | 119 +++++++++++++++++++++++++++++++++++----------------------
> >  2 files changed, 77 insertions(+), 47 deletions(-)
> > 
> > diff --git a/fs/mount.h b/fs/mount.h
> > index 185fc56afc13338f8185fe818051444d540cbd5b..36ead0e45e8aa7614c00001102563a711d9dae6e 100644
> > --- a/fs/mount.h
> > +++ b/fs/mount.h
> > @@ -12,7 +12,10 @@ struct mnt_namespace {
> >  	struct user_namespace	*user_ns;
> >  	struct ucounts		*ucounts;
> >  	u64			seq;	/* Sequence number to prevent loops */
> > -	wait_queue_head_t poll;
> > +	union {
> > +		wait_queue_head_t	poll;
> > +		struct rcu_head		mnt_ns_rcu;
> > +	};
> >  	u64 event;
> >  	unsigned int		nr_mounts; /* # of mounts in the namespace */
> >  	unsigned int		pending_mounts;
> > diff --git a/fs/namespace.c b/fs/namespace.c
> > index 10fa18dd66018fadfdc9d18c59a851eed7bd55ad..52adee787eb1b6ee8831705b2b121854c3370fb3 100644
> > --- a/fs/namespace.c
> > +++ b/fs/namespace.c
> > @@ -79,6 +79,8 @@ static DECLARE_RWSEM(namespace_sem);
> >  static HLIST_HEAD(unmounted);	/* protected by namespace_sem */
> >  static LIST_HEAD(ex_mountpoints); /* protected by namespace_sem */
> >  static DEFINE_RWLOCK(mnt_ns_tree_lock);
> > +static seqcount_rwlock_t mnt_ns_tree_seqcount = SEQCNT_RWLOCK_ZERO(mnt_ns_tree_seqcount, &mnt_ns_tree_lock);
> > +
> >  static struct rb_root mnt_ns_tree = RB_ROOT; /* protected by mnt_ns_tree_lock */
> >  
> >  struct mount_kattr {
> > @@ -105,17 +107,6 @@ EXPORT_SYMBOL_GPL(fs_kobj);
> >   */
> >  __cacheline_aligned_in_smp DEFINE_SEQLOCK(mount_lock);
> >  
> > -static int mnt_ns_cmp(u64 seq, const struct mnt_namespace *ns)
> > -{
> > -	u64 seq_b = ns->seq;
> > -
> > -	if (seq < seq_b)
> > -		return -1;
> > -	if (seq > seq_b)
> > -		return 1;
> > -	return 0;
> > -}
> > -
> >  static inline struct mnt_namespace *node_to_mnt_ns(const struct rb_node *node)
> >  {
> >  	if (!node)
> > @@ -123,19 +114,41 @@ static inline struct mnt_namespace *node_to_mnt_ns(const struct rb_node *node)
> >  	return rb_entry(node, struct mnt_namespace, mnt_ns_tree_node);
> >  }
> >  
> > -static bool mnt_ns_less(struct rb_node *a, const struct rb_node *b)
> > +static int mnt_ns_cmp(struct rb_node *a, const struct rb_node *b)
> >  {
> >  	struct mnt_namespace *ns_a = node_to_mnt_ns(a);
> >  	struct mnt_namespace *ns_b = node_to_mnt_ns(b);
> >  	u64 seq_a = ns_a->seq;
> > +	u64 seq_b = ns_b->seq;
> > +
> > +	if (seq_a < seq_b)
> > +		return -1;
> > +	if (seq_a > seq_b)
> > +		return 1;
> > +	return 0;
> > +}
> >  
> > -	return mnt_ns_cmp(seq_a, ns_b) < 0;
> > +static inline void mnt_ns_tree_write_lock(void)
> > +{
> > +	write_lock(&mnt_ns_tree_lock);
> > +	write_seqcount_begin(&mnt_ns_tree_seqcount);
> > +}
> > +
> > +static inline void mnt_ns_tree_write_unlock(void)
> > +{
> > +	write_seqcount_end(&mnt_ns_tree_seqcount);
> > +	write_unlock(&mnt_ns_tree_lock);
> >  }
> >  
> >  static void mnt_ns_tree_add(struct mnt_namespace *ns)
> >  {
> > -	guard(write_lock)(&mnt_ns_tree_lock);
> > -	rb_add(&ns->mnt_ns_tree_node, &mnt_ns_tree, mnt_ns_less);
> > +	struct rb_node *node;
> > +
> > +	mnt_ns_tree_write_lock();
> > +	node = rb_find_add_rcu(&ns->mnt_ns_tree_node, &mnt_ns_tree, mnt_ns_cmp);
> > +	mnt_ns_tree_write_unlock();
> > +
> > +	WARN_ON_ONCE(node);
> >  }
> >  
> >  static void mnt_ns_release(struct mnt_namespace *ns)
> > @@ -150,41 +163,36 @@ static void mnt_ns_release(struct mnt_namespace *ns)
> >  }
> >  DEFINE_FREE(mnt_ns_release, struct mnt_namespace *, if (_T) mnt_ns_release(_T))
> >  
> > +static void mnt_ns_release_rcu(struct rcu_head *rcu)
> > +{
> > +	struct mnt_namespace *mnt_ns;
> > +
> > +	mnt_ns = container_of(rcu, struct mnt_namespace, mnt_ns_rcu);
> > +	mnt_ns_release(mnt_ns);
> > +}
> > +
> >  static void mnt_ns_tree_remove(struct mnt_namespace *ns)
> >  {
> >  	/* remove from global mount namespace list */
> >  	if (!is_anon_ns(ns)) {
> > -		guard(write_lock)(&mnt_ns_tree_lock);
> > +		mnt_ns_tree_write_lock();
> >  		rb_erase(&ns->mnt_ns_tree_node, &mnt_ns_tree);
> > +		mnt_ns_tree_write_unlock();
> >  	}
> >  
> > -	mnt_ns_release(ns);
> > +	call_rcu(&ns->mnt_ns_rcu, mnt_ns_release_rcu);
> >  }
> >  
> > -/*
> > - * Returns the mount namespace which either has the specified id, or has the
> > - * next smallest id afer the specified one.
> > - */
> > -static struct mnt_namespace *mnt_ns_find_id_at(u64 mnt_ns_id)
> > +static int mnt_ns_find(const void *key, const struct rb_node *node)
> >  {
> > -	struct rb_node *node = mnt_ns_tree.rb_node;
> > -	struct mnt_namespace *ret = NULL;
> > -
> > -	lockdep_assert_held(&mnt_ns_tree_lock);
> > -
> > -	while (node) {
> > -		struct mnt_namespace *n = node_to_mnt_ns(node);
> > +	const u64 mnt_ns_id = *(u64 *)key;
> > +	const struct mnt_namespace *ns = node_to_mnt_ns(node);
> >  
> > -		if (mnt_ns_id <= n->seq) {
> > -			ret = node_to_mnt_ns(node);
> > -			if (mnt_ns_id == n->seq)
> > -				break;
> > -			node = node->rb_left;
> > -		} else {
> > -			node = node->rb_right;
> > -		}
> > -	}
> > -	return ret;
> > +	if (mnt_ns_id < ns->seq)
> > +		return -1;
> > +	if (mnt_ns_id > ns->seq)
> > +		return 1;
> > +	return 0;
> >  }
> >  
> >  /*
> > @@ -194,18 +202,37 @@ static struct mnt_namespace *mnt_ns_find_id_at(u64 mnt_ns_id)
> >   * namespace the @namespace_sem must first be acquired. If the namespace has
> >   * already shut down before acquiring @namespace_sem, {list,stat}mount() will
> >   * see that the mount rbtree of the namespace is empty.
> > + *
> > + * Note the lookup is lockless protected by a sequence counter. We only
> > + * need to guard against false negatives as false positives aren't
> > + * possible. So if we didn't find a mount namespace and the sequence
> > + * counter has changed we need to retry. If the sequence counter is
> > + * still the same we know the search actually failed.
> >   */
> >  static struct mnt_namespace *lookup_mnt_ns(u64 mnt_ns_id)
> >  {
> > -       struct mnt_namespace *ns;
> > +	struct mnt_namespace *ns;
> > +	struct rb_node *node;
> > +	unsigned int seq;
> > +
> > +	guard(rcu)();
> > +	do {
> > +		seq = read_seqcount_begin(&mnt_ns_tree_seqcount);
> > +		node = rb_find_rcu(&mnt_ns_id, &mnt_ns_tree, mnt_ns_find);
> > +		if (node)
> > +			break;
> > +	} while (read_seqcount_retry(&mnt_ns_tree_seqcount, seq));
> >  
> > -       guard(read_lock)(&mnt_ns_tree_lock);
> > -       ns = mnt_ns_find_id_at(mnt_ns_id);
> > -       if (!ns || ns->seq != mnt_ns_id)
> > -               return NULL;
> > +	if (!node)
> > +		return NULL;
> >  
> > -       refcount_inc(&ns->passive);
> > -       return ns;
> > +	/*
> > +	 * The last reference count is put with RCU delay so we can
> > +	 * unconditonally acquire a reference here.
> > +	 */
> > +	ns = node_to_mnt_ns(node);
> > +	refcount_inc(&ns->passive);
> > +	return ns;
> >  }
> >  
> >  static inline void lock_mount_hash(void)
> > 
> > -- 
> > 2.45.2
> >
> 
> Hi Christian Brauner ,
> 
> Greetings!
> 
> I used Syzkaller and found that there is WARNING in mnt_ns_release in linux v6.13-rc3.

Right, the lockdep assertion is wrong and needs to be dropped.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v3 04/10] rculist: add list_bidir_{del,prev}_rcu()
  2024-12-12 23:03 [PATCH v3 00/10] fs: lockless mntns lookup Christian Brauner
                   ` (2 preceding siblings ...)
  2024-12-12 23:03 ` [PATCH v3 03/10] fs: lockless mntns rbtree lookup Christian Brauner
@ 2024-12-12 23:03 ` Christian Brauner
  2024-12-13  0:42   ` Paul E. McKenney
  2024-12-12 23:03 ` [PATCH v3 05/10] fs: lockless mntns lookup for nsfs Christian Brauner
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 20+ messages in thread
From: Christian Brauner @ 2024-12-12 23:03 UTC (permalink / raw)
  To: Josef Bacik, Jeff Layton
  Cc: Paul E. McKenney, Peter Ziljstra, linux-fsdevel,
	Christian Brauner

Currently there is no primitive for retrieving the previous list member.
To do this we need a new deletion primitive that doesn't poison the prev
pointer and a corresponding retrieval helper. Note that it is not valid
to ues both list_del_rcu() and list_bidir_del_rcu() on the same list.

Suggested-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 include/linux/rculist.h | 47 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 47 insertions(+)

diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index 14dfa6008467e803d57f98cfa0275569f1c6a181..270a9ee2f7976b1736545667973265a3bfb7ec41 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -30,6 +30,17 @@ static inline void INIT_LIST_HEAD_RCU(struct list_head *list)
  * way, we must not access it directly
  */
 #define list_next_rcu(list)	(*((struct list_head __rcu **)(&(list)->next)))
+/*
+ * Return the ->prev pointer of a list_head in an rcu safe way. Don't
+ * access it directly.
+ *
+ * Any list traversed with list_bidir_prev_rcu() must never use
+ * list_del_rcu().  Doing so will poison the ->prev pointer that
+ * list_bidir_prev_rcu() relies on, which will result in segfaults.
+ * To prevent these segfaults, use list_bidir_del_rcu() instead
+ * of list_del_rcu().
+ */
+#define list_bidir_prev_rcu(list) (*((struct list_head __rcu **)(&(list)->prev)))
 
 /**
  * list_tail_rcu - returns the prev pointer of the head of the list
@@ -158,6 +169,42 @@ static inline void list_del_rcu(struct list_head *entry)
 	entry->prev = LIST_POISON2;
 }
 
+/**
+ * list_bidir_del_rcu - deletes entry from list without re-initialization
+ * @entry: the element to delete from the list.
+ *
+ * In contrast to list_del_rcu() doesn't poison the prev pointer thus
+ * allowing backwards traversal via list_bidir_prev_rcu().
+ *
+ * Note: list_empty() on entry does not return true after this because
+ * the entry is in a special undefined state that permits RCU-based
+ * lockfree reverse traversal. In particular this means that we can not
+ * poison the forward and backwards pointers that may still be used for
+ * walking the list.
+ *
+ * The caller must take whatever precautions are necessary (such as
+ * holding appropriate locks) to avoid racing with another list-mutation
+ * primitive, such as list_bidir_del_rcu() or list_add_rcu(), running on
+ * this same list. However, it is perfectly legal to run concurrently
+ * with the _rcu list-traversal primitives, such as
+ * list_for_each_entry_rcu().
+ *
+ * Noe that the it is not allowed to use list_del_rcu() and
+ * list_bidir_del_rcu() on the same list.
+ *
+ * Note that list_del_rcu() and list_bidir_del_rcu() must not be used on
+ * the same list.
+ *
+ * Note that the caller is not permitted to immediately free
+ * the newly deleted entry.  Instead, either synchronize_rcu()
+ * or call_rcu() must be used to defer freeing until an RCU
+ * grace period has elapsed.
+ */
+static inline void list_bidir_del_rcu(struct list_head *entry)
+{
+	__list_del_entry(entry);
+}
+
 /**
  * hlist_del_init_rcu - deletes entry from hash list with re-initialization
  * @n: the element to delete from the hash list.

-- 
2.45.2


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 04/10] rculist: add list_bidir_{del,prev}_rcu()
  2024-12-12 23:03 ` [PATCH v3 04/10] rculist: add list_bidir_{del,prev}_rcu() Christian Brauner
@ 2024-12-13  0:42   ` Paul E. McKenney
  2024-12-13 13:49     ` Christian Brauner
  0 siblings, 1 reply; 20+ messages in thread
From: Paul E. McKenney @ 2024-12-13  0:42 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Josef Bacik, Jeff Layton, Peter Ziljstra, linux-fsdevel

On Fri, Dec 13, 2024 at 12:03:43AM +0100, Christian Brauner wrote:
> Currently there is no primitive for retrieving the previous list member.
> To do this we need a new deletion primitive that doesn't poison the prev
> pointer and a corresponding retrieval helper. Note that it is not valid
> to ues both list_del_rcu() and list_bidir_del_rcu() on the same list.
> 
> Suggested-by: "Paul E. McKenney" <paulmck@kernel.org>
> Signed-off-by: Christian Brauner <brauner@kernel.org>

One additional nit below.  With that fixed:

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>

> ---
>  include/linux/rculist.h | 47 +++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 47 insertions(+)
> 
> diff --git a/include/linux/rculist.h b/include/linux/rculist.h
> index 14dfa6008467e803d57f98cfa0275569f1c6a181..270a9ee2f7976b1736545667973265a3bfb7ec41 100644
> --- a/include/linux/rculist.h
> +++ b/include/linux/rculist.h
> @@ -30,6 +30,17 @@ static inline void INIT_LIST_HEAD_RCU(struct list_head *list)
>   * way, we must not access it directly
>   */
>  #define list_next_rcu(list)	(*((struct list_head __rcu **)(&(list)->next)))
> +/*
> + * Return the ->prev pointer of a list_head in an rcu safe way. Don't
> + * access it directly.
> + *
> + * Any list traversed with list_bidir_prev_rcu() must never use
> + * list_del_rcu().  Doing so will poison the ->prev pointer that
> + * list_bidir_prev_rcu() relies on, which will result in segfaults.
> + * To prevent these segfaults, use list_bidir_del_rcu() instead
> + * of list_del_rcu().
> + */
> +#define list_bidir_prev_rcu(list) (*((struct list_head __rcu **)(&(list)->prev)))
>  
>  /**
>   * list_tail_rcu - returns the prev pointer of the head of the list
> @@ -158,6 +169,42 @@ static inline void list_del_rcu(struct list_head *entry)
>  	entry->prev = LIST_POISON2;
>  }
>  
> +/**
> + * list_bidir_del_rcu - deletes entry from list without re-initialization
> + * @entry: the element to delete from the list.
> + *
> + * In contrast to list_del_rcu() doesn't poison the prev pointer thus
> + * allowing backwards traversal via list_bidir_prev_rcu().
> + *
> + * Note: list_empty() on entry does not return true after this because
> + * the entry is in a special undefined state that permits RCU-based
> + * lockfree reverse traversal. In particular this means that we can not
> + * poison the forward and backwards pointers that may still be used for
> + * walking the list.
> + *
> + * The caller must take whatever precautions are necessary (such as
> + * holding appropriate locks) to avoid racing with another list-mutation
> + * primitive, such as list_bidir_del_rcu() or list_add_rcu(), running on
> + * this same list. However, it is perfectly legal to run concurrently
> + * with the _rcu list-traversal primitives, such as
> + * list_for_each_entry_rcu().
> + *
> + * Noe that the it is not allowed to use list_del_rcu() and
> + * list_bidir_del_rcu() on the same list.

I am guessing that the above paragraph is a leftover?

> + * Note that list_del_rcu() and list_bidir_del_rcu() must not be used on
> + * the same list.
> + *
> + * Note that the caller is not permitted to immediately free
> + * the newly deleted entry.  Instead, either synchronize_rcu()
> + * or call_rcu() must be used to defer freeing until an RCU
> + * grace period has elapsed.
> + */
> +static inline void list_bidir_del_rcu(struct list_head *entry)
> +{
> +	__list_del_entry(entry);
> +}
> +
>  /**
>   * hlist_del_init_rcu - deletes entry from hash list with re-initialization
>   * @n: the element to delete from the hash list.
> 
> -- 
> 2.45.2
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 04/10] rculist: add list_bidir_{del,prev}_rcu()
  2024-12-13  0:42   ` Paul E. McKenney
@ 2024-12-13 13:49     ` Christian Brauner
  0 siblings, 0 replies; 20+ messages in thread
From: Christian Brauner @ 2024-12-13 13:49 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Josef Bacik, Jeff Layton, Peter Ziljstra, linux-fsdevel

On Thu, Dec 12, 2024 at 04:42:54PM -0800, Paul E. McKenney wrote:
> On Fri, Dec 13, 2024 at 12:03:43AM +0100, Christian Brauner wrote:
> > Currently there is no primitive for retrieving the previous list member.
> > To do this we need a new deletion primitive that doesn't poison the prev
> > pointer and a corresponding retrieval helper. Note that it is not valid
> > to ues both list_del_rcu() and list_bidir_del_rcu() on the same list.
> > 
> > Suggested-by: "Paul E. McKenney" <paulmck@kernel.org>
> > Signed-off-by: Christian Brauner <brauner@kernel.org>
> 
> One additional nit below.  With that fixed:
> 
> Reviewed-by: Paul E. McKenney <paulmck@kernel.org>

Thansk, Paul!

> 
> > ---
> >  include/linux/rculist.h | 47 +++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 47 insertions(+)
> > 
> > diff --git a/include/linux/rculist.h b/include/linux/rculist.h
> > index 14dfa6008467e803d57f98cfa0275569f1c6a181..270a9ee2f7976b1736545667973265a3bfb7ec41 100644
> > --- a/include/linux/rculist.h
> > +++ b/include/linux/rculist.h
> > @@ -30,6 +30,17 @@ static inline void INIT_LIST_HEAD_RCU(struct list_head *list)
> >   * way, we must not access it directly
> >   */
> >  #define list_next_rcu(list)	(*((struct list_head __rcu **)(&(list)->next)))
> > +/*
> > + * Return the ->prev pointer of a list_head in an rcu safe way. Don't
> > + * access it directly.
> > + *
> > + * Any list traversed with list_bidir_prev_rcu() must never use
> > + * list_del_rcu().  Doing so will poison the ->prev pointer that
> > + * list_bidir_prev_rcu() relies on, which will result in segfaults.
> > + * To prevent these segfaults, use list_bidir_del_rcu() instead
> > + * of list_del_rcu().
> > + */
> > +#define list_bidir_prev_rcu(list) (*((struct list_head __rcu **)(&(list)->prev)))
> >  
> >  /**
> >   * list_tail_rcu - returns the prev pointer of the head of the list
> > @@ -158,6 +169,42 @@ static inline void list_del_rcu(struct list_head *entry)
> >  	entry->prev = LIST_POISON2;
> >  }
> >  
> > +/**
> > + * list_bidir_del_rcu - deletes entry from list without re-initialization
> > + * @entry: the element to delete from the list.
> > + *
> > + * In contrast to list_del_rcu() doesn't poison the prev pointer thus
> > + * allowing backwards traversal via list_bidir_prev_rcu().
> > + *
> > + * Note: list_empty() on entry does not return true after this because
> > + * the entry is in a special undefined state that permits RCU-based
> > + * lockfree reverse traversal. In particular this means that we can not
> > + * poison the forward and backwards pointers that may still be used for
> > + * walking the list.
> > + *
> > + * The caller must take whatever precautions are necessary (such as
> > + * holding appropriate locks) to avoid racing with another list-mutation
> > + * primitive, such as list_bidir_del_rcu() or list_add_rcu(), running on
> > + * this same list. However, it is perfectly legal to run concurrently
> > + * with the _rcu list-traversal primitives, such as
> > + * list_for_each_entry_rcu().
> > + *
> > + * Noe that the it is not allowed to use list_del_rcu() and
> > + * list_bidir_del_rcu() on the same list.
> 
> I am guessing that the above paragraph is a leftover?

Indeed, fixed!

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v3 05/10] fs: lockless mntns lookup for nsfs
  2024-12-12 23:03 [PATCH v3 00/10] fs: lockless mntns lookup Christian Brauner
                   ` (3 preceding siblings ...)
  2024-12-12 23:03 ` [PATCH v3 04/10] rculist: add list_bidir_{del,prev}_rcu() Christian Brauner
@ 2024-12-12 23:03 ` Christian Brauner
  2024-12-12 23:03 ` [PATCH v3 06/10] fs: simplify rwlock to spinlock Christian Brauner
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 20+ messages in thread
From: Christian Brauner @ 2024-12-12 23:03 UTC (permalink / raw)
  To: Josef Bacik, Jeff Layton
  Cc: Paul E. McKenney, Peter Ziljstra, linux-fsdevel,
	Christian Brauner

We already made the rbtree lookup lockless for the simple lookup case.
However, walking the list of mount namespaces via nsfs still happens
with taking the read lock blocking concurrent additions of new mount
namespaces pointlessly. Plus, such additions are rare anyway so allow
lockless lookup of the previous and next mount namespace by keeping a
separate list. This also allows to make some things simpler in the code.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/mount.h     | 13 ++++---------
 fs/namespace.c | 42 +++++++++++++++++++++++++++++-------------
 fs/nsfs.c      |  5 +----
 3 files changed, 34 insertions(+), 26 deletions(-)

diff --git a/fs/mount.h b/fs/mount.h
index 36ead0e45e8aa7614c00001102563a711d9dae6e..8cda387f47c5efd9af5e2e422569446c3d51986f 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -20,6 +20,7 @@ struct mnt_namespace {
 	unsigned int		nr_mounts; /* # of mounts in the namespace */
 	unsigned int		pending_mounts;
 	struct rb_node		mnt_ns_tree_node; /* node in the mnt_ns_tree */
+	struct list_head	mnt_ns_list; /* entry in the sequential list of mounts namespace */
 	refcount_t		passive; /* number references not pinning @mounts */
 } __randomize_layout;
 
@@ -157,15 +158,9 @@ static inline void move_from_ns(struct mount *mnt, struct list_head *dt_list)
 }
 
 bool has_locked_children(struct mount *mnt, struct dentry *dentry);
-struct mnt_namespace *__lookup_next_mnt_ns(struct mnt_namespace *mnt_ns, bool previous);
-static inline struct mnt_namespace *lookup_next_mnt_ns(struct mnt_namespace *mntns)
-{
-	return __lookup_next_mnt_ns(mntns, false);
-}
-static inline struct mnt_namespace *lookup_prev_mnt_ns(struct mnt_namespace *mntns)
-{
-	return __lookup_next_mnt_ns(mntns, true);
-}
+struct mnt_namespace *get_sequential_mnt_ns(struct mnt_namespace *mnt_ns,
+					    bool previous);
+
 static inline struct mnt_namespace *to_mnt_ns(struct ns_common *ns)
 {
 	return container_of(ns, struct mnt_namespace, ns);
diff --git a/fs/namespace.c b/fs/namespace.c
index 52adee787eb1b6ee8831705b2b121854c3370fb3..71509309652315e5aa9c6b16d13de678bf1c98b3 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -82,6 +82,7 @@ static DEFINE_RWLOCK(mnt_ns_tree_lock);
 static seqcount_rwlock_t mnt_ns_tree_seqcount = SEQCNT_RWLOCK_ZERO(mnt_ns_tree_seqcount, &mnt_ns_tree_lock);
 
 static struct rb_root mnt_ns_tree = RB_ROOT; /* protected by mnt_ns_tree_lock */
+static LIST_HEAD(mnt_ns_list); /* protected by mnt_ns_tree_lock */
 
 struct mount_kattr {
 	unsigned int attr_set;
@@ -142,10 +143,19 @@ static inline void mnt_ns_tree_write_unlock(void)
 
 static void mnt_ns_tree_add(struct mnt_namespace *ns)
 {
-	struct rb_node *node;
+	struct rb_node *node, *prev;
 
 	mnt_ns_tree_write_lock();
 	node = rb_find_add_rcu(&ns->mnt_ns_tree_node, &mnt_ns_tree, mnt_ns_cmp);
+	/*
+	 * If there's no previous entry simply add it after the
+	 * head and if there is add it after the previous entry.
+	 */
+	prev = rb_prev(&ns->mnt_ns_tree_node);
+	if (!prev)
+		list_add_rcu(&ns->mnt_ns_list, &mnt_ns_list);
+	else
+		list_add_rcu(&ns->mnt_ns_list, &node_to_mnt_ns(prev)->mnt_ns_list);
 	mnt_ns_tree_write_unlock();
 
 	WARN_ON_ONCE(node);
@@ -177,6 +187,7 @@ static void mnt_ns_tree_remove(struct mnt_namespace *ns)
 	if (!is_anon_ns(ns)) {
 		mnt_ns_tree_write_lock();
 		rb_erase(&ns->mnt_ns_tree_node, &mnt_ns_tree);
+		list_bidir_del_rcu(&ns->mnt_ns_list);
 		mnt_ns_tree_write_unlock();
 	}
 
@@ -2091,30 +2102,34 @@ struct ns_common *from_mnt_ns(struct mnt_namespace *mnt)
 	return &mnt->ns;
 }
 
-struct mnt_namespace *__lookup_next_mnt_ns(struct mnt_namespace *mntns, bool previous)
+struct mnt_namespace *get_sequential_mnt_ns(struct mnt_namespace *mntns, bool previous)
 {
-	guard(read_lock)(&mnt_ns_tree_lock);
+	guard(rcu)();
+
 	for (;;) {
-		struct rb_node *node;
+		struct list_head *list;
 
 		if (previous)
-			node = rb_prev(&mntns->mnt_ns_tree_node);
+			list = rcu_dereference(list_bidir_prev_rcu(&mntns->mnt_ns_list));
 		else
-			node = rb_next(&mntns->mnt_ns_tree_node);
-		if (!node)
+			list = rcu_dereference(list_next_rcu(&mntns->mnt_ns_list));
+		if (list_is_head(list, &mnt_ns_list))
 			return ERR_PTR(-ENOENT);
 
-		mntns = node_to_mnt_ns(node);
-		node = &mntns->mnt_ns_tree_node;
+		mntns = list_entry_rcu(list, struct mnt_namespace, mnt_ns_list);
 
+		/*
+		 * The last passive reference count is put with RCU
+		 * delay so accessing the mount namespace is not just
+		 * safe but all relevant members are still valid.
+		 */
 		if (!ns_capable_noaudit(mntns->user_ns, CAP_SYS_ADMIN))
 			continue;
 
 		/*
-		 * Holding mnt_ns_tree_lock prevents the mount namespace from
-		 * being freed but it may well be on it's deathbed. We want an
-		 * active reference, not just a passive one here as we're
-		 * persisting the mount namespace.
+		 * We need an active reference count as we're persisting
+		 * the mount namespace and it might already be on its
+		 * deathbed.
 		 */
 		if (!refcount_inc_not_zero(&mntns->ns.count))
 			continue;
@@ -3931,6 +3946,7 @@ static struct mnt_namespace *alloc_mnt_ns(struct user_namespace *user_ns, bool a
 	refcount_set(&new_ns->ns.count, 1);
 	refcount_set(&new_ns->passive, 1);
 	new_ns->mounts = RB_ROOT;
+	INIT_LIST_HEAD(&new_ns->mnt_ns_list);
 	RB_CLEAR_NODE(&new_ns->mnt_ns_tree_node);
 	init_waitqueue_head(&new_ns->poll);
 	new_ns->user_ns = get_user_ns(user_ns);
diff --git a/fs/nsfs.c b/fs/nsfs.c
index c675fc40ce2dc674f0dafce5c4924b910a73a23f..663f8656158d52d391ba80ef1d320197d3d654e0 100644
--- a/fs/nsfs.c
+++ b/fs/nsfs.c
@@ -274,10 +274,7 @@ static long ns_ioctl(struct file *filp, unsigned int ioctl,
 		if (usize < MNT_NS_INFO_SIZE_VER0)
 			return -EINVAL;
 
-		if (previous)
-			mnt_ns = lookup_prev_mnt_ns(to_mnt_ns(ns));
-		else
-			mnt_ns = lookup_next_mnt_ns(to_mnt_ns(ns));
+		mnt_ns = get_sequential_mnt_ns(to_mnt_ns(ns), previous);
 		if (IS_ERR(mnt_ns))
 			return PTR_ERR(mnt_ns);
 

-- 
2.45.2


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v3 06/10] fs: simplify rwlock to spinlock
  2024-12-12 23:03 [PATCH v3 00/10] fs: lockless mntns lookup Christian Brauner
                   ` (4 preceding siblings ...)
  2024-12-12 23:03 ` [PATCH v3 05/10] fs: lockless mntns lookup for nsfs Christian Brauner
@ 2024-12-12 23:03 ` Christian Brauner
  2024-12-12 23:03 ` [PATCH v3 07/10] seltests: move nsfs into filesystems subfolder Christian Brauner
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 20+ messages in thread
From: Christian Brauner @ 2024-12-12 23:03 UTC (permalink / raw)
  To: Josef Bacik, Jeff Layton
  Cc: Paul E. McKenney, Peter Ziljstra, linux-fsdevel,
	Christian Brauner

We're not taking the read_lock() anymore now that all lookup is
lockless. Just use a simple spinlock.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/namespace.c | 15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 71509309652315e5aa9c6b16d13de678bf1c98b3..966dcd27c81cc877837eca747babe0bc31aaf922 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -78,8 +78,7 @@ static struct kmem_cache *mnt_cache __ro_after_init;
 static DECLARE_RWSEM(namespace_sem);
 static HLIST_HEAD(unmounted);	/* protected by namespace_sem */
 static LIST_HEAD(ex_mountpoints); /* protected by namespace_sem */
-static DEFINE_RWLOCK(mnt_ns_tree_lock);
-static seqcount_rwlock_t mnt_ns_tree_seqcount = SEQCNT_RWLOCK_ZERO(mnt_ns_tree_seqcount, &mnt_ns_tree_lock);
+static DEFINE_SEQLOCK(mnt_ns_tree_lock);
 
 static struct rb_root mnt_ns_tree = RB_ROOT; /* protected by mnt_ns_tree_lock */
 static LIST_HEAD(mnt_ns_list); /* protected by mnt_ns_tree_lock */
@@ -131,14 +130,12 @@ static int mnt_ns_cmp(struct rb_node *a, const struct rb_node *b)
 
 static inline void mnt_ns_tree_write_lock(void)
 {
-	write_lock(&mnt_ns_tree_lock);
-	write_seqcount_begin(&mnt_ns_tree_seqcount);
+	write_seqlock(&mnt_ns_tree_lock);
 }
 
 static inline void mnt_ns_tree_write_unlock(void)
 {
-	write_seqcount_end(&mnt_ns_tree_seqcount);
-	write_unlock(&mnt_ns_tree_lock);
+	write_sequnlock(&mnt_ns_tree_lock);
 }
 
 static void mnt_ns_tree_add(struct mnt_namespace *ns)
@@ -163,7 +160,7 @@ static void mnt_ns_tree_add(struct mnt_namespace *ns)
 
 static void mnt_ns_release(struct mnt_namespace *ns)
 {
-	lockdep_assert_not_held(&mnt_ns_tree_lock);
+	lockdep_assert_not_held(&mnt_ns_tree_lock.lock);
 
 	/* keep alive for {list,stat}mount() */
 	if (refcount_dec_and_test(&ns->passive)) {
@@ -228,11 +225,11 @@ static struct mnt_namespace *lookup_mnt_ns(u64 mnt_ns_id)
 
 	guard(rcu)();
 	do {
-		seq = read_seqcount_begin(&mnt_ns_tree_seqcount);
+		seq = read_seqbegin(&mnt_ns_tree_lock);
 		node = rb_find_rcu(&mnt_ns_id, &mnt_ns_tree, mnt_ns_find);
 		if (node)
 			break;
-	} while (read_seqcount_retry(&mnt_ns_tree_seqcount, seq));
+	} while (read_seqretry(&mnt_ns_tree_lock, seq));
 
 	if (!node)
 		return NULL;

-- 
2.45.2


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v3 07/10] seltests: move nsfs into filesystems subfolder
  2024-12-12 23:03 [PATCH v3 00/10] fs: lockless mntns lookup Christian Brauner
                   ` (5 preceding siblings ...)
  2024-12-12 23:03 ` [PATCH v3 06/10] fs: simplify rwlock to spinlock Christian Brauner
@ 2024-12-12 23:03 ` Christian Brauner
  2024-12-12 23:03 ` [PATCH v3 08/10] selftests: add tests for mntns iteration Christian Brauner
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 20+ messages in thread
From: Christian Brauner @ 2024-12-12 23:03 UTC (permalink / raw)
  To: Josef Bacik, Jeff Layton
  Cc: Paul E. McKenney, Peter Ziljstra, linux-fsdevel,
	Christian Brauner

I'm going to be adding new tests for it and it belongs under
filesystem selftests.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 tools/testing/selftests/{ => filesystems}/nsfs/.gitignore | 0
 tools/testing/selftests/{ => filesystems}/nsfs/Makefile   | 2 +-
 tools/testing/selftests/{ => filesystems}/nsfs/config     | 0
 tools/testing/selftests/{ => filesystems}/nsfs/owner.c    | 0
 tools/testing/selftests/{ => filesystems}/nsfs/pidns.c    | 0
 5 files changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/nsfs/.gitignore b/tools/testing/selftests/filesystems/nsfs/.gitignore
similarity index 100%
rename from tools/testing/selftests/nsfs/.gitignore
rename to tools/testing/selftests/filesystems/nsfs/.gitignore
diff --git a/tools/testing/selftests/nsfs/Makefile b/tools/testing/selftests/filesystems/nsfs/Makefile
similarity index 82%
rename from tools/testing/selftests/nsfs/Makefile
rename to tools/testing/selftests/filesystems/nsfs/Makefile
index dd9bd50b7b936e3ff16274260e149bc8d9cd23f3..c2f3ca6e488e9ddb49514e1b8e93909d5594259b 100644
--- a/tools/testing/selftests/nsfs/Makefile
+++ b/tools/testing/selftests/filesystems/nsfs/Makefile
@@ -3,4 +3,4 @@ TEST_GEN_PROGS := owner pidns
 
 CFLAGS := -Wall -Werror
 
-include ../lib.mk
+include ../../lib.mk
diff --git a/tools/testing/selftests/nsfs/config b/tools/testing/selftests/filesystems/nsfs/config
similarity index 100%
rename from tools/testing/selftests/nsfs/config
rename to tools/testing/selftests/filesystems/nsfs/config
diff --git a/tools/testing/selftests/nsfs/owner.c b/tools/testing/selftests/filesystems/nsfs/owner.c
similarity index 100%
rename from tools/testing/selftests/nsfs/owner.c
rename to tools/testing/selftests/filesystems/nsfs/owner.c
diff --git a/tools/testing/selftests/nsfs/pidns.c b/tools/testing/selftests/filesystems/nsfs/pidns.c
similarity index 100%
rename from tools/testing/selftests/nsfs/pidns.c
rename to tools/testing/selftests/filesystems/nsfs/pidns.c

-- 
2.45.2


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v3 08/10] selftests: add tests for mntns iteration
  2024-12-12 23:03 [PATCH v3 00/10] fs: lockless mntns lookup Christian Brauner
                   ` (6 preceding siblings ...)
  2024-12-12 23:03 ` [PATCH v3 07/10] seltests: move nsfs into filesystems subfolder Christian Brauner
@ 2024-12-12 23:03 ` Christian Brauner
  2024-12-12 23:03 ` [PATCH v3 09/10] selftests: remove unneeded include Christian Brauner
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 20+ messages in thread
From: Christian Brauner @ 2024-12-12 23:03 UTC (permalink / raw)
  To: Josef Bacik, Jeff Layton
  Cc: Paul E. McKenney, Peter Ziljstra, linux-fsdevel,
	Christian Brauner

Test that forward and backward iteration works correctly.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 .../testing/selftests/filesystems/nsfs/.gitignore  |   1 +
 tools/testing/selftests/filesystems/nsfs/Makefile  |   2 +-
 .../selftests/filesystems/nsfs/iterate_mntns.c     | 149 +++++++++++++++++++++
 3 files changed, 151 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/filesystems/nsfs/.gitignore b/tools/testing/selftests/filesystems/nsfs/.gitignore
index ed79ebdf286e4d945cfbbf80fb072ba3e05c9112..92a8249006d1e0817800df0057183a94ef0f939d 100644
--- a/tools/testing/selftests/filesystems/nsfs/.gitignore
+++ b/tools/testing/selftests/filesystems/nsfs/.gitignore
@@ -1,3 +1,4 @@
 # SPDX-License-Identifier: GPL-2.0-only
 owner
 pidns
+iterate_mntns
diff --git a/tools/testing/selftests/filesystems/nsfs/Makefile b/tools/testing/selftests/filesystems/nsfs/Makefile
index c2f3ca6e488e9ddb49514e1b8e93909d5594259b..231aaa7dfd95c638c23e0a8e5a1d4f7f16f00f7b 100644
--- a/tools/testing/selftests/filesystems/nsfs/Makefile
+++ b/tools/testing/selftests/filesystems/nsfs/Makefile
@@ -1,5 +1,5 @@
 # SPDX-License-Identifier: GPL-2.0-only
-TEST_GEN_PROGS := owner pidns
+TEST_GEN_PROGS := owner pidns iterate_mntns
 
 CFLAGS := -Wall -Werror
 
diff --git a/tools/testing/selftests/filesystems/nsfs/iterate_mntns.c b/tools/testing/selftests/filesystems/nsfs/iterate_mntns.c
new file mode 100644
index 0000000000000000000000000000000000000000..457cf76f3c5f368872292714b44c037968de4ad3
--- /dev/null
+++ b/tools/testing/selftests/filesystems/nsfs/iterate_mntns.c
@@ -0,0 +1,149 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+// Copyright (c) 2024 Christian Brauner <brauner@kernel.org>
+
+#define _GNU_SOURCE
+#include <fcntl.h>
+#include <sched.h>
+#include <stdio.h>
+#include <string.h>
+#include <sys/stat.h>
+#include <sys/mount.h>
+#include <unistd.h>
+
+#include "../../kselftest_harness.h"
+
+#define MNT_NS_COUNT 11
+#define MNT_NS_LAST_INDEX 10
+
+struct mnt_ns_info {
+	__u32 size;
+	__u32 nr_mounts;
+	__u64 mnt_ns_id;
+};
+
+#define MNT_NS_INFO_SIZE_VER0 16 /* size of first published struct */
+
+/* Get information about namespace. */
+#define NS_MNT_GET_INFO _IOR(0xb7, 10, struct mnt_ns_info)
+/* Get next namespace. */
+#define NS_MNT_GET_NEXT _IOR(0xb7, 11, struct mnt_ns_info)
+/* Get previous namespace. */
+#define NS_MNT_GET_PREV _IOR(0xb7, 12, struct mnt_ns_info)
+
+FIXTURE(iterate_mount_namespaces) {
+	int fd_mnt_ns[MNT_NS_COUNT];
+	__u64 mnt_ns_id[MNT_NS_COUNT];
+};
+
+FIXTURE_SETUP(iterate_mount_namespaces)
+{
+	for (int i = 0; i < MNT_NS_COUNT; i++)
+		self->fd_mnt_ns[i] = -EBADF;
+
+	/*
+	 * Creating a new user namespace let's us guarantee that we only see
+	 * mount namespaces that we did actually create.
+	 */
+	ASSERT_EQ(unshare(CLONE_NEWUSER), 0);
+
+	for (int i = 0; i < MNT_NS_COUNT; i++) {
+		struct mnt_ns_info info = {};
+
+		ASSERT_EQ(unshare(CLONE_NEWNS), 0);
+		self->fd_mnt_ns[i] = open("/proc/self/ns/mnt", O_RDONLY | O_CLOEXEC);
+		ASSERT_GE(self->fd_mnt_ns[i], 0);
+		ASSERT_EQ(ioctl(self->fd_mnt_ns[i], NS_MNT_GET_INFO, &info), 0);
+		self->mnt_ns_id[i] = info.mnt_ns_id;
+	}
+}
+
+FIXTURE_TEARDOWN(iterate_mount_namespaces)
+{
+	for (int i = 0; i < MNT_NS_COUNT; i++) {
+		if (self->fd_mnt_ns[i] < 0)
+			continue;
+		ASSERT_EQ(close(self->fd_mnt_ns[i]), 0);
+	}
+}
+
+TEST_F(iterate_mount_namespaces, iterate_all_forward)
+{
+	int fd_mnt_ns_cur, count = 0;
+
+	fd_mnt_ns_cur = fcntl(self->fd_mnt_ns[0], F_DUPFD_CLOEXEC);
+	ASSERT_GE(fd_mnt_ns_cur, 0);
+
+	for (;; count++) {
+		struct mnt_ns_info info = {};
+		int fd_mnt_ns_next;
+
+		fd_mnt_ns_next = ioctl(fd_mnt_ns_cur, NS_MNT_GET_NEXT, &info);
+		if (fd_mnt_ns_next < 0 && errno == ENOENT)
+			break;
+		ASSERT_GE(fd_mnt_ns_next, 0);
+		ASSERT_EQ(close(fd_mnt_ns_cur), 0);
+		fd_mnt_ns_cur = fd_mnt_ns_next;
+	}
+	ASSERT_EQ(count, MNT_NS_LAST_INDEX);
+}
+
+TEST_F(iterate_mount_namespaces, iterate_all_backwards)
+{
+	int fd_mnt_ns_cur, count = 0;
+
+	fd_mnt_ns_cur = fcntl(self->fd_mnt_ns[MNT_NS_LAST_INDEX], F_DUPFD_CLOEXEC);
+	ASSERT_GE(fd_mnt_ns_cur, 0);
+
+	for (;; count++) {
+		struct mnt_ns_info info = {};
+		int fd_mnt_ns_prev;
+
+		fd_mnt_ns_prev = ioctl(fd_mnt_ns_cur, NS_MNT_GET_PREV, &info);
+		if (fd_mnt_ns_prev < 0 && errno == ENOENT)
+			break;
+		ASSERT_GE(fd_mnt_ns_prev, 0);
+		ASSERT_EQ(close(fd_mnt_ns_cur), 0);
+		fd_mnt_ns_cur = fd_mnt_ns_prev;
+	}
+	ASSERT_EQ(count, MNT_NS_LAST_INDEX);
+}
+
+TEST_F(iterate_mount_namespaces, iterate_forward)
+{
+	int fd_mnt_ns_cur;
+
+	ASSERT_EQ(setns(self->fd_mnt_ns[0], CLONE_NEWNS), 0);
+
+	fd_mnt_ns_cur = self->fd_mnt_ns[0];
+	for (int i = 1; i < MNT_NS_COUNT; i++) {
+		struct mnt_ns_info info = {};
+		int fd_mnt_ns_next;
+
+		fd_mnt_ns_next = ioctl(fd_mnt_ns_cur, NS_MNT_GET_NEXT, &info);
+		ASSERT_GE(fd_mnt_ns_next, 0);
+		ASSERT_EQ(close(fd_mnt_ns_cur), 0);
+		fd_mnt_ns_cur = fd_mnt_ns_next;
+		ASSERT_EQ(info.mnt_ns_id, self->mnt_ns_id[i]);
+	}
+}
+
+TEST_F(iterate_mount_namespaces, iterate_backward)
+{
+	int fd_mnt_ns_cur;
+
+	ASSERT_EQ(setns(self->fd_mnt_ns[MNT_NS_LAST_INDEX], CLONE_NEWNS), 0);
+
+	fd_mnt_ns_cur = self->fd_mnt_ns[MNT_NS_LAST_INDEX];
+	for (int i = MNT_NS_LAST_INDEX - 1; i >= 0; i--) {
+		struct mnt_ns_info info = {};
+		int fd_mnt_ns_prev;
+
+		fd_mnt_ns_prev = ioctl(fd_mnt_ns_cur, NS_MNT_GET_PREV, &info);
+		ASSERT_GE(fd_mnt_ns_prev, 0);
+		ASSERT_EQ(close(fd_mnt_ns_cur), 0);
+		fd_mnt_ns_cur = fd_mnt_ns_prev;
+		ASSERT_EQ(info.mnt_ns_id, self->mnt_ns_id[i]);
+	}
+}
+
+TEST_HARNESS_MAIN

-- 
2.45.2


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v3 09/10] selftests: remove unneeded include
  2024-12-12 23:03 [PATCH v3 00/10] fs: lockless mntns lookup Christian Brauner
                   ` (7 preceding siblings ...)
  2024-12-12 23:03 ` [PATCH v3 08/10] selftests: add tests for mntns iteration Christian Brauner
@ 2024-12-12 23:03 ` Christian Brauner
  2024-12-12 23:03 ` [PATCH v3 10/10] samples: add test-list-all-mounts Christian Brauner
  2024-12-13 19:03 ` [PATCH v3 00/10] fs: lockless mntns lookup Jeff Layton
  10 siblings, 0 replies; 20+ messages in thread
From: Christian Brauner @ 2024-12-12 23:03 UTC (permalink / raw)
  To: Josef Bacik, Jeff Layton
  Cc: Paul E. McKenney, Peter Ziljstra, linux-fsdevel,
	Christian Brauner

The pidfd header will be included in a sample program and this pulls in
all the mount definitions that would be causing problems.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 tools/testing/selftests/pidfd/pidfd.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/tools/testing/selftests/pidfd/pidfd.h b/tools/testing/selftests/pidfd/pidfd.h
index 88d6830ee004df3c7a9d3ebcdab89d5775e9ab9b..3a96053e52e7bbf5f7f85908c2093e9023b1d3d6 100644
--- a/tools/testing/selftests/pidfd/pidfd.h
+++ b/tools/testing/selftests/pidfd/pidfd.h
@@ -12,7 +12,6 @@
 #include <stdlib.h>
 #include <string.h>
 #include <syscall.h>
-#include <sys/mount.h>
 #include <sys/types.h>
 #include <sys/wait.h>
 

-- 
2.45.2


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v3 10/10] samples: add test-list-all-mounts
  2024-12-12 23:03 [PATCH v3 00/10] fs: lockless mntns lookup Christian Brauner
                   ` (8 preceding siblings ...)
  2024-12-12 23:03 ` [PATCH v3 09/10] selftests: remove unneeded include Christian Brauner
@ 2024-12-12 23:03 ` Christian Brauner
  2024-12-13 19:03 ` [PATCH v3 00/10] fs: lockless mntns lookup Jeff Layton
  10 siblings, 0 replies; 20+ messages in thread
From: Christian Brauner @ 2024-12-12 23:03 UTC (permalink / raw)
  To: Josef Bacik, Jeff Layton
  Cc: Paul E. McKenney, Peter Ziljstra, linux-fsdevel,
	Christian Brauner

Add a sample program illustrating how to list all mounts in all mount
namespaces.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 samples/vfs/.gitignore             |   1 +
 samples/vfs/Makefile               |   2 +-
 samples/vfs/test-list-all-mounts.c | 235 +++++++++++++++++++++++++++++++++++++
 3 files changed, 237 insertions(+), 1 deletion(-)

diff --git a/samples/vfs/.gitignore b/samples/vfs/.gitignore
index 79212d91285bca72b0ff85f28aaccd2e803ac092..8694dd17b318768b975ece5c7cd450c2cca67318 100644
--- a/samples/vfs/.gitignore
+++ b/samples/vfs/.gitignore
@@ -1,3 +1,4 @@
 # SPDX-License-Identifier: GPL-2.0-only
 /test-fsmount
+/test-list-all-mounts
 /test-statx
diff --git a/samples/vfs/Makefile b/samples/vfs/Makefile
index 6377a678134acf0d682151d751d2f5042dbf5e0a..301be72a52a0e376c7ebe235cc2058992919cc78 100644
--- a/samples/vfs/Makefile
+++ b/samples/vfs/Makefile
@@ -1,4 +1,4 @@
 # SPDX-License-Identifier: GPL-2.0-only
-userprogs-always-y += test-fsmount test-statx
+userprogs-always-y += test-fsmount test-statx test-list-all-mounts
 
 userccflags += -I usr/include
diff --git a/samples/vfs/test-list-all-mounts.c b/samples/vfs/test-list-all-mounts.c
new file mode 100644
index 0000000000000000000000000000000000000000..f372d5aea4717fd1ab3d4b3f9af79316cd5dd3d3
--- /dev/null
+++ b/samples/vfs/test-list-all-mounts.c
@@ -0,0 +1,235 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+// Copyright (c) 2024 Christian Brauner <brauner@kernel.org>
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <limits.h>
+#include <linux/types.h>
+#include <stdio.h>
+#include <sys/ioctl.h>
+#include <sys/syscall.h>
+
+#include "../../tools/testing/selftests/pidfd/pidfd.h"
+
+#define die_errno(format, ...)                                             \
+	do {                                                               \
+		fprintf(stderr, "%m | %s: %d: %s: " format "\n", __FILE__, \
+			__LINE__, __func__, ##__VA_ARGS__);                \
+		exit(EXIT_FAILURE);                                        \
+	} while (0)
+
+/* Get the id for a mount namespace */
+#define NS_GET_MNTNS_ID _IO(0xb7, 0x5)
+/* Get next mount namespace. */
+
+struct mnt_ns_info {
+	__u32 size;
+	__u32 nr_mounts;
+	__u64 mnt_ns_id;
+};
+
+#define MNT_NS_INFO_SIZE_VER0 16 /* size of first published struct */
+
+/* Get information about namespace. */
+#define NS_MNT_GET_INFO _IOR(0xb7, 10, struct mnt_ns_info)
+/* Get next namespace. */
+#define NS_MNT_GET_NEXT _IOR(0xb7, 11, struct mnt_ns_info)
+/* Get previous namespace. */
+#define NS_MNT_GET_PREV _IOR(0xb7, 12, struct mnt_ns_info)
+
+#define PIDFD_GET_MNT_NAMESPACE _IO(0xFF, 3)
+
+#ifndef __NR_listmount
+#define __NR_listmount 458
+#endif
+
+#ifndef __NR_statmount
+#define __NR_statmount 457
+#endif
+
+/* @mask bits for statmount(2) */
+#define STATMOUNT_SB_BASIC		0x00000001U /* Want/got sb_... */
+#define STATMOUNT_MNT_BASIC		0x00000002U /* Want/got mnt_... */
+#define STATMOUNT_PROPAGATE_FROM	0x00000004U /* Want/got propagate_from */
+#define STATMOUNT_MNT_ROOT		0x00000008U /* Want/got mnt_root  */
+#define STATMOUNT_MNT_POINT		0x00000010U /* Want/got mnt_point */
+#define STATMOUNT_FS_TYPE		0x00000020U /* Want/got fs_type */
+#define STATMOUNT_MNT_NS_ID		0x00000040U /* Want/got mnt_ns_id */
+#define STATMOUNT_MNT_OPTS		0x00000080U /* Want/got mnt_opts */
+
+#define STATX_MNT_ID_UNIQUE 0x00004000U /* Want/got extended stx_mount_id */
+
+struct statmount {
+	__u32 size;
+	__u32 mnt_opts;
+	__u64 mask;
+	__u32 sb_dev_major;
+	__u32 sb_dev_minor;
+	__u64 sb_magic;
+	__u32 sb_flags;
+	__u32 fs_type;
+	__u64 mnt_id;
+	__u64 mnt_parent_id;
+	__u32 mnt_id_old;
+	__u32 mnt_parent_id_old;
+	__u64 mnt_attr;
+	__u64 mnt_propagation;
+	__u64 mnt_peer_group;
+	__u64 mnt_master;
+	__u64 propagate_from;
+	__u32 mnt_root;
+	__u32 mnt_point;
+	__u64 mnt_ns_id;
+	__u64 __spare2[49];
+	char str[];
+};
+
+struct mnt_id_req {
+	__u32 size;
+	__u32 spare;
+	__u64 mnt_id;
+	__u64 param;
+	__u64 mnt_ns_id;
+};
+
+#define MNT_ID_REQ_SIZE_VER1 32 /* sizeof second published struct */
+
+#define LSMT_ROOT 0xffffffffffffffff /* root mount */
+
+static int __statmount(__u64 mnt_id, __u64 mnt_ns_id, __u64 mask,
+		       struct statmount *stmnt, size_t bufsize,
+		       unsigned int flags)
+{
+	struct mnt_id_req req = {
+		.size		= MNT_ID_REQ_SIZE_VER1,
+		.mnt_id		= mnt_id,
+		.param		= mask,
+		.mnt_ns_id	= mnt_ns_id,
+	};
+
+	return syscall(__NR_statmount, &req, stmnt, bufsize, flags);
+}
+
+static struct statmount *sys_statmount(__u64 mnt_id, __u64 mnt_ns_id,
+				       __u64 mask, unsigned int flags)
+{
+	size_t bufsize = 1 << 15;
+	struct statmount *stmnt = NULL, *tmp = NULL;
+	int ret;
+
+	for (;;) {
+		tmp = realloc(stmnt, bufsize);
+		if (!tmp)
+			goto out;
+
+		stmnt = tmp;
+		ret = __statmount(mnt_id, mnt_ns_id, mask, stmnt, bufsize, flags);
+		if (!ret)
+			return stmnt;
+
+		if (errno != EOVERFLOW)
+			goto out;
+
+		bufsize <<= 1;
+		if (bufsize >= UINT_MAX / 2)
+			goto out;
+	}
+
+out:
+	free(stmnt);
+	return NULL;
+}
+
+static ssize_t sys_listmount(__u64 mnt_id, __u64 last_mnt_id, __u64 mnt_ns_id,
+			     __u64 list[], size_t num, unsigned int flags)
+{
+	struct mnt_id_req req = {
+		.size		= MNT_ID_REQ_SIZE_VER1,
+		.mnt_id		= mnt_id,
+		.param		= last_mnt_id,
+		.mnt_ns_id	= mnt_ns_id,
+	};
+
+	return syscall(__NR_listmount, &req, list, num, flags);
+}
+
+int main(int argc, char *argv[])
+{
+#define LISTMNT_BUFFER 10
+	__u64 list[LISTMNT_BUFFER], last_mnt_id = 0;
+	int ret, pidfd, fd_mntns;
+	struct mnt_ns_info info = {};
+
+	pidfd = sys_pidfd_open(getpid(), 0);
+	if (pidfd < 0)
+		die_errno("pidfd_open failed");
+
+	fd_mntns = ioctl(pidfd, PIDFD_GET_MNT_NAMESPACE, 0);
+	if (fd_mntns < 0)
+		die_errno("ioctl(PIDFD_GET_MNT_NAMESPACE) failed");
+
+	ret = ioctl(fd_mntns, NS_MNT_GET_INFO, &info);
+	if (ret < 0)
+		die_errno("ioctl(NS_GET_MNTNS_ID) failed");
+
+	printf("Listing %u mounts for mount namespace %llu\n",
+	       info.nr_mounts, info.mnt_ns_id);
+	for (;;) {
+		ssize_t nr_mounts;
+next:
+		nr_mounts = sys_listmount(LSMT_ROOT, last_mnt_id,
+					  info.mnt_ns_id, list, LISTMNT_BUFFER,
+					  0);
+		if (nr_mounts <= 0) {
+			int fd_mntns_next;
+
+			printf("Finished listing %u mounts for mount namespace %llu\n\n",
+			       info.nr_mounts, info.mnt_ns_id);
+			fd_mntns_next = ioctl(fd_mntns, NS_MNT_GET_NEXT, &info);
+			if (fd_mntns_next < 0) {
+				if (errno == ENOENT) {
+					printf("Finished listing all mount namespaces\n");
+					exit(0);
+				}
+				die_errno("ioctl(NS_MNT_GET_NEXT) failed");
+			}
+			close(fd_mntns);
+			fd_mntns = fd_mntns_next;
+			last_mnt_id = 0;
+			printf("Listing %u mounts for mount namespace %llu\n",
+			       info.nr_mounts, info.mnt_ns_id);
+			goto next;
+		}
+
+		for (size_t cur = 0; cur < nr_mounts; cur++) {
+			struct statmount *stmnt;
+
+			last_mnt_id = list[cur];
+
+			stmnt = sys_statmount(last_mnt_id, info.mnt_ns_id,
+					      STATMOUNT_SB_BASIC |
+					      STATMOUNT_MNT_BASIC |
+					      STATMOUNT_MNT_ROOT |
+					      STATMOUNT_MNT_POINT |
+					      STATMOUNT_MNT_NS_ID |
+					      STATMOUNT_MNT_OPTS |
+					      STATMOUNT_FS_TYPE, 0);
+			if (!stmnt) {
+				printf("Failed to statmount(%llu) in mount namespace(%llu)\n",
+				       last_mnt_id, info.mnt_ns_id);
+				continue;
+			}
+
+			printf("mnt_id:\t\t%llu\nmnt_parent_id:\t%llu\nfs_type:\t%s\nmnt_root:\t%s\nmnt_point:\t%s\nmnt_opts:\t%s\n\n",
+			       stmnt->mnt_id,
+			       stmnt->mnt_parent_id,
+			       stmnt->str + stmnt->fs_type,
+			       stmnt->str + stmnt->mnt_root,
+			       stmnt->str + stmnt->mnt_point,
+			       stmnt->str + stmnt->mnt_opts);
+			free(stmnt);
+		}
+	}
+
+	exit(0);
+}

-- 
2.45.2


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 00/10] fs: lockless mntns lookup
  2024-12-12 23:03 [PATCH v3 00/10] fs: lockless mntns lookup Christian Brauner
                   ` (9 preceding siblings ...)
  2024-12-12 23:03 ` [PATCH v3 10/10] samples: add test-list-all-mounts Christian Brauner
@ 2024-12-13 19:03 ` Jeff Layton
  10 siblings, 0 replies; 20+ messages in thread
From: Jeff Layton @ 2024-12-13 19:03 UTC (permalink / raw)
  To: Christian Brauner, Josef Bacik
  Cc: Paul E. McKenney, Peter Ziljstra, linux-fsdevel

On Fri, 2024-12-13 at 00:03 +0100, Christian Brauner wrote:
> Hey,
> 
> This now also includes selftests for iterating mount namespaces both
> backwards and forwards.
> 
> Currently we take the read lock when looking for a mount namespace to
> list mounts in. We can make this lockless. The simple search case can
> just use a sequence counter to detect concurrent changes to the rbtree.
> 
> For walking the list of mount namespaces sequentially via nsfs we keep a
> separate rcu list as rb_prev() and rb_next() aren't usable safely with
> rcu.
> 
> Since creating mount namespaces is a relatively rare event compared with
> querying mounts in a foreign mount namespace this is worth it. Once
> libmount and systemd pick up this mechanism to list mounts in foreign
> mount namespaces this will be used very frequently.
> 
> Thanks!
> Christian
> 
> ---
> Changes in v3:
> - Add selftests.
> - Put list_head into a union with the wait_queue_head_t for poll instead
>   of the mnt_ns_tree_node which would've risked breaking rbtree
>   traversal.
> - Handle insertion into the mount namespace list correctly by making use
>   of the rbtree position information after the mount namespace has been
>   added to it.
> - Improve the documentation for the new list_bidir_{del,prev}_rcu().
> - Link to v2: https://lore.kernel.org/r/20241212-work-mount-rbtree-lockless-v2-0-4fe6cef02534@kernel.org
> 
> Changes in v2:
> - Remove mnt_ns_find_it_at() by switching to rb_find_rcu().
> - Add separate list to lookup sequential mount namespaces.
> - Link to v1: https://lore.kernel.org/r/20241210-work-mount-rbtree-lockless-v1-0-338366b9bbe4@kernel.org
> 
> ---
> Christian Brauner (10):
>       mount: remove inlude/nospec.h include
>       fs: add mount namespace to rbtree late
>       fs: lockless mntns rbtree lookup
>       rculist: add list_bidir_{del,prev}_rcu()
>       fs: lockless mntns lookup for nsfs
>       fs: simplify rwlock to spinlock
>       seltests: move nsfs into filesystems subfolder
>       selftests: add tests for mntns iteration
>       selftests: remove unneeded include
>       samples: add test-list-all-mounts
> 
>  fs/mount.h                                         |  18 +-
>  fs/namespace.c                                     | 163 ++++++++------
>  fs/nsfs.c                                          |   5 +-
>  include/linux/rculist.h                            |  47 +++++
>  samples/vfs/.gitignore                             |   1 +
>  samples/vfs/Makefile                               |   2 +-
>  samples/vfs/test-list-all-mounts.c                 | 235 +++++++++++++++++++++
>  .../selftests/{ => filesystems}/nsfs/.gitignore    |   1 +
>  .../selftests/{ => filesystems}/nsfs/Makefile      |   4 +-
>  .../selftests/{ => filesystems}/nsfs/config        |   0
>  .../selftests/filesystems/nsfs/iterate_mntns.c     | 149 +++++++++++++
>  .../selftests/{ => filesystems}/nsfs/owner.c       |   0
>  .../selftests/{ => filesystems}/nsfs/pidns.c       |   0
>  tools/testing/selftests/pidfd/pidfd.h              |   1 -
>  14 files changed, 546 insertions(+), 80 deletions(-)
> ---
> base-commit: 40384c840ea1944d7c5a392e8975ed088ecf0b37
> change-id: 20241207-work-mount-rbtree-lockless-7d4071b74f18
> 

Reviewed-by: Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2024-12-19 13:46 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-12-12 23:03 [PATCH v3 00/10] fs: lockless mntns lookup Christian Brauner
2024-12-12 23:03 ` [PATCH v3 01/10] mount: remove inlude/nospec.h include Christian Brauner
2024-12-12 23:03 ` [PATCH v3 02/10] fs: add mount namespace to rbtree late Christian Brauner
2024-12-12 23:03 ` [PATCH v3 03/10] fs: lockless mntns rbtree lookup Christian Brauner
2024-12-13  8:50   ` Peter Zijlstra
2024-12-13 14:11   ` Jeff Layton
2024-12-13 18:44     ` Christian Brauner
2024-12-13 19:02       ` Jeff Layton
2024-12-19  9:20   ` Lai, Yi
2024-12-19 13:45     ` Christian Brauner
2024-12-12 23:03 ` [PATCH v3 04/10] rculist: add list_bidir_{del,prev}_rcu() Christian Brauner
2024-12-13  0:42   ` Paul E. McKenney
2024-12-13 13:49     ` Christian Brauner
2024-12-12 23:03 ` [PATCH v3 05/10] fs: lockless mntns lookup for nsfs Christian Brauner
2024-12-12 23:03 ` [PATCH v3 06/10] fs: simplify rwlock to spinlock Christian Brauner
2024-12-12 23:03 ` [PATCH v3 07/10] seltests: move nsfs into filesystems subfolder Christian Brauner
2024-12-12 23:03 ` [PATCH v3 08/10] selftests: add tests for mntns iteration Christian Brauner
2024-12-12 23:03 ` [PATCH v3 09/10] selftests: remove unneeded include Christian Brauner
2024-12-12 23:03 ` [PATCH v3 10/10] samples: add test-list-all-mounts Christian Brauner
2024-12-13 19:03 ` [PATCH v3 00/10] fs: lockless mntns lookup Jeff Layton

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.