[PATCH v5 bpf-next 0/5] bpf path iterator

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v5 bpf-next 0/5] bpf path iterator
@ 2025-06-17  6:11 Song Liu
  2025-06-17  6:11 ` [PATCH v5 bpf-next 1/5] namei: Introduce new helper function path_walk_parent() Song Liu
                   ` (5 more replies)
  0 siblings, 6 replies; 50+ messages in thread
From: Song Liu @ 2025-06-17  6:11 UTC (permalink / raw)
  To: bpf, linux-fsdevel, linux-kernel, linux-security-module
  Cc: kernel-team, andrii, eddyz87, ast, daniel, martin.lau, viro,
	brauner, jack, kpsingh, mattbobrowski, m, neil, Song Liu

In security use cases, it is common to apply rules to VFS subtrees.
However, filtering files in a subtree is not straightforward [1].

One solution to this problem is to start from a path and walk up the VFS
tree (towards the root). Among in-tree LSMs, Landlock uses this solution.

BPF LSM solutions, such like Tetragon [2], also use similar approaches.
However, due to lack of proper helper/kfunc support, BPF LSM solutions
usually do the path walk with probe read, which is racy.

This patchset introduces a new helper path_walk_parent, which walks
path to its VFS parent. The helper is used in Landlock.

A new BPF iterator, path iterator, is introduced to do the path walking.
The BPF path iterator uses the new path_walk_parent help to walk the VFS
tree.

Changes v4 => v5:
1. Minor changes to path_walk_parent per suggestions by Neil Brown and
   Tingmao Wang.

v4: https://lore.kernel.org/bpf/20250611220220.3681382-1-song@kernel.org/

Changes v3 => v4:
1. Rewrite path_walk_parent based on suggestion from Neil Brown.
2. When path_walk_parent() returns false, it call path_put on @path and
   then zeros @path.

v3: https://lore.kernel.org/bpf/20250606213015.255134-1-song@kernel.org/

Changes v2 => v3:
1. Fix an issue with path_walk_parent.
2. Move bpf path iterator to fs/bpf_fs_kfuncs.c
3. Optimize bpf path iterator (less memory).
4. Add more selftests.
5. Add more comments.

v2: https://lore.kernel.org/bpf/20250603065920.3404510-1-song@kernel.org/

Changes v1 => v2:
1. Rename path_parent => path_walk_parent.
2. Remove path_connected check in path_walk_parent.
3. Fix is_access_to_paths_allowed().
4. Remove mode for path iterator, add a flag instead.

v1: https://lore.kernel.org/bpf/20250528222623.1373000-1-song@kernel.org/

[1] https://lpc.events/event/18/contributions/1940/
[2] https://github.com/cilium/tetragon/

Song Liu (5):
  namei: Introduce new helper function path_walk_parent()
  landlock: Use path_walk_parent()
  bpf: Introduce path iterator
  selftests/bpf: Add tests for bpf path iterator
  selftests/bpf: Path walk test

 fs/bpf_fs_kfuncs.c                            |  72 +++++++++
 fs/namei.c                                    |  95 +++++++++---
 include/linux/namei.h                         |   2 +
 kernel/bpf/verifier.c                         |   5 +
 security/landlock/fs.c                        |  30 +---
 .../testing/selftests/bpf/bpf_experimental.h  |   6 +
 .../selftests/bpf/prog_tests/path_iter.c      | 111 ++++++++++++++
 tools/testing/selftests/bpf/progs/path_iter.c | 145 ++++++++++++++++++
 tools/testing/selftests/bpf/progs/path_walk.c |  59 +++++++
 9 files changed, 485 insertions(+), 40 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/path_iter.c
 create mode 100644 tools/testing/selftests/bpf/progs/path_iter.c
 create mode 100644 tools/testing/selftests/bpf/progs/path_walk.c

--
2.47.1

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH v5 bpf-next 1/5] namei: Introduce new helper function path_walk_parent()
  2025-06-17  6:11 [PATCH v5 bpf-next 0/5] bpf path iterator Song Liu
@ 2025-06-17  6:11 ` Song Liu
  2025-06-18  1:02   ` kernel test robot
                     ` (2 more replies)
  2025-06-17  6:11 ` [PATCH v5 bpf-next 2/5] landlock: Use path_walk_parent() Song Liu
                   ` (4 subsequent siblings)
  5 siblings, 3 replies; 50+ messages in thread
From: Song Liu @ 2025-06-17  6:11 UTC (permalink / raw)
  To: bpf, linux-fsdevel, linux-kernel, linux-security-module
  Cc: kernel-team, andrii, eddyz87, ast, daniel, martin.lau, viro,
	brauner, jack, kpsingh, mattbobrowski, m, neil, Song Liu

This helper walks an input path to its parent. Logic are added to handle
walking across mount tree.

This will be used by landlock, and BPF LSM.

Suggested-by: Neil Brown <neil@brown.name>
Signed-off-by: Song Liu <song@kernel.org>
---
 fs/namei.c            | 95 +++++++++++++++++++++++++++++++++++--------
 include/linux/namei.h |  2 +
 2 files changed, 79 insertions(+), 18 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 4bb889fc980b..d0557c0b5cc8 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2048,36 +2048,95 @@ static struct dentry *follow_dotdot_rcu(struct nameidata *nd)
 	return nd->path.dentry;
 }
 
-static struct dentry *follow_dotdot(struct nameidata *nd)
+/**
+ * __path_walk_parent - Find the parent of the given struct path
+ * @path  - The struct path to start from
+ * @root  - A struct path which serves as a boundary not to be crosses.
+ *        - If @root is zero'ed, walk all the way to global root.
+ * @flags - Some LOOKUP_ flags.
+ *
+ * Find and return the dentry for the parent of the given path
+ * (mount/dentry). If the given path is the root of a mounted tree, it
+ * is first updated to the mount point on which that tree is mounted.
+ *
+ * If %LOOKUP_NO_XDEV is given, then *after* the path is updated to a new
+ * mount, the error EXDEV is returned.
+ *
+ * If no parent can be found, either because the tree is not mounted or
+ * because the @path matches the @root, then @path->dentry is returned
+ * unless @flags contains %LOOKUP_BENEATH, in which case -EXDEV is returned.
+ *
+ * Returns: either an ERR_PTR() or the chosen parent which will have had
+ * the refcount incremented.
+ */
+static struct dentry *__path_walk_parent(struct path *path, const struct path *root, int flags)
 {
-	struct dentry *parent;
-
-	if (path_equal(&nd->path, &nd->root))
+	if (path_equal(path, root))
 		goto in_root;
-	if (unlikely(nd->path.dentry == nd->path.mnt->mnt_root)) {
-		struct path path;
+	if (unlikely(path->dentry == path->mnt->mnt_root)) {
+		struct path new_path;
 
-		if (!choose_mountpoint(real_mount(nd->path.mnt),
-				       &nd->root, &path))
+		if (!choose_mountpoint(real_mount(path->mnt),
+				       root, &new_path))
 			goto in_root;
-		path_put(&nd->path);
-		nd->path = path;
-		nd->inode = path.dentry->d_inode;
-		if (unlikely(nd->flags & LOOKUP_NO_XDEV))
+		path_put(path);
+		*path = new_path;
+		if (unlikely(flags & LOOKUP_NO_XDEV))
 			return ERR_PTR(-EXDEV);
 	}
 	/* rare case of legitimate dget_parent()... */
-	parent = dget_parent(nd->path.dentry);
+	return dget_parent(path->dentry);
+
+in_root:
+	if (unlikely(flags & LOOKUP_BENEATH))
+		return ERR_PTR(-EXDEV);
+	return dget(path->dentry);
+}
+
+/**
+ * path_walk_parent - Walk to the parent of path
+ * @path: input and output path.
+ * @root: root of the path walk, do not go beyond this root. If @root is
+ *        zero'ed, walk all the way to real root.
+ *
+ * Given a path, find the parent path. Replace @path with the parent path.
+ * If we were already at the real root or a disconnected root, @path is
+ * not changed.
+ *
+ * Returns:
+ *  0  - if @path is updated to its parent.
+ *  <0 - if @path is already the root (real root or @root).
+ */
+int path_walk_parent(struct path *path, const struct path *root)
+{
+	struct dentry *parent;
+
+	parent = __path_walk_parent(path, root, LOOKUP_BENEATH);
+
+	if (IS_ERR(parent))
+		return PTR_ERR(parent);
+
+	if (parent == path->dentry) {
+		dput(parent);
+		return -ENOENT;
+	}
+	dput(path->dentry);
+	path->dentry = parent;
+	return 0;
+}
+
+static struct dentry *follow_dotdot(struct nameidata *nd)
+{
+	struct dentry *parent = __path_walk_parent(&nd->path, &nd->root, nd->flags);
+
+	if (IS_ERR(parent))
+		return parent;
 	if (unlikely(!path_connected(nd->path.mnt, parent))) {
 		dput(parent);
 		return ERR_PTR(-ENOENT);
 	}
+	nd->inode = nd->path.dentry->d_inode;
 	return parent;
-
-in_root:
-	if (unlikely(nd->flags & LOOKUP_BENEATH))
-		return ERR_PTR(-EXDEV);
-	return dget(nd->path.dentry);
 }
 
 static const char *handle_dots(struct nameidata *nd, int type)
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 5d085428e471..ca68fa4089e0 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -85,6 +85,8 @@ extern int follow_down_one(struct path *);
 extern int follow_down(struct path *path, unsigned int flags);
 extern int follow_up(struct path *);
 
+int path_walk_parent(struct path *path, const struct path *root);
+
 extern struct dentry *lock_rename(struct dentry *, struct dentry *);
 extern struct dentry *lock_rename_child(struct dentry *, struct dentry *);
 extern void unlock_rename(struct dentry *, struct dentry *);
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v5 bpf-next 2/5] landlock: Use path_walk_parent()
  2025-06-17  6:11 [PATCH v5 bpf-next 0/5] bpf path iterator Song Liu
  2025-06-17  6:11 ` [PATCH v5 bpf-next 1/5] namei: Introduce new helper function path_walk_parent() Song Liu
@ 2025-06-17  6:11 ` Song Liu
  2025-07-03 18:29   ` Mickaël Salaün
  2025-06-17  6:11 ` [PATCH v5 bpf-next 3/5] bpf: Introduce path iterator Song Liu
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 50+ messages in thread
From: Song Liu @ 2025-06-17  6:11 UTC (permalink / raw)
  To: bpf, linux-fsdevel, linux-kernel, linux-security-module
  Cc: kernel-team, andrii, eddyz87, ast, daniel, martin.lau, viro,
	brauner, jack, kpsingh, mattbobrowski, m, neil, Song Liu

Use path_walk_parent() to walk a path up to its parent.

No functional changes intended.

Signed-off-by: Song Liu <song@kernel.org>
---
 security/landlock/fs.c | 30 ++++++++----------------------
 1 file changed, 8 insertions(+), 22 deletions(-)

diff --git a/security/landlock/fs.c b/security/landlock/fs.c
index 6fee7c20f64d..e26ab8c34dd4 100644
--- a/security/landlock/fs.c
+++ b/security/landlock/fs.c
@@ -837,8 +837,8 @@ static bool is_access_to_paths_allowed(
 	 * restriction.
 	 */
 	while (true) {
-		struct dentry *parent_dentry;
 		const struct landlock_rule *rule;
+		struct path root = {};
 
 		/*
 		 * If at least all accesses allowed on the destination are
@@ -895,34 +895,20 @@ static bool is_access_to_paths_allowed(
 		/* Stops when a rule from each layer grants access. */
 		if (allowed_parent1 && allowed_parent2)
 			break;
-jump_up:
-		if (walker_path.dentry == walker_path.mnt->mnt_root) {
-			if (follow_up(&walker_path)) {
-				/* Ignores hidden mount points. */
-				goto jump_up;
-			} else {
-				/*
-				 * Stops at the real root.  Denies access
-				 * because not all layers have granted access.
-				 */
-				break;
-			}
-		}
-		if (unlikely(IS_ROOT(walker_path.dentry))) {
+
+		if (unlikely(IS_ROOT(walker_path.dentry)) &&
+		    (walker_path.mnt->mnt_flags & MNT_INTERNAL)) {
 			/*
 			 * Stops at disconnected root directories.  Only allows
 			 * access to internal filesystems (e.g. nsfs, which is
 			 * reachable through /proc/<pid>/ns/<namespace>).
 			 */
-			if (walker_path.mnt->mnt_flags & MNT_INTERNAL) {
-				allowed_parent1 = true;
-				allowed_parent2 = true;
-			}
+			allowed_parent1 = true;
+			allowed_parent2 = true;
 			break;
 		}
-		parent_dentry = dget_parent(walker_path.dentry);
-		dput(walker_path.dentry);
-		walker_path.dentry = parent_dentry;
+		if (path_walk_parent(&walker_path, &root))
+			break;
 	}
 	path_put(&walker_path);
 
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v5 bpf-next 3/5] bpf: Introduce path iterator
  2025-06-17  6:11 [PATCH v5 bpf-next 0/5] bpf path iterator Song Liu
  2025-06-17  6:11 ` [PATCH v5 bpf-next 1/5] namei: Introduce new helper function path_walk_parent() Song Liu
  2025-06-17  6:11 ` [PATCH v5 bpf-next 2/5] landlock: Use path_walk_parent() Song Liu
@ 2025-06-17  6:11 ` Song Liu
  2025-06-17  6:11 ` [PATCH v5 bpf-next 4/5] selftests/bpf: Add tests for bpf " Song Liu
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 50+ messages in thread
From: Song Liu @ 2025-06-17  6:11 UTC (permalink / raw)
  To: bpf, linux-fsdevel, linux-kernel, linux-security-module
  Cc: kernel-team, andrii, eddyz87, ast, daniel, martin.lau, viro,
	brauner, jack, kpsingh, mattbobrowski, m, neil, Song Liu

Introduce a path iterator, which walks a struct path toward the root.
This path iterator is based on path_walk_parent. A fixed zero'ed root
is passed to path_walk_parent(). Therefore, unless the user terminates
it earlier, the iterator will terminate at the real root.

Signed-off-by: Song Liu <song@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
---
 fs/bpf_fs_kfuncs.c    | 72 +++++++++++++++++++++++++++++++++++++++++++
 kernel/bpf/verifier.c |  5 +++
 2 files changed, 77 insertions(+)

diff --git a/fs/bpf_fs_kfuncs.c b/fs/bpf_fs_kfuncs.c
index 08412532db1b..888867678981 100644
--- a/fs/bpf_fs_kfuncs.c
+++ b/fs/bpf_fs_kfuncs.c
@@ -10,6 +10,7 @@
 #include <linux/fsnotify.h>
 #include <linux/file.h>
 #include <linux/mm.h>
+#include <linux/namei.h>
 #include <linux/xattr.h>
 
 __bpf_kfunc_start_defs();
@@ -324,6 +325,74 @@ __bpf_kfunc int bpf_remove_dentry_xattr(struct dentry *dentry, const char *name_
 
 __bpf_kfunc_end_defs();
 
+/* open-coded path iterator */
+struct bpf_iter_path {
+	__u64 __opaque[2];
+} __aligned(8);
+
+struct bpf_iter_path_kern {
+	struct path path;
+} __aligned(8);
+
+__bpf_kfunc_start_defs();
+
+__bpf_kfunc int bpf_iter_path_new(struct bpf_iter_path *it,
+				  struct path *start,
+				  __u64 flags)
+{
+	struct bpf_iter_path_kern *kit = (void *)it;
+
+	BUILD_BUG_ON(sizeof(*kit) > sizeof(*it));
+	BUILD_BUG_ON(__alignof__(*kit) != __alignof__(*it));
+
+	if (flags) {
+		/*
+		 * _destroy() is still called when _new() fails. Zero
+		 * kit->path so that it be passed to path_put() safely.
+		 * Note: path_put() is no-op for zero'ed path.
+		 */
+		memset(&kit->path, 0, sizeof(struct path));
+		return -EINVAL;
+	}
+
+	kit->path = *start;
+	path_get(&kit->path);
+
+	return 0;
+}
+
+__bpf_kfunc struct path *bpf_iter_path_next(struct bpf_iter_path *it)
+{
+	struct bpf_iter_path_kern *kit = (void *)it;
+	struct path root = {};
+
+	/*
+	 * "root" is zero'ed. Therefore, unless the loop is explicitly
+	 * terminated, bpf_iter_path_next() will continue looping until
+	 * we've reached the global root of the VFS.
+	 *
+	 * If a root of walk is needed, the user can check "path" against
+	 * that root on each iteration.
+	 */
+	if (path_walk_parent(&kit->path, &root))
+		return NULL;
+
+	return &kit->path;
+}
+
+__bpf_kfunc void bpf_iter_path_destroy(struct bpf_iter_path *it)
+{
+	struct bpf_iter_path_kern *kit = (void *)it;
+
+	/*
+	 * kit->path might be zero'ed, but this is OK because path_put()
+	 * is no-op for zero'ed struct path
+	 */
+	path_put(&kit->path);
+}
+
+__bpf_kfunc_end_defs();
+
 BTF_KFUNCS_START(bpf_fs_kfunc_set_ids)
 BTF_ID_FLAGS(func, bpf_get_task_exe_file,
 	     KF_ACQUIRE | KF_TRUSTED_ARGS | KF_RET_NULL)
@@ -333,6 +402,9 @@ BTF_ID_FLAGS(func, bpf_get_dentry_xattr, KF_SLEEPABLE | KF_TRUSTED_ARGS)
 BTF_ID_FLAGS(func, bpf_get_file_xattr, KF_SLEEPABLE | KF_TRUSTED_ARGS)
 BTF_ID_FLAGS(func, bpf_set_dentry_xattr, KF_SLEEPABLE | KF_TRUSTED_ARGS)
 BTF_ID_FLAGS(func, bpf_remove_dentry_xattr, KF_SLEEPABLE | KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, bpf_iter_path_new, KF_ITER_NEW | KF_TRUSTED_ARGS | KF_SLEEPABLE)
+BTF_ID_FLAGS(func, bpf_iter_path_next, KF_ITER_NEXT | KF_RET_NULL | KF_SLEEPABLE)
+BTF_ID_FLAGS(func, bpf_iter_path_destroy, KF_ITER_DESTROY | KF_SLEEPABLE)
 BTF_KFUNCS_END(bpf_fs_kfunc_set_ids)
 
 static int bpf_fs_kfuncs_filter(const struct bpf_prog *prog, u32 kfunc_id)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 279a64933262..b495c3cc4095 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -7101,6 +7101,10 @@ BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct socket) {
 	struct sock *sk;
 };
 
+BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct path) {
+	struct dentry *dentry;
+};
+
 static bool type_is_rcu(struct bpf_verifier_env *env,
 			struct bpf_reg_state *reg,
 			const char *field_name, u32 btf_id)
@@ -7141,6 +7145,7 @@ static bool type_is_trusted_or_null(struct bpf_verifier_env *env,
 				    const char *field_name, u32 btf_id)
 {
 	BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct socket));
+	BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct path));
 
 	return btf_nested_type_is_trusted(&env->log, reg, field_name, btf_id,
 					  "__safe_trusted_or_null");
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v5 bpf-next 4/5] selftests/bpf: Add tests for bpf path iterator
  2025-06-17  6:11 [PATCH v5 bpf-next 0/5] bpf path iterator Song Liu
                   ` (2 preceding siblings ...)
  2025-06-17  6:11 ` [PATCH v5 bpf-next 3/5] bpf: Introduce path iterator Song Liu
@ 2025-06-17  6:11 ` Song Liu
  2025-06-17  6:11 ` [PATCH v5 bpf-next 5/5] selftests/bpf: Path walk test Song Liu
  2025-06-20 21:59 ` [PATCH v5 bpf-next 0/5] bpf path iterator Song Liu
  5 siblings, 0 replies; 50+ messages in thread
From: Song Liu @ 2025-06-17  6:11 UTC (permalink / raw)
  To: bpf, linux-fsdevel, linux-kernel, linux-security-module
  Cc: kernel-team, andrii, eddyz87, ast, daniel, martin.lau, viro,
	brauner, jack, kpsingh, mattbobrowski, m, neil, Song Liu

Add tests for bpf path iterator, including test cases similar to real
workload (call bpf_path_d_path and bpf_get_dentry_xattr), and test cases
where the verifier rejects invalid use of the iterator.

Signed-off-by: Song Liu <song@kernel.org>
---
 .../testing/selftests/bpf/bpf_experimental.h  |   6 +
 .../selftests/bpf/prog_tests/path_iter.c      |  12 ++
 tools/testing/selftests/bpf/progs/path_iter.c | 145 ++++++++++++++++++
 3 files changed, 163 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/path_iter.c
 create mode 100644 tools/testing/selftests/bpf/progs/path_iter.c

diff --git a/tools/testing/selftests/bpf/bpf_experimental.h b/tools/testing/selftests/bpf/bpf_experimental.h
index 5e512a1d09d1..cbb759b473df 100644
--- a/tools/testing/selftests/bpf/bpf_experimental.h
+++ b/tools/testing/selftests/bpf/bpf_experimental.h
@@ -596,4 +596,10 @@ extern int bpf_iter_dmabuf_new(struct bpf_iter_dmabuf *it) __weak __ksym;
 extern struct dma_buf *bpf_iter_dmabuf_next(struct bpf_iter_dmabuf *it) __weak __ksym;
 extern void bpf_iter_dmabuf_destroy(struct bpf_iter_dmabuf *it) __weak __ksym;
 
+struct bpf_iter_path;
+extern int bpf_iter_path_new(struct bpf_iter_path *it, struct path *start,
+			     __u64 flags) __weak __ksym;
+extern struct path *bpf_iter_path_next(struct bpf_iter_path *it) __weak __ksym;
+extern void bpf_iter_path_destroy(struct bpf_iter_path *it) __weak __ksym;
+
 #endif
diff --git a/tools/testing/selftests/bpf/prog_tests/path_iter.c b/tools/testing/selftests/bpf/prog_tests/path_iter.c
new file mode 100644
index 000000000000..3c99c24fbd96
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/path_iter.c
@@ -0,0 +1,12 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */
+
+#include <test_progs.h>
+#include <bpf/libbpf.h>
+#include <bpf/btf.h>
+#include "path_iter.skel.h"
+
+void test_path_iter(void)
+{
+	RUN_TESTS(path_iter);
+}
diff --git a/tools/testing/selftests/bpf/progs/path_iter.c b/tools/testing/selftests/bpf/progs/path_iter.c
new file mode 100644
index 000000000000..74d0f4e19ffa
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/path_iter.c
@@ -0,0 +1,145 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include "bpf_misc.h"
+#include "bpf_experimental.h"
+
+char _license[] SEC("license") = "GPL";
+
+char path_name[256];
+char xattr_val[64];
+
+static __always_inline void access_path_dentry(struct path *p)
+{
+	struct bpf_dynptr ptr;
+	struct dentry *dentry;
+
+	if (!p)
+		return;
+
+	bpf_dynptr_from_mem(xattr_val, sizeof(xattr_val), 0, &ptr);
+	bpf_path_d_path(p, path_name, sizeof(path_name));
+
+	dentry = p->dentry;
+	if (dentry)
+		bpf_get_dentry_xattr(dentry, "user.xattr", &ptr);
+}
+
+SEC("lsm.s/file_open")
+__success
+int BPF_PROG(open_code, struct file *f)
+{
+	struct bpf_iter_path path_it;
+	struct path *p;
+	int ret;
+
+	ret = bpf_iter_path_new(&path_it, &f->f_path, 0);
+	if (ret) {
+		bpf_iter_path_destroy(&path_it);
+		return 0;
+	}
+
+	p = bpf_iter_path_next(&path_it);
+	access_path_dentry(p);
+	bpf_iter_path_destroy(&path_it);
+
+	return 0;
+}
+
+SEC("lsm.s/file_open")
+__success
+int BPF_PROG(for_each, struct file *f)
+{
+	struct path *p;
+
+	bpf_for_each(path, p, &f->f_path, 0)
+		access_path_dentry(p);
+
+	return 0;
+}
+
+SEC("lsm.s/file_open")
+__failure __msg("Unreleased reference")
+int BPF_PROG(missing_destroy, struct file *f)
+{
+	struct bpf_iter_path path_it;
+
+	bpf_iter_path_new(&path_it, &f->f_path, 0);
+
+	return 0;
+}
+
+SEC("lsm.s/file_open")
+__failure __msg("expected an initialized iter_path")
+int BPF_PROG(missing_new, struct file *f)
+{
+	struct bpf_iter_path path_it;
+
+	bpf_iter_path_destroy(&path_it);
+	return 0;
+}
+
+SEC("lsm.s/file_open")
+__failure __msg("expected uninitialized iter_path")
+int BPF_PROG(new_twice, struct file *f)
+{
+	struct bpf_iter_path path_it;
+
+	bpf_iter_path_new(&path_it, &f->f_path, 0);
+	bpf_iter_path_new(&path_it, &f->f_path, 0);
+	bpf_iter_path_destroy(&path_it);
+	return 0;
+}
+
+SEC("lsm.s/file_open")
+__failure __msg("expected an initialized iter_path")
+int BPF_PROG(destroy_twice, struct file *f)
+{
+	struct bpf_iter_path path_it;
+
+	bpf_iter_path_new(&path_it, &f->f_path, 0);
+	bpf_iter_path_destroy(&path_it);
+	bpf_iter_path_destroy(&path_it);
+	return 0;
+}
+
+SEC("lsm.s/file_open")
+__success
+int BPF_PROG(reuse_path_iter, struct file *f)
+{
+	struct bpf_iter_path path_it;
+
+	bpf_iter_path_new(&path_it, &f->f_path, 0);
+	bpf_iter_path_destroy(&path_it);
+	bpf_iter_path_new(&path_it, &f->f_path, 0);
+	bpf_iter_path_destroy(&path_it);
+	return 0;
+}
+
+SEC("lsm.s/file_open")
+__failure __msg("invalid read from stack off")
+int BPF_PROG(invalid_read_path_iter, struct file *f)
+{
+	struct bpf_iter_path path_it;
+	struct bpf_iter_path path_it_2;
+
+
+	bpf_iter_path_new(&path_it, &f->f_path, 0);
+	path_it_2 = path_it;
+	bpf_iter_path_destroy(&path_it_2);
+	return 0;
+}
+
+SEC("lsm.s/sb_alloc_security")
+__failure __msg("must be referenced or trusted")
+int BPF_PROG(untrusted_path, struct super_block *sb)
+{
+	struct bpf_iter_path path_it;
+
+	bpf_iter_path_new(&path_it, &sb->s_bdev_file->f_path, 0);
+	bpf_iter_path_destroy(&path_it);
+	return 0;
+}
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v5 bpf-next 5/5] selftests/bpf: Path walk test
  2025-06-17  6:11 [PATCH v5 bpf-next 0/5] bpf path iterator Song Liu
                   ` (3 preceding siblings ...)
  2025-06-17  6:11 ` [PATCH v5 bpf-next 4/5] selftests/bpf: Add tests for bpf " Song Liu
@ 2025-06-17  6:11 ` Song Liu
  2025-06-20 21:59 ` [PATCH v5 bpf-next 0/5] bpf path iterator Song Liu
  5 siblings, 0 replies; 50+ messages in thread
From: Song Liu @ 2025-06-17  6:11 UTC (permalink / raw)
  To: bpf, linux-fsdevel, linux-kernel, linux-security-module
  Cc: kernel-team, andrii, eddyz87, ast, daniel, martin.lau, viro,
	brauner, jack, kpsingh, mattbobrowski, m, neil, Song Liu

Add an end-to-end test with path_iter on security hook file_open.

A test file is created in folder /tmp/test_progs_path_iter/folder. On
file_open, walk file->f_path up to its parent and grand parent, and test
bpf_get_dentry_xattr and bpf_path_d_path on the folders.

Signed-off-by: Song Liu <song@kernel.org>
---
 .../selftests/bpf/prog_tests/path_iter.c      | 99 +++++++++++++++++++
 tools/testing/selftests/bpf/progs/path_walk.c | 59 +++++++++++
 2 files changed, 158 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/progs/path_walk.c

diff --git a/tools/testing/selftests/bpf/prog_tests/path_iter.c b/tools/testing/selftests/bpf/prog_tests/path_iter.c
index 3c99c24fbd96..b9772026fbf7 100644
--- a/tools/testing/selftests/bpf/prog_tests/path_iter.c
+++ b/tools/testing/selftests/bpf/prog_tests/path_iter.c
@@ -2,11 +2,110 @@
 /* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */
 
 #include <test_progs.h>
+#include <fcntl.h>
 #include <bpf/libbpf.h>
 #include <bpf/btf.h>
+#include <sys/stat.h>
+#include <sys/xattr.h>
+
 #include "path_iter.skel.h"
+#include "path_walk.skel.h"
+
+static const char grand_parent_path[] = "/tmp/test_progs_path_iter";
+static const char parent_path[] = "/tmp/test_progs_path_iter/folder";
+static const char file_path[] = "/tmp/test_progs_path_iter/folder/file";
+static const char xattr_name[] = "user.bpf.selftests";
+static const char xattr_value[] = "selftest_path_iter";
+
+static void cleanup_files(void)
+{
+	remove(file_path);
+	rmdir(parent_path);
+	rmdir(grand_parent_path);
+}
+
+static int setup_files_and_xattrs(void)
+{
+	int ret = -1;
+
+	/* create test folders */
+	if (mkdir(grand_parent_path, 0755))
+		goto error;
+	if (mkdir(parent_path, 0755))
+		goto error;
+
+	/* setxattr for test folders */
+	ret = setxattr(grand_parent_path, xattr_name,
+		       xattr_value, sizeof(xattr_value), 0);
+	if (ret < 0) {
+		/* return errno, so that we can handle EOPNOTSUPP in the caller */
+		ret = errno;
+		goto error;
+	}
+	ret = setxattr(parent_path, xattr_name,
+		       xattr_value, sizeof(xattr_value), 0);
+	if (ret < 0) {
+		/* return errno, so that we can handle EOPNOTSUPP in the caller */
+		ret = errno;
+		goto error;
+	}
+
+	return 0;
+error:
+	cleanup_files();
+	return ret;
+}
+
+static void test_path_walk(void)
+{
+	struct path_walk *skel = NULL;
+	int file_fd;
+	int err;
+
+	err = setup_files_and_xattrs();
+	if (err == EOPNOTSUPP) {
+		printf("%s:SKIP:local fs doesn't support xattr (%d)\n"
+		       "To run this test, make sure /tmp filesystem supports xattr.\n",
+		       __func__, errno);
+		test__skip();
+		return;
+	}
+
+	if (!ASSERT_OK(err, "setup_file"))
+		return;
+
+	skel = path_walk__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "path_walk__open_and_load"))
+		goto cleanup;
+
+	skel->bss->monitored_pid = getpid();
+	if (!ASSERT_OK(path_walk__attach(skel), "path_walk__attach"))
+		goto cleanup;
+
+	file_fd = open(file_path, O_CREAT);
+	if (!ASSERT_OK_FD(file_fd, "open_file"))
+		goto cleanup;
+	close(file_fd);
+
+	ASSERT_OK(strncmp(skel->bss->parent_xattr_buf, xattr_value, strlen(xattr_value)),
+		  "parent_xattr");
+	ASSERT_OK(strncmp(skel->bss->grand_parent_xattr_buf, xattr_value, strlen(xattr_value)),
+		  "grand_parent_xattr");
+
+	ASSERT_OK(strncmp(skel->bss->parent_path_buf, parent_path, strlen(parent_path)),
+		  "parent_d_path");
+	ASSERT_OK(strncmp(skel->bss->grand_parent_path_buf, grand_parent_path,
+			  strlen(grand_parent_path)),
+		  "grand_parent_d_path");
+
+cleanup:
+	path_walk__destroy(skel);
+	cleanup_files();
+}
 
 void test_path_iter(void)
 {
 	RUN_TESTS(path_iter);
+	if (test__start_subtest("path_walk_example"))
+		test_path_walk();
 }
diff --git a/tools/testing/selftests/bpf/progs/path_walk.c b/tools/testing/selftests/bpf/progs/path_walk.c
new file mode 100644
index 000000000000..1e1ae82b47a2
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/path_walk.c
@@ -0,0 +1,59 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */
+
+#include "vmlinux.h"
+#include <errno.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_core_read.h>
+#include "bpf_kfuncs.h"
+#include "bpf_misc.h"
+
+char _license[] SEC("license") = "GPL";
+
+__u32 monitored_pid;
+
+#define BUF_SIZE 1024
+char parent_path_buf[BUF_SIZE] = {};
+char parent_xattr_buf[BUF_SIZE] = {};
+char grand_parent_path_buf[BUF_SIZE] = {};
+char grand_parent_xattr_buf[BUF_SIZE] = {};
+
+static __always_inline void d_path_and_read_xattr(struct path *p, char *path, char *xattr)
+{
+	struct bpf_dynptr ptr;
+	struct dentry *dentry;
+
+	if (!p)
+		return;
+	bpf_path_d_path(p, path, BUF_SIZE);
+	bpf_dynptr_from_mem(xattr, BUF_SIZE, 0, &ptr);
+	dentry = p->dentry;
+	if (dentry)
+		bpf_get_dentry_xattr(dentry, "user.bpf.selftests", &ptr);
+}
+
+SEC("lsm.s/file_open")
+int BPF_PROG(test_file_open, struct file *f)
+{
+	__u32 pid = bpf_get_current_pid_tgid() >> 32;
+	struct bpf_iter_path path_it;
+	struct path *p;
+
+	if (pid != monitored_pid)
+		return 0;
+
+	bpf_iter_path_new(&path_it, &f->f_path, 0);
+
+	/* Get d_path and xattr for the parent directory */
+	p = bpf_iter_path_next(&path_it);
+	d_path_and_read_xattr(p, parent_path_buf, parent_xattr_buf);
+
+	/* Get d_path and xattr for the grand parent directory */
+	p = bpf_iter_path_next(&path_it);
+	d_path_and_read_xattr(p, grand_parent_path_buf, grand_parent_xattr_buf);
+
+	bpf_iter_path_destroy(&path_it);
+
+	return 0;
+}
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH v5 bpf-next 1/5] namei: Introduce new helper function path_walk_parent()
  2025-06-17  6:11 ` [PATCH v5 bpf-next 1/5] namei: Introduce new helper function path_walk_parent() Song Liu
@ 2025-06-18  1:02   ` kernel test robot
  2025-06-24 12:18   ` Jan Kara
  2025-07-04 17:40   ` Yonghong Song
  2 siblings, 0 replies; 50+ messages in thread
From: kernel test robot @ 2025-06-18  1:02 UTC (permalink / raw)
  To: Song Liu, bpf, linux-fsdevel, linux-kernel, linux-security-module
  Cc: oe-kbuild-all, kernel-team, andrii, eddyz87, ast, daniel,
	martin.lau, viro, brauner, jack, kpsingh, mattbobrowski, m, neil,
	Song Liu

Hi Song,

kernel test robot noticed the following build warnings:

[auto build test WARNING on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Song-Liu/namei-Introduce-new-helper-function-path_walk_parent/20250617-141322
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link:    https://lore.kernel.org/r/20250617061116.3681325-2-song%40kernel.org
patch subject: [PATCH v5 bpf-next 1/5] namei: Introduce new helper function path_walk_parent()
config: loongarch-randconfig-r072-20250618 (https://download.01.org/0day-ci/archive/20250618/202506180814.GoByWn1r-lkp@intel.com/config)
compiler: loongarch64-linux-gcc (GCC) 12.4.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250618/202506180814.GoByWn1r-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202506180814.GoByWn1r-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> Warning: fs/namei.c:2072 function parameter 'path' not described in '__path_walk_parent'
>> Warning: fs/namei.c:2072 function parameter 'root' not described in '__path_walk_parent'
>> Warning: fs/namei.c:2072 function parameter 'flags' not described in '__path_walk_parent'

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v5 bpf-next 0/5] bpf path iterator
  2025-06-17  6:11 [PATCH v5 bpf-next 0/5] bpf path iterator Song Liu
                   ` (4 preceding siblings ...)
  2025-06-17  6:11 ` [PATCH v5 bpf-next 5/5] selftests/bpf: Path walk test Song Liu
@ 2025-06-20 21:59 ` Song Liu
  2025-06-24 18:45   ` Mickaël Salaün
  5 siblings, 1 reply; 50+ messages in thread
From: Song Liu @ 2025-06-20 21:59 UTC (permalink / raw)
  To: bpf, linux-fsdevel, linux-kernel, linux-security-module, brauner,
	Mickaël Salaün
  Cc: kernel-team, andrii, eddyz87, ast, daniel, martin.lau, viro, jack,
	kpsingh, mattbobrowski, m, neil

Hi Christian, Mickaël, and folks,

Could you please share your comments on this version? Does this
look sane?

Thanks,
Song

On Mon, Jun 16, 2025 at 11:11 PM Song Liu <song@kernel.org> wrote:
>
> In security use cases, it is common to apply rules to VFS subtrees.
> However, filtering files in a subtree is not straightforward [1].
>
> One solution to this problem is to start from a path and walk up the VFS
> tree (towards the root). Among in-tree LSMs, Landlock uses this solution.
>

[...]

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v5 bpf-next 1/5] namei: Introduce new helper function path_walk_parent()
  2025-06-17  6:11 ` [PATCH v5 bpf-next 1/5] namei: Introduce new helper function path_walk_parent() Song Liu
  2025-06-18  1:02   ` kernel test robot
@ 2025-06-24 12:18   ` Jan Kara
  2025-06-24 17:37     ` Song Liu
  2025-07-04 17:40   ` Yonghong Song
  2 siblings, 1 reply; 50+ messages in thread
From: Jan Kara @ 2025-06-24 12:18 UTC (permalink / raw)
  To: Song Liu
  Cc: bpf, linux-fsdevel, linux-kernel, linux-security-module,
	kernel-team, andrii, eddyz87, ast, daniel, martin.lau, viro,
	brauner, jack, kpsingh, mattbobrowski, m, neil

On Mon 16-06-25 23:11:12, Song Liu wrote:
> This helper walks an input path to its parent. Logic are added to handle
> walking across mount tree.
> 
> This will be used by landlock, and BPF LSM.
> 
> Suggested-by: Neil Brown <neil@brown.name>
> Signed-off-by: Song Liu <song@kernel.org>

Looks good to me. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

One note below:

> -static struct dentry *follow_dotdot(struct nameidata *nd)
> +/**
> + * __path_walk_parent - Find the parent of the given struct path
> + * @path  - The struct path to start from
> + * @root  - A struct path which serves as a boundary not to be crosses.
> + *        - If @root is zero'ed, walk all the way to global root.
> + * @flags - Some LOOKUP_ flags.
> + *
> + * Find and return the dentry for the parent of the given path
> + * (mount/dentry). If the given path is the root of a mounted tree, it
> + * is first updated to the mount point on which that tree is mounted.
> + *
> + * If %LOOKUP_NO_XDEV is given, then *after* the path is updated to a new
> + * mount, the error EXDEV is returned.
> + *
> + * If no parent can be found, either because the tree is not mounted or
> + * because the @path matches the @root, then @path->dentry is returned
> + * unless @flags contains %LOOKUP_BENEATH, in which case -EXDEV is returned.
> + *
> + * Returns: either an ERR_PTR() or the chosen parent which will have had
> + * the refcount incremented.
> + */

The behavior with LOOKUP_NO_XDEV is kind of odd (not your fault) and
interestingly I wasn't able to find a place that would depend on the path
being updated in that case. So either I'm missing some subtle detail (quite
possible) or we can clean that up in the future.

								Honza

> +static struct dentry *__path_walk_parent(struct path *path, const struct path *root, int flags)
>  {
> -	struct dentry *parent;
> -
> -	if (path_equal(&nd->path, &nd->root))
> +	if (path_equal(path, root))
>  		goto in_root;
> -	if (unlikely(nd->path.dentry == nd->path.mnt->mnt_root)) {
> -		struct path path;
> +	if (unlikely(path->dentry == path->mnt->mnt_root)) {
> +		struct path new_path;
>  
> -		if (!choose_mountpoint(real_mount(nd->path.mnt),
> -				       &nd->root, &path))
> +		if (!choose_mountpoint(real_mount(path->mnt),
> +				       root, &new_path))
>  			goto in_root;
> -		path_put(&nd->path);
> -		nd->path = path;
> -		nd->inode = path.dentry->d_inode;
> -		if (unlikely(nd->flags & LOOKUP_NO_XDEV))
> +		path_put(path);
> +		*path = new_path;
> +		if (unlikely(flags & LOOKUP_NO_XDEV))
>  			return ERR_PTR(-EXDEV);
>  	}
>  	/* rare case of legitimate dget_parent()... */
> -	parent = dget_parent(nd->path.dentry);
> +	return dget_parent(path->dentry);
> +
> +in_root:
> +	if (unlikely(flags & LOOKUP_BENEATH))
> +		return ERR_PTR(-EXDEV);
> +	return dget(path->dentry);
> +}
> +
> +/**
> + * path_walk_parent - Walk to the parent of path
> + * @path: input and output path.
> + * @root: root of the path walk, do not go beyond this root. If @root is
> + *        zero'ed, walk all the way to real root.
> + *
> + * Given a path, find the parent path. Replace @path with the parent path.
> + * If we were already at the real root or a disconnected root, @path is
> + * not changed.
> + *
> + * Returns:
> + *  0  - if @path is updated to its parent.
> + *  <0 - if @path is already the root (real root or @root).
> + */
> +int path_walk_parent(struct path *path, const struct path *root)
> +{
> +	struct dentry *parent;
> +
> +	parent = __path_walk_parent(path, root, LOOKUP_BENEATH);
> +
> +	if (IS_ERR(parent))
> +		return PTR_ERR(parent);
> +
> +	if (parent == path->dentry) {
> +		dput(parent);
> +		return -ENOENT;
> +	}
> +	dput(path->dentry);
> +	path->dentry = parent;
> +	return 0;
> +}
> +
> +static struct dentry *follow_dotdot(struct nameidata *nd)
> +{
> +	struct dentry *parent = __path_walk_parent(&nd->path, &nd->root, nd->flags);
> +
> +	if (IS_ERR(parent))
> +		return parent;
>  	if (unlikely(!path_connected(nd->path.mnt, parent))) {
>  		dput(parent);
>  		return ERR_PTR(-ENOENT);
>  	}
> +	nd->inode = nd->path.dentry->d_inode;
>  	return parent;
> -
> -in_root:
> -	if (unlikely(nd->flags & LOOKUP_BENEATH))
> -		return ERR_PTR(-EXDEV);
> -	return dget(nd->path.dentry);
>  }
>  
>  static const char *handle_dots(struct nameidata *nd, int type)
> diff --git a/include/linux/namei.h b/include/linux/namei.h
> index 5d085428e471..ca68fa4089e0 100644
> --- a/include/linux/namei.h
> +++ b/include/linux/namei.h
> @@ -85,6 +85,8 @@ extern int follow_down_one(struct path *);
>  extern int follow_down(struct path *path, unsigned int flags);
>  extern int follow_up(struct path *);
>  
> +int path_walk_parent(struct path *path, const struct path *root);
> +
>  extern struct dentry *lock_rename(struct dentry *, struct dentry *);
>  extern struct dentry *lock_rename_child(struct dentry *, struct dentry *);
>  extern void unlock_rename(struct dentry *, struct dentry *);
> -- 
> 2.47.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v5 bpf-next 1/5] namei: Introduce new helper function path_walk_parent()
  2025-06-24 12:18   ` Jan Kara
@ 2025-06-24 17:37     ` Song Liu
  2025-06-25 10:30       ` Jan Kara
  0 siblings, 1 reply; 50+ messages in thread
From: Song Liu @ 2025-06-24 17:37 UTC (permalink / raw)
  To: Jan Kara
  Cc: bpf, linux-fsdevel, linux-kernel, linux-security-module,
	kernel-team, andrii, eddyz87, ast, daniel, martin.lau, viro,
	brauner, kpsingh, mattbobrowski, m, neil

Hi Jan,

On Tue, Jun 24, 2025 at 5:18 AM Jan Kara <jack@suse.cz> wrote:
>
> On Mon 16-06-25 23:11:12, Song Liu wrote:
> > This helper walks an input path to its parent. Logic are added to handle
> > walking across mount tree.
> >
> > This will be used by landlock, and BPF LSM.
> >
> > Suggested-by: Neil Brown <neil@brown.name>
> > Signed-off-by: Song Liu <song@kernel.org>
>
> Looks good to me. Feel free to add:
>
> Reviewed-by: Jan Kara <jack@suse.cz>

Thanks for the review!

[...]

> > + *
> > + * Returns: either an ERR_PTR() or the chosen parent which will have had
> > + * the refcount incremented.
> > + */
>
> The behavior with LOOKUP_NO_XDEV is kind of odd (not your fault) and
> interestingly I wasn't able to find a place that would depend on the path
> being updated in that case. So either I'm missing some subtle detail (quite
> possible) or we can clean that up in the future.

We have RESOLVE_NO_XDEV in uapi/linux/openat2.h, so I guess we
cannot really remove it?

Thanks,
Song

[...]

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v5 bpf-next 0/5] bpf path iterator
  2025-06-20 21:59 ` [PATCH v5 bpf-next 0/5] bpf path iterator Song Liu
@ 2025-06-24 18:45   ` Mickaël Salaün
  2025-06-24 21:38     ` NeilBrown
  2025-07-03  5:04     ` Song Liu
  0 siblings, 2 replies; 50+ messages in thread
From: Mickaël Salaün @ 2025-06-24 18:45 UTC (permalink / raw)
  To: Song Liu
  Cc: bpf, linux-fsdevel, linux-kernel, linux-security-module, brauner,
	kernel-team, andrii, eddyz87, ast, daniel, martin.lau, viro, jack,
	kpsingh, mattbobrowski, m, neil, Günther Noack

On Fri, Jun 20, 2025 at 02:59:17PM -0700, Song Liu wrote:
> Hi Christian, Mickaël, and folks,
> 
> Could you please share your comments on this version? Does this
> look sane?

This looks good to me but we need to know what is the acceptable next
step to support RCU.  If we can go with another _rcu helper, I'm good
with the current approach, otherwise we need to figure out a way to
leverage the current helper to make it compatible with callers being in
a RCU read-side critical section while leveraging safe path walk (i.e.
several calls to path_walk_parent).

> 
> Thanks,
> Song
> 
> On Mon, Jun 16, 2025 at 11:11 PM Song Liu <song@kernel.org> wrote:
> >
> > In security use cases, it is common to apply rules to VFS subtrees.
> > However, filtering files in a subtree is not straightforward [1].
> >
> > One solution to this problem is to start from a path and walk up the VFS
> > tree (towards the root). Among in-tree LSMs, Landlock uses this solution.
> >
> 
> [...]
> 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v5 bpf-next 0/5] bpf path iterator
  2025-06-24 18:45   ` Mickaël Salaün
@ 2025-06-24 21:38     ` NeilBrown
  2025-06-25 13:14       ` Mickaël Salaün
  2025-07-03  5:04     ` Song Liu
  1 sibling, 1 reply; 50+ messages in thread
From: NeilBrown @ 2025-06-24 21:38 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Song Liu, bpf, linux-fsdevel, linux-kernel, linux-security-module,
	brauner, kernel-team, andrii, eddyz87, ast, daniel, martin.lau,
	viro, jack, kpsingh, mattbobrowski, m, Günther Noack

On Wed, 25 Jun 2025, Mickaël Salaün wrote:
> On Fri, Jun 20, 2025 at 02:59:17PM -0700, Song Liu wrote:
> > Hi Christian, Mickaël, and folks,
> > 
> > Could you please share your comments on this version? Does this
> > look sane?
> 
> This looks good to me but we need to know what is the acceptable next
> step to support RCU.  If we can go with another _rcu helper, I'm good
> with the current approach, otherwise we need to figure out a way to
> leverage the current helper to make it compatible with callers being in
> a RCU read-side critical section while leveraging safe path walk (i.e.
> several calls to path_walk_parent).

Can you spell out the minimum that you need?

My vague impression is that you want to search up from a given strut path,
no further then some other given path, looking for a dentry that matches
some rule.  Is that correct?

In general, the original dentry could be moved away from under the
dentry you find moments after the match is reported.  What mechanisms do
you have in place to ensure this doesn't happen, or that it doesn't
matter?

Would it be sufficient to have an iterator which reported successive
ancestors in turn, or reported that you need to restart because something
changed?  Would you need to know that a restart happened or would it be
acceptable to transparently start again at the parent of the starting
point?

Or do you really need a "one step at a time" interface?

Do you need more complex movements around the tree, or is just walking
up sufficient?

If this has been discussed or documented elsewhere I'd be happy for you
just to provide a reference, and I can come back with follow-up
questions if needed.

Thanks,
NeilBrown

> 
> > 
> > Thanks,
> > Song
> > 
> > On Mon, Jun 16, 2025 at 11:11 PM Song Liu <song@kernel.org> wrote:
> > >
> > > In security use cases, it is common to apply rules to VFS subtrees.
> > > However, filtering files in a subtree is not straightforward [1].
> > >
> > > One solution to this problem is to start from a path and walk up the VFS
> > > tree (towards the root). Among in-tree LSMs, Landlock uses this solution.
> > >
> > 
> > [...]
> > 
> 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v5 bpf-next 1/5] namei: Introduce new helper function path_walk_parent()
  2025-06-24 17:37     ` Song Liu
@ 2025-06-25 10:30       ` Jan Kara
  0 siblings, 0 replies; 50+ messages in thread
From: Jan Kara @ 2025-06-25 10:30 UTC (permalink / raw)
  To: Song Liu
  Cc: Jan Kara, bpf, linux-fsdevel, linux-kernel, linux-security-module,
	kernel-team, andrii, eddyz87, ast, daniel, martin.lau, viro,
	brauner, kpsingh, mattbobrowski, m, neil

On Tue 24-06-25 10:37:36, Song Liu wrote:
> On Tue, Jun 24, 2025 at 5:18 AM Jan Kara <jack@suse.cz> wrote:
> > > + *
> > > + * Returns: either an ERR_PTR() or the chosen parent which will have had
> > > + * the refcount incremented.
> > > + */
> >
> > The behavior with LOOKUP_NO_XDEV is kind of odd (not your fault) and
> > interestingly I wasn't able to find a place that would depend on the path
> > being updated in that case. So either I'm missing some subtle detail (quite
> > possible) or we can clean that up in the future.
> 
> We have RESOLVE_NO_XDEV in uapi/linux/openat2.h, so I guess we
> cannot really remove it?

I didn't mean to remove the LOOKUP_NO_XDEV flag, I meant to not update the
passed path if LOOKUP_NO_XDEV is set, we are crossing the mountpoint and
thus returning -EXDEV. As far as I've checked once we return error,
everybody just path_put()s the nd->path so its update is just pointless.
But there are many (indirect) callers so I might have missed some case.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v5 bpf-next 0/5] bpf path iterator
  2025-06-24 21:38     ` NeilBrown
@ 2025-06-25 13:14       ` Mickaël Salaün
  2025-06-25 23:04         ` NeilBrown
  0 siblings, 1 reply; 50+ messages in thread
From: Mickaël Salaün @ 2025-06-25 13:14 UTC (permalink / raw)
  To: NeilBrown
  Cc: Song Liu, bpf, linux-fsdevel, linux-kernel, linux-security-module,
	brauner, kernel-team, andrii, eddyz87, ast, daniel, martin.lau,
	viro, jack, kpsingh, mattbobrowski, Tingmao Wang,
	Günther Noack

On Wed, Jun 25, 2025 at 07:38:53AM +1000, NeilBrown wrote:
> On Wed, 25 Jun 2025, Mickaël Salaün wrote:
> > On Fri, Jun 20, 2025 at 02:59:17PM -0700, Song Liu wrote:
> > > Hi Christian, Mickaël, and folks,
> > > 
> > > Could you please share your comments on this version? Does this
> > > look sane?
> > 
> > This looks good to me but we need to know what is the acceptable next
> > step to support RCU.  If we can go with another _rcu helper, I'm good
> > with the current approach, otherwise we need to figure out a way to
> > leverage the current helper to make it compatible with callers being in
> > a RCU read-side critical section while leveraging safe path walk (i.e.
> > several calls to path_walk_parent).
> 
> Can you spell out the minimum that you need?

Sure.  We'd like to call this new helper in a RCU
read-side critical section and leverage this capability to speed up path
walk when there is no concurrent hierarchy modification.  This use case
is similar to handle_dots() with LOOKUP_RCU calling follow_dotdot_rcu().

The main issue with this approach is to keep some state of the path walk
to know if the next call to "path_walk_parent_rcu()" would be valid
(i.e. something like a very light version of nameidata, mainly sequence
integers), and to get back to the non-RCU version otherwise.

> 
> My vague impression is that you want to search up from a given strut path,
> no further then some other given path, looking for a dentry that matches
> some rule.  Is that correct?

Yes

> 
> In general, the original dentry could be moved away from under the
> dentry you find moments after the match is reported.  What mechanisms do
> you have in place to ensure this doesn't happen, or that it doesn't
> matter?

In the case of Landlock, by default, a set of access rights are denied
and can only be allowed by an element in the file hierarchy.  The goal
is to only allow access to files under a specific directory (or directly
a specific file).  That's why we only care of the file hierarchy at the
time of access check.  It's not an issue if the file/directory was
moved or is being moved as long as we can walk its "current" hierarchy.
Furthermore, a sandboxed process is restricted from doing arbitrary
mounts (and renames/links are controlled with the
LANDLOCK_ACCESS_FS_REFER right).

However, we need to get a valid "snapshot" of the set of dentries that
(could) lead to the evaluated file/directory.

> 
> Would it be sufficient to have an iterator which reported successive
> ancestors in turn, or reported that you need to restart because something
> changed?  Would you need to know that a restart happened or would it be
> acceptable to transparently start again at the parent of the starting
> point?

If the path walk is being invalidated, we need to reset the collected
access right and start again the path walk to get all the access rights
from a consistent/real file hierarchy.

> 
> Or do you really need a "one step at a time" interface?

We need to check each component of the path walk, so either we call an
helper to get each of them and we do our check after that (we should be
able to do that in RCU), or we provide a callback function which is
called by the path walk helper.

> 
> Do you need more complex movements around the tree, or is just walking
> up sufficient?

Just walking up.

> 
> If this has been discussed or documented elsewhere I'd be happy for you
> just to provide a reference, and I can come back with follow-up
> questions if needed.

Tingmao initially described the goal here:
https://lore.kernel.org/all/afe77383-fe56-4029-848e-1401e3297139@maowtm.org/

and she sent an RFC to illustrate that:
https://lore.kernel.org/all/cover.1748997840.git.m@maowtm.org/

The discussion mainly raised two questions:
- Should we have one or two APIs?
- How to store the state of the walk without exposing VFS internals to
  the rest of the kernel?

Thanks

> 
> Thanks,
> NeilBrown
> 
> 
> > 
> > > 
> > > Thanks,
> > > Song
> > > 
> > > On Mon, Jun 16, 2025 at 11:11 PM Song Liu <song@kernel.org> wrote:
> > > >
> > > > In security use cases, it is common to apply rules to VFS subtrees.
> > > > However, filtering files in a subtree is not straightforward [1].
> > > >
> > > > One solution to this problem is to start from a path and walk up the VFS
> > > > tree (towards the root). Among in-tree LSMs, Landlock uses this solution.
> > > >
> > > 
> > > [...]
> > > 
> > 
> 
> 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v5 bpf-next 0/5] bpf path iterator
  2025-06-25 13:14       ` Mickaël Salaün
@ 2025-06-25 23:04         ` NeilBrown
  2025-06-25 23:17           ` Song Liu
  2025-06-26  0:07           ` Tingmao Wang
  0 siblings, 2 replies; 50+ messages in thread
From: NeilBrown @ 2025-06-25 23:04 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Song Liu, bpf, linux-fsdevel, linux-kernel, linux-security-module,
	brauner, kernel-team, andrii, eddyz87, ast, daniel, martin.lau,
	viro, jack, kpsingh, mattbobrowski, Tingmao Wang,
	Günther Noack

On Wed, 25 Jun 2025, Mickaël Salaün wrote:
> On Wed, Jun 25, 2025 at 07:38:53AM +1000, NeilBrown wrote:
> > 
> > Can you spell out the minimum that you need?
> 
> Sure.  We'd like to call this new helper in a RCU
> read-side critical section and leverage this capability to speed up path
> walk when there is no concurrent hierarchy modification.  This use case
> is similar to handle_dots() with LOOKUP_RCU calling follow_dotdot_rcu().
> 
> The main issue with this approach is to keep some state of the path walk
> to know if the next call to "path_walk_parent_rcu()" would be valid
> (i.e. something like a very light version of nameidata, mainly sequence
> integers), and to get back to the non-RCU version otherwise.
> 
> > 
> > My vague impression is that you want to search up from a given strut path,
> > no further then some other given path, looking for a dentry that matches
> > some rule.  Is that correct?
> 
> Yes
> 
> > 
> > In general, the original dentry could be moved away from under the
> > dentry you find moments after the match is reported.  What mechanisms do
> > you have in place to ensure this doesn't happen, or that it doesn't
> > matter?
> 
> In the case of Landlock, by default, a set of access rights are denied
> and can only be allowed by an element in the file hierarchy.  The goal
> is to only allow access to files under a specific directory (or directly
> a specific file).  That's why we only care of the file hierarchy at the
> time of access check.  It's not an issue if the file/directory was
> moved or is being moved as long as we can walk its "current" hierarchy.
> Furthermore, a sandboxed process is restricted from doing arbitrary
> mounts (and renames/links are controlled with the
> LANDLOCK_ACCESS_FS_REFER right).
> 
> However, we need to get a valid "snapshot" of the set of dentries that
> (could) lead to the evaluated file/directory.

A "snapshot" is an interesting idea - though looking at the landlock
code you one need inodes, not dentries.
I imagine an interface where you give it a starting path, a root, and
and array of inode pointers, and it fills in the pointers with the path
- all under rcu so no references are needed.
But you would need some fallback if the array isn't big enough, so maybe
that isn't a good idea.

Based on the comments by Al and Christian, I think the only viable
approach is to pass a callback to some vfs function that does the
walking.

   vfs_walk_ancestors(struct path *path, struct path *root,
		      int (*walk_cb)(struct path *ancestor, void *data),
		      void *data)

where walk_cb() returns a negative number if it wants to abort, and is
given a NULL ancestor if vfs_walk_ancestors() needed to restart.

vfs_walk_ancestors() would initialise a "struct nameidata" and
effectively call handle_dots(&nd, LAST_DOTDOT) repeatedly, calling
    walk_cb(&nd.path, data)
each time.

How would you feel about that sort of interface?

NeilBrown

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v5 bpf-next 0/5] bpf path iterator
  2025-06-25 23:04         ` NeilBrown
@ 2025-06-25 23:17           ` Song Liu
  2025-06-26  0:07           ` Tingmao Wang
  1 sibling, 0 replies; 50+ messages in thread
From: Song Liu @ 2025-06-25 23:17 UTC (permalink / raw)
  To: NeilBrown
  Cc: Mickaël Salaün, bpf, linux-fsdevel, linux-kernel,
	linux-security-module, brauner, kernel-team, andrii, eddyz87, ast,
	daniel, martin.lau, viro, jack, kpsingh, mattbobrowski,
	Tingmao Wang, Günther Noack

On Wed, Jun 25, 2025 at 4:05 PM NeilBrown <neil@brown.name> wrote:
>
> On Wed, 25 Jun 2025, Mickaël Salaün wrote:
> > On Wed, Jun 25, 2025 at 07:38:53AM +1000, NeilBrown wrote:
> > >
> > > Can you spell out the minimum that you need?
> >
> > Sure.  We'd like to call this new helper in a RCU
> > read-side critical section and leverage this capability to speed up path
> > walk when there is no concurrent hierarchy modification.  This use case
> > is similar to handle_dots() with LOOKUP_RCU calling follow_dotdot_rcu().
> >
> > The main issue with this approach is to keep some state of the path walk
> > to know if the next call to "path_walk_parent_rcu()" would be valid
> > (i.e. something like a very light version of nameidata, mainly sequence
> > integers), and to get back to the non-RCU version otherwise.
> >
> > >
> > > My vague impression is that you want to search up from a given strut path,
> > > no further then some other given path, looking for a dentry that matches
> > > some rule.  Is that correct?
> >
> > Yes
> >
> > >
> > > In general, the original dentry could be moved away from under the
> > > dentry you find moments after the match is reported.  What mechanisms do
> > > you have in place to ensure this doesn't happen, or that it doesn't
> > > matter?
> >
> > In the case of Landlock, by default, a set of access rights are denied
> > and can only be allowed by an element in the file hierarchy.  The goal
> > is to only allow access to files under a specific directory (or directly
> > a specific file).  That's why we only care of the file hierarchy at the
> > time of access check.  It's not an issue if the file/directory was
> > moved or is being moved as long as we can walk its "current" hierarchy.
> > Furthermore, a sandboxed process is restricted from doing arbitrary
> > mounts (and renames/links are controlled with the
> > LANDLOCK_ACCESS_FS_REFER right).
> >
> > However, we need to get a valid "snapshot" of the set of dentries that
> > (could) lead to the evaluated file/directory.
>
> A "snapshot" is an interesting idea - though looking at the landlock
> code you one need inodes, not dentries.
> I imagine an interface where you give it a starting path, a root, and
> and array of inode pointers, and it fills in the pointers with the path
> - all under rcu so no references are needed.
> But you would need some fallback if the array isn't big enough, so maybe
> that isn't a good idea.
>
> Based on the comments by Al and Christian, I think the only viable
> approach is to pass a callback to some vfs function that does the
> walking.
>
>    vfs_walk_ancestors(struct path *path, struct path *root,
>                       int (*walk_cb)(struct path *ancestor, void *data),
>                       void *data)

I like this idea.

Maybe we want "struct path *ancestor" of walk_cb to be const.
walk_cb should only change "data", so that we can undo all the
changes when the rcu walk fails.

Thanks,
Song

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v5 bpf-next 0/5] bpf path iterator
  2025-06-25 23:04         ` NeilBrown
  2025-06-25 23:17           ` Song Liu
@ 2025-06-26  0:07           ` Tingmao Wang
  2025-06-26  1:05             ` NeilBrown
  1 sibling, 1 reply; 50+ messages in thread
From: Tingmao Wang @ 2025-06-26  0:07 UTC (permalink / raw)
  To: NeilBrown, Mickaël Salaün
  Cc: Song Liu, bpf, linux-fsdevel, linux-kernel, linux-security-module,
	brauner, kernel-team, andrii, eddyz87, ast, daniel, martin.lau,
	viro, jack, kpsingh, mattbobrowski, Günther Noack

On 6/26/25 00:04, NeilBrown wrote:
> On Wed, 25 Jun 2025, Mickaël Salaün wrote:
>> On Wed, Jun 25, 2025 at 07:38:53AM +1000, NeilBrown wrote:
>>>
>>> Can you spell out the minimum that you need?
>>
>> Sure.  We'd like to call this new helper in a RCU
>> read-side critical section and leverage this capability to speed up path
>> walk when there is no concurrent hierarchy modification.  This use case
>> is similar to handle_dots() with LOOKUP_RCU calling follow_dotdot_rcu().
>>
>> The main issue with this approach is to keep some state of the path walk
>> to know if the next call to "path_walk_parent_rcu()" would be valid
>> (i.e. something like a very light version of nameidata, mainly sequence
>> integers), and to get back to the non-RCU version otherwise.
>>
>>>
>>> My vague impression is that you want to search up from a given strut path,
>>> no further then some other given path, looking for a dentry that matches
>>> some rule.  Is that correct?
>>
>> Yes
>>
>>>
>>> In general, the original dentry could be moved away from under the
>>> dentry you find moments after the match is reported.  What mechanisms do
>>> you have in place to ensure this doesn't happen, or that it doesn't
>>> matter?
>>
>> In the case of Landlock, by default, a set of access rights are denied
>> and can only be allowed by an element in the file hierarchy.  The goal
>> is to only allow access to files under a specific directory (or directly
>> a specific file).  That's why we only care of the file hierarchy at the
>> time of access check.  It's not an issue if the file/directory was
>> moved or is being moved as long as we can walk its "current" hierarchy.
>> Furthermore, a sandboxed process is restricted from doing arbitrary
>> mounts (and renames/links are controlled with the
>> LANDLOCK_ACCESS_FS_REFER right).
>>
>> However, we need to get a valid "snapshot" of the set of dentries that
>> (could) lead to the evaluated file/directory.
> 
> A "snapshot" is an interesting idea - though looking at the landlock
> code you one need inodes, not dentries.
> I imagine an interface where you give it a starting path, a root, and
> and array of inode pointers, and it fills in the pointers with the path
> - all under rcu so no references are needed.
> But you would need some fallback if the array isn't big enough, so maybe
> that isn't a good idea.
> 
> Based on the comments by Al and Christian, I think the only viable
> approach is to pass a callback to some vfs function that does the
> walking.
> 
>    vfs_walk_ancestors(struct path *path, struct path *root,
> 		      int (*walk_cb)(struct path *ancestor, void *data),
> 		      void *data)
> 
> where walk_cb() returns a negative number if it wants to abort, and is
> given a NULL ancestor if vfs_walk_ancestors() needed to restart.
> 
> vfs_walk_ancestors() would initialise a "struct nameidata" and
> effectively call handle_dots(&nd, LAST_DOTDOT) repeatedly, calling
>     walk_cb(&nd.path, data)
> each time.

handle_dots semantically does more than dget_parent + choose_mountpoint
tho (which is what Landlock currently does, and is also what Song's
iterator will do).  There is the step_into which will step into
mountpoints (there is also code to handle symlinks, although I'm not sure
if that's relevant for following ".."), and it will also return ENOENT if
the path is disconnected.

Also I guess we might not need to have an entire nameidata?  In theory it
only needs to do what follow_dotdot_rcu does without the path_connected
check.  So it seems like given we have path and root as function argument,
it would only need nd->{seq,m_seq}.

I might be wrong tho, but certainly the behaviour is different.

> 
> How would you feel about that sort of interface?

I can't speak for Mickaël, but a callback-based interface is less flexible
(and _maybe_ less performant?).  Also, probably we will want to fallback
to a reference-taking walk if the walk fails (rather than, say, retry
infinitely), and this should probably use Song's proposed iterator.  I'm
not sure if Song would be keen to rewrite this iterator patch series in
callback style (to be clear, it doesn't necessarily seem like a good idea
to me, and I'm not asking him to), which means that we will end up with
the reference walk API being a "call this function repeatedly", and the
rcu walk API taking a callback.  I think it is still workable (after all,
if Landlock wants to reuse the code in the callback it can just call the
callback function itself when doing the reference walk), but it seems a
bit "ugly" to me.

But this is just my opinion, and if there is a stronger desire to not
expose any VFS seqcount integers then maybe we will just need to work with
a callback.

Quick note in case anyone reading this has not seen it, a while ago I made
a POC of a non-callback style API for rcu parent walk based on Song's
series:
https://lore.kernel.org/all/dbc7ee0f1f483b7bc2ec9757672a38d99015e9ae.1749402769@maowtm.org/#iZ31fs:namei.c

> 
> NeilBrown
> 


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v5 bpf-next 0/5] bpf path iterator
  2025-06-26  0:07           ` Tingmao Wang
@ 2025-06-26  1:05             ` NeilBrown
  2025-06-26  5:52               ` Song Liu
  2025-07-07 10:43               ` Christian Brauner
  0 siblings, 2 replies; 50+ messages in thread
From: NeilBrown @ 2025-06-26  1:05 UTC (permalink / raw)
  To: Tingmao Wang
  Cc: Mickaël Salaün, Song Liu, bpf, linux-fsdevel,
	linux-kernel, linux-security-module, brauner, kernel-team, andrii,
	eddyz87, ast, daniel, martin.lau, viro, jack, kpsingh,
	mattbobrowski, Günther Noack

On Thu, 26 Jun 2025, Tingmao Wang wrote:
> On 6/26/25 00:04, NeilBrown wrote:
> > On Wed, 25 Jun 2025, Mickaël Salaün wrote:
> >> On Wed, Jun 25, 2025 at 07:38:53AM +1000, NeilBrown wrote:
> >>>
> >>> Can you spell out the minimum that you need?
> >>
> >> Sure.  We'd like to call this new helper in a RCU
> >> read-side critical section and leverage this capability to speed up path
> >> walk when there is no concurrent hierarchy modification.  This use case
> >> is similar to handle_dots() with LOOKUP_RCU calling follow_dotdot_rcu().
> >>
> >> The main issue with this approach is to keep some state of the path walk
> >> to know if the next call to "path_walk_parent_rcu()" would be valid
> >> (i.e. something like a very light version of nameidata, mainly sequence
> >> integers), and to get back to the non-RCU version otherwise.
> >>
> >>>
> >>> My vague impression is that you want to search up from a given strut path,
> >>> no further then some other given path, looking for a dentry that matches
> >>> some rule.  Is that correct?
> >>
> >> Yes
> >>
> >>>
> >>> In general, the original dentry could be moved away from under the
> >>> dentry you find moments after the match is reported.  What mechanisms do
> >>> you have in place to ensure this doesn't happen, or that it doesn't
> >>> matter?
> >>
> >> In the case of Landlock, by default, a set of access rights are denied
> >> and can only be allowed by an element in the file hierarchy.  The goal
> >> is to only allow access to files under a specific directory (or directly
> >> a specific file).  That's why we only care of the file hierarchy at the
> >> time of access check.  It's not an issue if the file/directory was
> >> moved or is being moved as long as we can walk its "current" hierarchy.
> >> Furthermore, a sandboxed process is restricted from doing arbitrary
> >> mounts (and renames/links are controlled with the
> >> LANDLOCK_ACCESS_FS_REFER right).
> >>
> >> However, we need to get a valid "snapshot" of the set of dentries that
> >> (could) lead to the evaluated file/directory.
> > 
> > A "snapshot" is an interesting idea - though looking at the landlock
> > code you one need inodes, not dentries.
> > I imagine an interface where you give it a starting path, a root, and
> > and array of inode pointers, and it fills in the pointers with the path
> > - all under rcu so no references are needed.
> > But you would need some fallback if the array isn't big enough, so maybe
> > that isn't a good idea.
> > 
> > Based on the comments by Al and Christian, I think the only viable
> > approach is to pass a callback to some vfs function that does the
> > walking.
> > 
> >    vfs_walk_ancestors(struct path *path, struct path *root,
> > 		      int (*walk_cb)(struct path *ancestor, void *data),
> > 		      void *data)
> > 
> > where walk_cb() returns a negative number if it wants to abort, and is
> > given a NULL ancestor if vfs_walk_ancestors() needed to restart.
> > 
> > vfs_walk_ancestors() would initialise a "struct nameidata" and
> > effectively call handle_dots(&nd, LAST_DOTDOT) repeatedly, calling
> >     walk_cb(&nd.path, data)
> > each time.
> 
> handle_dots semantically does more than dget_parent + choose_mountpoint
> tho (which is what Landlock currently does, and is also what Song's
> iterator will do).  There is the step_into which will step into
> mountpoints (there is also code to handle symlinks, although I'm not sure
> if that's relevant for following ".."), and it will also return ENOENT if
> the path is disconnected.

Is any of this a problem for you?

> 
> Also I guess we might not need to have an entire nameidata?  In theory it
> only needs to do what follow_dotdot_rcu does without the path_connected
> check.  So it seems like given we have path and root as function argument,
> it would only need nd->{seq,m_seq}.

Those are implementation details internal to namei.c.  Certainly this
function wouldn't use all of the fields in nameidata, but it doesn't
hurt to have a few fields in a struct on the stack which don't get used.
Keeping the code simple and uniform is much more important.  Using
nameidata would help achieve that.

> 
> I might be wrong tho, but certainly the behaviour is different.

Here we get back to the question of "precisely what behaviour do you
need?".
"The same as what a previous patch did" is not a reasonable answer.

If, from userspace, you repeatedly did chdir("..") and then examined the
current directory you would get exactly the sequence of directories
provided by repeatedly calling handle_dots(..., LAST_DOTDOT).  If there
is some circumstance where that would be not acceptable for your use
case, you need to explain (and we need to document) what differences you
need and why use need it.

> 
> > 
> > How would you feel about that sort of interface?
> 
> I can't speak for Mickaël, but a callback-based interface is less flexible
> (and _maybe_ less performant?).  Also, probably we will want to fallback
> to a reference-taking walk if the walk fails (rather than, say, retry
> infinitely), and this should probably use Song's proposed iterator.  I'm
> not sure if Song would be keen to rewrite this iterator patch series in
> callback style (to be clear, it doesn't necessarily seem like a good idea
> to me, and I'm not asking him to), which means that we will end up with
> the reference walk API being a "call this function repeatedly", and the
> rcu walk API taking a callback.  I think it is still workable (after all,
> if Landlock wants to reuse the code in the callback it can just call the
> callback function itself when doing the reference walk), but it seems a
> bit "ugly" to me.

call-back can have a performance impact (less opportunity for compiler
optimisation and CPU speculation), though less than taking spinlock and
references.  However Al and Christian have drawn a hard line against
making seq numbers visible outside VFS code so I think it is the
approach most likely to be accepted.

Certainly vfs_walk_ancestors() would fallback to ref-walk if rcu-walk
resulted in -ECHILD - just like all other path walking code in namei.c.
This would be largely transparent to the caller - the caller would only
see that the callback received a NULL path indicating a restart.  It
wouldn't need to know why.

NeilBrown

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v5 bpf-next 0/5] bpf path iterator
  2025-06-26  1:05             ` NeilBrown
@ 2025-06-26  5:52               ` Song Liu
  2025-06-26  9:43                 ` Mickaël Salaün
  2025-06-26 10:22                 ` NeilBrown
  2025-07-07 10:43               ` Christian Brauner
  1 sibling, 2 replies; 50+ messages in thread
From: Song Liu @ 2025-06-26  5:52 UTC (permalink / raw)
  To: NeilBrown
  Cc: Tingmao Wang, Mickaël Salaün, Song Liu,
	bpf@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, brauner@kernel.org,
	Kernel Team, andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, jack@suse.cz, kpsingh@kernel.org,
	mattbobrowski@google.com, Günther Noack



> On Jun 25, 2025, at 6:05 PM, NeilBrown <neil@brown.name> wrote:

[...]

>> 
>> I can't speak for Mickaël, but a callback-based interface is less flexible
>> (and _maybe_ less performant?).  Also, probably we will want to fallback
>> to a reference-taking walk if the walk fails (rather than, say, retry
>> infinitely), and this should probably use Song's proposed iterator.  I'm
>> not sure if Song would be keen to rewrite this iterator patch series in
>> callback style (to be clear, it doesn't necessarily seem like a good idea
>> to me, and I'm not asking him to), which means that we will end up with
>> the reference walk API being a "call this function repeatedly", and the
>> rcu walk API taking a callback.  I think it is still workable (after all,
>> if Landlock wants to reuse the code in the callback it can just call the
>> callback function itself when doing the reference walk), but it seems a
>> bit "ugly" to me.
> 
> call-back can have a performance impact (less opportunity for compiler
> optimisation and CPU speculation), though less than taking spinlock and
> references.  However Al and Christian have drawn a hard line against
> making seq numbers visible outside VFS code so I think it is the
> approach most likely to be accepted.
> 
> Certainly vfs_walk_ancestors() would fallback to ref-walk if rcu-walk
> resulted in -ECHILD - just like all other path walking code in namei.c.
> This would be largely transparent to the caller - the caller would only
> see that the callback received a NULL path indicating a restart.  It
> wouldn't need to know why.

I guess I misunderstood the proposal of vfs_walk_ancestors() 
initially, so some clarification:

I think vfs_walk_ancestors() is good for the rcu-walk, and some 
rcu-then-ref-walk. However, I don’t think it fits all use cases. 
A reliable step-by-step ref-walk, like this set, works well with 
BPF, and we want to keep it. 

Can we ship this set as-is (or after fixing the comment reported
by kernel test robot)? I really don’t think we need figure out 
all details about the rcu-walk here. 

Once this is landed, we can try implementing the rcu-walk
(vfs_walk_ancestors or some variation). If no one volunteers, I
can give it a try. 

Thanks,
Song


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v5 bpf-next 0/5] bpf path iterator
  2025-06-26  5:52               ` Song Liu
@ 2025-06-26  9:43                 ` Mickaël Salaün
  2025-06-26 14:49                   ` Song Liu
  2025-06-26 10:22                 ` NeilBrown
  1 sibling, 1 reply; 50+ messages in thread
From: Mickaël Salaün @ 2025-06-26  9:43 UTC (permalink / raw)
  To: Song Liu
  Cc: NeilBrown, Tingmao Wang, Song Liu, bpf@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, brauner@kernel.org,
	Kernel Team, andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, jack@suse.cz, kpsingh@kernel.org,
	mattbobrowski@google.com, Günther Noack

On Thu, Jun 26, 2025 at 05:52:50AM +0000, Song Liu wrote:
> 
> 
> > On Jun 25, 2025, at 6:05 PM, NeilBrown <neil@brown.name> wrote:
> 
> [...]
> 
> >> 
> >> I can't speak for Mickaël, but a callback-based interface is less flexible
> >> (and _maybe_ less performant?).  Also, probably we will want to fallback
> >> to a reference-taking walk if the walk fails (rather than, say, retry
> >> infinitely), and this should probably use Song's proposed iterator.  I'm
> >> not sure if Song would be keen to rewrite this iterator patch series in
> >> callback style (to be clear, it doesn't necessarily seem like a good idea
> >> to me, and I'm not asking him to), which means that we will end up with
> >> the reference walk API being a "call this function repeatedly", and the
> >> rcu walk API taking a callback.  I think it is still workable (after all,
> >> if Landlock wants to reuse the code in the callback it can just call the
> >> callback function itself when doing the reference walk), but it seems a
> >> bit "ugly" to me.
> > 
> > call-back can have a performance impact (less opportunity for compiler
> > optimisation and CPU speculation), though less than taking spinlock and
> > references.  However Al and Christian have drawn a hard line against
> > making seq numbers visible outside VFS code so I think it is the
> > approach most likely to be accepted.
> > 
> > Certainly vfs_walk_ancestors() would fallback to ref-walk if rcu-walk
> > resulted in -ECHILD - just like all other path walking code in namei.c.
> > This would be largely transparent to the caller - the caller would only
> > see that the callback received a NULL path indicating a restart.  It
> > wouldn't need to know why.

Given the constraints this looks good to me.  Here is an updated API
with two extra consts, an updated walk_cb() signature, and a new
"flags" and without @root:

int vfs_walk_ancestors(struct path *path,
                       bool (*walk_cb)(const struct path *ancestor, void *data),
                       void *data, int flags)

The walk continue while walk_cb() returns true.  walk_cb() can then
check if @ancestor is equal to a @root, or other properties.  The
walk_cb() return value (if not bool) should not be returned by
vfs_walk_ancestors() because a walk stop doesn't mean an error.

@path would be updated with latest ancestor path (e.g. @root).
@flags could contain LOOKUP_RCU or not, which enables us to have
walk_cb() not-RCU compatible.

When passing LOOKUP_RCU, if the first call to vfs_walk_ancestors()
failed with -ECHILD, the caller can restart the walk by calling
vfs_walk_ancestors() again but without LOOKUP_RCU.

> 
> I guess I misunderstood the proposal of vfs_walk_ancestors() 
> initially, so some clarification:
> 
> I think vfs_walk_ancestors() is good for the rcu-walk, and some 
> rcu-then-ref-walk. However, I don’t think it fits all use cases. 
> A reliable step-by-step ref-walk, like this set, works well with 
> BPF, and we want to keep it. 

The above updated API should work for both use cases: if the caller wants
to walk only one level, walk_cb() can just always return false (and
potentially save that it was called) and then stop the walk the first
time it is called.  This makes it possible to write an eBPF helper with
the same API as path_walk_parent(), while making the kernel API more
flexible.

> 
> Can we ship this set as-is (or after fixing the comment reported
> by kernel test robot)? I really don’t think we need figure out 
> all details about the rcu-walk here. 
> 
> Once this is landed, we can try implementing the rcu-walk
> (vfs_walk_ancestors or some variation). If no one volunteers, I
> can give it a try. 

My understanding is that Christian only wants one helper (that should
handle both use cases).  I think this updated API should be good enough
for everyone.  Most of your code should stay the same.  What do you
think?

> 
> Thanks,
> Song
> 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v5 bpf-next 0/5] bpf path iterator
  2025-06-26  5:52               ` Song Liu
  2025-06-26  9:43                 ` Mickaël Salaün
@ 2025-06-26 10:22                 ` NeilBrown
  2025-06-26 14:28                   ` Song Liu
  1 sibling, 1 reply; 50+ messages in thread
From: NeilBrown @ 2025-06-26 10:22 UTC (permalink / raw)
  To: Song Liu
  Cc: Tingmao Wang, Mickaël Salaün, Song Liu,
	bpf@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, brauner@kernel.org,
	Kernel Team, andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, jack@suse.cz, kpsingh@kernel.org,
	mattbobrowski@google.com, Günther Noack

On Thu, 26 Jun 2025, Song Liu wrote:
> 
> 
> > On Jun 25, 2025, at 6:05 PM, NeilBrown <neil@brown.name> wrote:
> 
> [...]
> 
> >> 
> >> I can't speak for Mickaël, but a callback-based interface is less flexible
> >> (and _maybe_ less performant?).  Also, probably we will want to fallback
> >> to a reference-taking walk if the walk fails (rather than, say, retry
> >> infinitely), and this should probably use Song's proposed iterator.  I'm
> >> not sure if Song would be keen to rewrite this iterator patch series in
> >> callback style (to be clear, it doesn't necessarily seem like a good idea
> >> to me, and I'm not asking him to), which means that we will end up with
> >> the reference walk API being a "call this function repeatedly", and the
> >> rcu walk API taking a callback.  I think it is still workable (after all,
> >> if Landlock wants to reuse the code in the callback it can just call the
> >> callback function itself when doing the reference walk), but it seems a
> >> bit "ugly" to me.
> > 
> > call-back can have a performance impact (less opportunity for compiler
> > optimisation and CPU speculation), though less than taking spinlock and
> > references.  However Al and Christian have drawn a hard line against
> > making seq numbers visible outside VFS code so I think it is the
> > approach most likely to be accepted.
> > 
> > Certainly vfs_walk_ancestors() would fallback to ref-walk if rcu-walk
> > resulted in -ECHILD - just like all other path walking code in namei.c.
> > This would be largely transparent to the caller - the caller would only
> > see that the callback received a NULL path indicating a restart.  It
> > wouldn't need to know why.
> 
> I guess I misunderstood the proposal of vfs_walk_ancestors() 
> initially, so some clarification:
> 
> I think vfs_walk_ancestors() is good for the rcu-walk, and some 
> rcu-then-ref-walk. However, I don’t think it fits all use cases. 
> A reliable step-by-step ref-walk, like this set, works well with 
> BPF, and we want to keep it. 

The distinction between rcu-walk and ref-walk is an internal
implementation detail.  You as a caller shouldn't need to think about
the difference.  You just want to walk.  Note that LOOKUP_RCU is
documented in namei.h as "semi-internal".  The only uses outside of
core-VFS code is in individual filesystem's d_revalidate handler - they
are checking if they are allowed to sleep or not.  You should never
expect to pass LOOKUP_RCU to an VFS API - no other code does.

It might be reasonable for you as a caller to have some control over
whether the call can sleep or not.  LOOKUP_CACHED is a bit like that.
But for dotdot lookup the code will never sleep - so that is not
relevant.

I strongly suggest you stop thinking about rcu-walk vs ref-walk.  Think
about the needs of your code.  If you need a high-performance API, then
ask for a high-performance API, don't assume what form it will take or
what the internal implementation details will be.

I think you already have a clear answer that a step-by-step API will not
be read-only on the dcache (i.e.  it will adjust refcounts) and so will
not be high performance.  If you want high performance, you need to
accept a different style of API.

NeilBrown

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v5 bpf-next 0/5] bpf path iterator
  2025-06-26 10:22                 ` NeilBrown
@ 2025-06-26 14:28                   ` Song Liu
  2025-06-26 22:51                     ` NeilBrown
  0 siblings, 1 reply; 50+ messages in thread
From: Song Liu @ 2025-06-26 14:28 UTC (permalink / raw)
  To: NeilBrown
  Cc: Tingmao Wang, Mickaël Salaün, Song Liu,
	bpf@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, brauner@kernel.org,
	Kernel Team, andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, jack@suse.cz, kpsingh@kernel.org,
	mattbobrowski@google.com, Günther Noack



> On Jun 26, 2025, at 3:22 AM, NeilBrown <neil@brown.name> wrote:

[...]

>> I guess I misunderstood the proposal of vfs_walk_ancestors() 
>> initially, so some clarification:
>> 
>> I think vfs_walk_ancestors() is good for the rcu-walk, and some 
>> rcu-then-ref-walk. However, I don’t think it fits all use cases. 
>> A reliable step-by-step ref-walk, like this set, works well with 
>> BPF, and we want to keep it.
> 
> The distinction between rcu-walk and ref-walk is an internal
> implementation detail.  You as a caller shouldn't need to think about
> the difference.  You just want to walk.  Note that LOOKUP_RCU is
> documented in namei.h as "semi-internal".  The only uses outside of
> core-VFS code is in individual filesystem's d_revalidate handler - they
> are checking if they are allowed to sleep or not.  You should never
> expect to pass LOOKUP_RCU to an VFS API - no other code does.
> 
> It might be reasonable for you as a caller to have some control over
> whether the call can sleep or not.  LOOKUP_CACHED is a bit like that.
> But for dotdot lookup the code will never sleep - so that is not
> relevant.

Unfortunately, the BPF use case is more complicated. In some cases, 
the callback function cannot be call in rcu critical sections. For 
example, the callback may need to read xatter. For these cases, we
we cannot use RCU walk at all. 

> I strongly suggest you stop thinking about rcu-walk vs ref-walk.  Think
> about the needs of your code.  If you need a high-performance API, then
> ask for a high-performance API, don't assume what form it will take or
> what the internal implementation details will be.

At the moment, we need a ref-walk API on the BPF side. The RCU walk
is a totally separate topic. 

> I think you already have a clear answer that a step-by-step API will not
> be read-only on the dcache (i.e.  it will adjust refcounts) and so will
> not be high performance.  If you want high performance, you need to
> accept a different style of API.

Agreed. 

Thanks,
Song


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v5 bpf-next 0/5] bpf path iterator
  2025-06-26  9:43                 ` Mickaël Salaün
@ 2025-06-26 14:49                   ` Song Liu
  0 siblings, 0 replies; 50+ messages in thread
From: Song Liu @ 2025-06-26 14:49 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: NeilBrown, Tingmao Wang, Song Liu, bpf@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, brauner@kernel.org,
	Kernel Team, andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, jack@suse.cz, kpsingh@kernel.org,
	mattbobrowski@google.com, Günther Noack



> On Jun 26, 2025, at 2:43 AM, Mickaël Salaün <mic@digikod.net> wrote:

[...]

> Given the constraints this looks good to me.  Here is an updated API
> with two extra consts, an updated walk_cb() signature, and a new
> "flags" and without @root:
> 
> int vfs_walk_ancestors(struct path *path,
>                       bool (*walk_cb)(const struct path *ancestor, void *data),
>                       void *data, int flags)
> 
> The walk continue while walk_cb() returns true.  walk_cb() can then
> check if @ancestor is equal to a @root, or other properties.  The
> walk_cb() return value (if not bool) should not be returned by
> vfs_walk_ancestors() because a walk stop doesn't mean an error.
> 
> @path would be updated with latest ancestor path (e.g. @root).
> @flags could contain LOOKUP_RCU or not, which enables us to have
> walk_cb() not-RCU compatible.
> 
> When passing LOOKUP_RCU, if the first call to vfs_walk_ancestors()
> failed with -ECHILD, the caller can restart the walk by calling
> vfs_walk_ancestors() again but without LOOKUP_RCU.

IIUC, Neil is against using LOOKUP_RCU as input, and VFS folks
may share same the concerns. 

>> 
>> I guess I misunderstood the proposal of vfs_walk_ancestors() 
>> initially, so some clarification:
>> 
>> I think vfs_walk_ancestors() is good for the rcu-walk, and some 
>> rcu-then-ref-walk. However, I don’t think it fits all use cases. 
>> A reliable step-by-step ref-walk, like this set, works well with 
>> BPF, and we want to keep it.
> 
> The above updated API should work for both use cases: if the caller wants
> to walk only one level, walk_cb() can just always return false (and
> potentially save that it was called) and then stop the walk the first
> time it is called.  This makes it possible to write an eBPF helper with
> the same API as path_walk_parent(), while making the kernel API more
> flexible.

I don’t think this will be the same. Current path_walk_parent() 
holds a reference on parent path, and returns control to the caller. 
However, when vfs_walk_ancestors() returns, it should not hold any
extra reference, right? IOW, vfs_walk_ancestors may hold some 
reference between callbacks, but is expected to release these 
references before finally returning to the caller.

> Can we ship this set as-is (or after fixing the comment reported
>> by kernel test robot)? I really don’t think we need figure out 
>> all details about the rcu-walk here. 
>> 
>> Once this is landed, we can try implementing the rcu-walk
>> (vfs_walk_ancestors or some variation). If no one volunteers, I
>> can give it a try.
> 
> My understanding is that Christian only wants one helper (that should
> handle both use cases).  I think this updated API should be good enough
> for everyone.  Most of your code should stay the same.  What do you
> think?

Christian, could you please clarify this requirement?

Given different expectations in how the references are handled (see 
above), I don’t think we can fit all use cases in one API. However, 
if we find such an API in the future, which works for all cases, we 
can refactor BPF side code to use that instead. Therefore, even we 
have a “one API only” requirement, it is not necessary to delay this
set for a RCU-walk API. 

Thanks,
Song


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v5 bpf-next 0/5] bpf path iterator
  2025-06-26 14:28                   ` Song Liu
@ 2025-06-26 22:51                     ` NeilBrown
  2025-06-27  0:21                       ` Song Liu
  2025-07-07 10:46                       ` Christian Brauner
  0 siblings, 2 replies; 50+ messages in thread
From: NeilBrown @ 2025-06-26 22:51 UTC (permalink / raw)
  To: Song Liu
  Cc: Tingmao Wang, Mickaël Salaün, Song Liu,
	bpf@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, brauner@kernel.org,
	Kernel Team, andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, jack@suse.cz, kpsingh@kernel.org,
	mattbobrowski@google.com, Günther Noack

On Fri, 27 Jun 2025, Song Liu wrote:
> 
> 
> > On Jun 26, 2025, at 3:22 AM, NeilBrown <neil@brown.name> wrote:
> 
> [...]
> 
> >> I guess I misunderstood the proposal of vfs_walk_ancestors() 
> >> initially, so some clarification:
> >> 
> >> I think vfs_walk_ancestors() is good for the rcu-walk, and some 
> >> rcu-then-ref-walk. However, I don’t think it fits all use cases. 
> >> A reliable step-by-step ref-walk, like this set, works well with 
> >> BPF, and we want to keep it.
> > 
> > The distinction between rcu-walk and ref-walk is an internal
> > implementation detail.  You as a caller shouldn't need to think about
> > the difference.  You just want to walk.  Note that LOOKUP_RCU is
> > documented in namei.h as "semi-internal".  The only uses outside of
> > core-VFS code is in individual filesystem's d_revalidate handler - they
> > are checking if they are allowed to sleep or not.  You should never
> > expect to pass LOOKUP_RCU to an VFS API - no other code does.
> > 
> > It might be reasonable for you as a caller to have some control over
> > whether the call can sleep or not.  LOOKUP_CACHED is a bit like that.
> > But for dotdot lookup the code will never sleep - so that is not
> > relevant.
> 
> Unfortunately, the BPF use case is more complicated. In some cases, 
> the callback function cannot be call in rcu critical sections. For 
> example, the callback may need to read xatter. For these cases, we
> we cannot use RCU walk at all. 

I really think you should stop using the terms RCU walk and ref-walk.  I
think they might be focusing your thinking in an unhelpful direction.

The key issue about reading xattrs is that it might need to sleep.
Focusing on what might need to sleep and what will never need to sleep
is a useful approach - the distinction is wide spread in the kernel and
several function take a flag indicating if they are permitted to sleep,
or if failure when sleeping would be required.

So your above observation is better described as 

   The vfs_walk_ancestors() API has an (implicit) requirement that the
   callback mustn't sleep.  This is a problem for some use-cases
   where the call back might need to sleep - e.g. for accessing xattrs.

That is a good and useful observation.  I can see three possibly
responses:

1/ Add a vfs_walk_ancestors_maysleep() API for which the callback is
   always allowed to sleep.  I don't particularly like this approach.

2/ Use repeated calls to vfs_walk_parent() when the handling of each
   ancestor might need to sleep.  I see no problem with supporting both
   vfs_walk_ancestors() and vfs_walk_parent().  There is plenty of
   precedent for having different  interfaces for different use cases.

3/ Extend vfs_walk_ancestors() to pass a "may sleep" flag to the callback.
   If the callback finds that it needs to sleep but that "may sleep"
   isn't set, it returns some well known status, like -EWOULDBLOCK (or
   -ECHILD).  It can expect to be called again but with "may sleep" set.
   This is my preferred approach. There is precedent with the
   d_revalidate callbacks which works like this.
   I suspect that accessing xattrs might often be possible without
   sleeping.  It is conceivable that we could add a "may sleep" argument
   to vfs_getxattr() so that it could still often be used without
   requiring vfs_walk_ancestors() to permit sleeping.
   This would almost certainly require a clear demonstration that 
   there was a performance cost in not having the option of non-sleeping
   vfs_getxattr().

> 
> > I strongly suggest you stop thinking about rcu-walk vs ref-walk.  Think
> > about the needs of your code.  If you need a high-performance API, then
> > ask for a high-performance API, don't assume what form it will take or
> > what the internal implementation details will be.
> 
> At the moment, we need a ref-walk API on the BPF side. The RCU walk
> is a totally separate topic. 

Do you mean "we need step-by-step walking" or do you mean "we need to
potentially sleep for each ancestor"?  These are conceptually different
requirements, but I cannot tell which you mean when you talk about "RCU
walk".

Thanks,
NeilBrown

> 
> > I think you already have a clear answer that a step-by-step API will not
> > be read-only on the dcache (i.e.  it will adjust refcounts) and so will
> > not be high performance.  If you want high performance, you need to
> > accept a different style of API.
> 
> Agreed. 
> 
> Thanks,
> Song
> 
> 


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v5 bpf-next 0/5] bpf path iterator
  2025-06-26 22:51                     ` NeilBrown
@ 2025-06-27  0:21                       ` Song Liu
  2025-07-07 10:46                       ` Christian Brauner
  1 sibling, 0 replies; 50+ messages in thread
From: Song Liu @ 2025-06-27  0:21 UTC (permalink / raw)
  To: NeilBrown
  Cc: Tingmao Wang, Mickaël Salaün, Song Liu,
	bpf@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, brauner@kernel.org,
	Kernel Team, andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, jack@suse.cz, kpsingh@kernel.org,
	mattbobrowski@google.com, Günther Noack



> On Jun 26, 2025, at 3:51 PM, NeilBrown <neil@brown.name> wrote:

[...]

>> Unfortunately, the BPF use case is more complicated. In some cases, 
>> the callback function cannot be call in rcu critical sections. For 
>> example, the callback may need to read xatter. For these cases, we
>> we cannot use RCU walk at all.
> 
> I really think you should stop using the terms RCU walk and ref-walk.  I
> think they might be focusing your thinking in an unhelpful direction.
> 
> The key issue about reading xattrs is that it might need to sleep.
> Focusing on what might need to sleep and what will never need to sleep
> is a useful approach - the distinction is wide spread in the kernel and
> several function take a flag indicating if they are permitted to sleep,
> or if failure when sleeping would be required.
> 
> So your above observation is better described as 
> 
>   The vfs_walk_ancestors() API has an (implicit) requirement that the
>   callback mustn't sleep.  This is a problem for some use-cases
>   where the call back might need to sleep - e.g. for accessing xattrs.
> 
> That is a good and useful observation.  I can see three possibly
> responses:
> 
> 1/ Add a vfs_walk_ancestors_maysleep() API for which the callback is
>   always allowed to sleep.  I don't particularly like this approach.
> 
> 2/ Use repeated calls to vfs_walk_parent() when the handling of each
>   ancestor might need to sleep.  I see no problem with supporting both
>   vfs_walk_ancestors() and vfs_walk_parent().  There is plenty of
>   precedent for having different  interfaces for different use cases.

I prefer option 2. 

> 
> 3/ Extend vfs_walk_ancestors() to pass a "may sleep" flag to the callback.
>   If the callback finds that it needs to sleep but that "may sleep"
>   isn't set, it returns some well known status, like -EWOULDBLOCK (or
>   -ECHILD).  It can expect to be called again but with "may sleep" set.
>   This is my preferred approach. There is precedent with the
>   d_revalidate callbacks which works like this.
>   I suspect that accessing xattrs might often be possible without
>   sleeping.  It is conceivable that we could add a "may sleep" argument
>   to vfs_getxattr() so that it could still often be used without
>   requiring vfs_walk_ancestors() to permit sleeping.
>   This would almost certainly require a clear demonstration that 
>   there was a performance cost in not having the option of non-sleeping
>   vfs_getxattr().

For built-in kernel code, I can see this works. However, I don’t see 
why it is necessary to introduce the extra complexity of -EWOULDBLOCK, 
and vfs_get_xattr_cannot_sleep, etc. A separate step-by-step walking
API is much cleaner.

> 
>>> I strongly suggest you stop thinking about rcu-walk vs ref-walk.  Think
>>> about the needs of your code.  If you need a high-performance API, then
>>> ask for a high-performance API, don't assume what form it will take or
>>> what the internal implementation details will be.
>> 
>> At the moment, we need a ref-walk API on the BPF side. The RCU walk
>> is a totally separate topic.
> 
> Do you mean "we need step-by-step walking" or do you mean "we need to
> potentially sleep for each ancestor"?  These are conceptually different
> requirements, but I cannot tell which you mean when you talk about "RCU
> walk”.

To be extra clear, I mean we need "step-by-step and 
take-reference-on-each-step walking”, for existing use cases. 

In the future, if it is possible to have a “do-not-take-reference, 
cannot-sleep, callback-based walking”. We may try to use that for 
some use cases. But that won’t replace step-by-step walking for 
all users. 

Thanks,
Song



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v5 bpf-next 0/5] bpf path iterator
  2025-06-24 18:45   ` Mickaël Salaün
  2025-06-24 21:38     ` NeilBrown
@ 2025-07-03  5:04     ` Song Liu
  1 sibling, 0 replies; 50+ messages in thread
From: Song Liu @ 2025-07-03  5:04 UTC (permalink / raw)
  To: Mickaël Salaün, brauner
  Cc: bpf, linux-fsdevel, linux-kernel, linux-security-module,
	kernel-team, andrii, eddyz87, ast, daniel, martin.lau, viro, jack,
	kpsingh, mattbobrowski, m, neil, Günther Noack

Hi Christian,

On Tue, Jun 24, 2025 at 11:46 AM Mickaël Salaün <mic@digikod.net> wrote:
>
> On Fri, Jun 20, 2025 at 02:59:17PM -0700, Song Liu wrote:
> > Hi Christian, Mickaël, and folks,
> >
> > Could you please share your comments on this version? Does this
> > look sane?
>
> This looks good to me but we need to know what is the acceptable next
> step to support RCU.  If we can go with another _rcu helper, I'm good
> with the current approach, otherwise we need to figure out a way to
> leverage the current helper to make it compatible with callers being in
> a RCU read-side critical section while leveraging safe path walk (i.e.
> several calls to path_walk_parent).

Could you please share your suggestions on this topic? RCU
protected path walk out of fs/ seems controversial in multiple
ways. Do we have to let this set wait indefinitely for a solution
of RCU protected path walk? I would like to highlight that this
set doesn't add any persistent APIs. path_walk_parent() is not
in the UAPI, nor exported. If a newer and better API is created,
we can refactor bpf and landlock code and deprecate
path_walk_parent().

Thanks,
Song

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v5 bpf-next 2/5] landlock: Use path_walk_parent()
  2025-06-17  6:11 ` [PATCH v5 bpf-next 2/5] landlock: Use path_walk_parent() Song Liu
@ 2025-07-03 18:29   ` Mickaël Salaün
  2025-07-03 22:27     ` Song Liu
  0 siblings, 1 reply; 50+ messages in thread
From: Mickaël Salaün @ 2025-07-03 18:29 UTC (permalink / raw)
  To: Song Liu, brauner
  Cc: bpf, linux-fsdevel, linux-kernel, linux-security-module,
	kernel-team, andrii, eddyz87, ast, daniel, martin.lau, viro, jack,
	kpsingh, mattbobrowski, m, neil, Günther Noack, Jann Horn

On Mon, Jun 16, 2025 at 11:11:13PM -0700, Song Liu wrote:
> Use path_walk_parent() to walk a path up to its parent.
> 
> No functional changes intended.

Using this helper actualy fixes the issue highlighted by Al.  Even if it
was reported after the first version of this patch series, the issue
should be explained in the commit message and these tags should be
added:

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Closes: https://lore.kernel.org/r/20250529231018.GP2023217@ZenIV
Fixes: cb2c7d1a1776 ("landlock: Support filesystem access-control")

I like this new helper but we should have a clear plan to be able to
call such helper in a RCU read-side critical section before we merge
this series.  We're still waiting for Christian.

I sent a patch to fix the handling of disconnected directories for
Landlock, and it will need to be backported:
https://lore.kernel.org/all/20250701183812.3201231-1-mic@digikod.net/
Unfortunately a rebase would be needed for the path_walk_parent patch,
but I can take it in my tree if everyone is OK.

However, users of path_walk_parent() would still have to properly deal
with such disconnected directories.  The Landlock fix I sent takes a
safe approach by handling disconnected directories such as only their
mount point is actually taken into account for access control decision
(see rationale in the patch series).  I'm wondering if
path_walk_parent() should not help its users avoid the same issue, or at
least force them to make an explicit and informed choice.

> 
> Signed-off-by: Song Liu <song@kernel.org>
> ---
>  security/landlock/fs.c | 30 ++++++++----------------------
>  1 file changed, 8 insertions(+), 22 deletions(-)
> 
> diff --git a/security/landlock/fs.c b/security/landlock/fs.c
> index 6fee7c20f64d..e26ab8c34dd4 100644
> --- a/security/landlock/fs.c
> +++ b/security/landlock/fs.c
> @@ -837,8 +837,8 @@ static bool is_access_to_paths_allowed(
>  	 * restriction.
>  	 */
>  	while (true) {
> -		struct dentry *parent_dentry;
>  		const struct landlock_rule *rule;
> +		struct path root = {};
>  
>  		/*
>  		 * If at least all accesses allowed on the destination are
> @@ -895,34 +895,20 @@ static bool is_access_to_paths_allowed(
>  		/* Stops when a rule from each layer grants access. */
>  		if (allowed_parent1 && allowed_parent2)
>  			break;
> -jump_up:
> -		if (walker_path.dentry == walker_path.mnt->mnt_root) {
> -			if (follow_up(&walker_path)) {
> -				/* Ignores hidden mount points. */
> -				goto jump_up;
> -			} else {
> -				/*
> -				 * Stops at the real root.  Denies access
> -				 * because not all layers have granted access.
> -				 */
> -				break;
> -			}
> -		}
> -		if (unlikely(IS_ROOT(walker_path.dentry))) {
> +
> +		if (unlikely(IS_ROOT(walker_path.dentry)) &&
> +		    (walker_path.mnt->mnt_flags & MNT_INTERNAL)) {

This would not fit well with the ongoing Landlock fix because
!MNT_INTERNAL root directories should also be handled specifically, but
only if they are not mount points.

>  			/*
>  			 * Stops at disconnected root directories.  Only allows
>  			 * access to internal filesystems (e.g. nsfs, which is
>  			 * reachable through /proc/<pid>/ns/<namespace>).
>  			 */
> -			if (walker_path.mnt->mnt_flags & MNT_INTERNAL) {
> -				allowed_parent1 = true;
> -				allowed_parent2 = true;
> -			}
> +			allowed_parent1 = true;
> +			allowed_parent2 = true;
>  			break;
>  		}
> -		parent_dentry = dget_parent(walker_path.dentry);
> -		dput(walker_path.dentry);
> -		walker_path.dentry = parent_dentry;
> +		if (path_walk_parent(&walker_path, &root))
> +			break;
>  	}
>  	path_put(&walker_path);
>  
> -- 
> 2.47.1
> 
> 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v5 bpf-next 2/5] landlock: Use path_walk_parent()
  2025-07-03 18:29   ` Mickaël Salaün
@ 2025-07-03 22:27     ` Song Liu
  2025-07-04  9:00       ` Mickaël Salaün
  0 siblings, 1 reply; 50+ messages in thread
From: Song Liu @ 2025-07-03 22:27 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Song Liu, brauner@kernel.org, bpf@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, jack@suse.cz, kpsingh@kernel.org,
	mattbobrowski@google.com, m@maowtm.org, neil@brown.name,
	Günther Noack, Jann Horn

Hi Mickaël,

> On Jul 3, 2025, at 11:29 AM, Mickaël Salaün <mic@digikod.net> wrote:
> 
> On Mon, Jun 16, 2025 at 11:11:13PM -0700, Song Liu wrote:
>> Use path_walk_parent() to walk a path up to its parent.
>> 
>> No functional changes intended.
> 
> Using this helper actualy fixes the issue highlighted by Al.  Even if it
> was reported after the first version of this patch series, the issue
> should be explained in the commit message and these tags should be
> added:
> 
> Reported-by: Al Viro <viro@zeniv.linux.org.uk>
> Closes: https://lore.kernel.org/r/20250529231018.GP2023217@ZenIV 
> Fixes: cb2c7d1a1776 ("landlock: Support filesystem access-control")
> 
> I like this new helper but we should have a clear plan to be able to
> call such helper in a RCU read-side critical section before we merge
> this series.  We're still waiting for Christian.
> 
> I sent a patch to fix the handling of disconnected directories for
> Landlock, and it will need to be backported:
> https://lore.kernel.org/all/20250701183812.3201231-1-mic@digikod.net/
> Unfortunately a rebase would be needed for the path_walk_parent patch,
> but I can take it in my tree if everyone is OK.

The fix above also touches VFS code (makes path_connected available 
out of namei.c. It probably should also go through VFS tree? 

Maybe you can send 1/5 and 2/5 of this set (with necessary changes) 
and your fix together to VFS tree. Then, I will see how to route the
BPF side patches. 

Thanks,
Song



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v5 bpf-next 2/5] landlock: Use path_walk_parent()
  2025-07-03 22:27     ` Song Liu
@ 2025-07-04  9:00       ` Mickaël Salaün
  2025-07-06 22:29         ` Song Liu
  2025-07-07 10:28         ` Christian Brauner
  0 siblings, 2 replies; 50+ messages in thread
From: Mickaël Salaün @ 2025-07-04  9:00 UTC (permalink / raw)
  To: Song Liu
  Cc: Song Liu, brauner@kernel.org, bpf@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, jack@suse.cz, kpsingh@kernel.org,
	mattbobrowski@google.com, m@maowtm.org, neil@brown.name,
	Günther Noack, Jann Horn

On Thu, Jul 03, 2025 at 10:27:02PM +0000, Song Liu wrote:
> Hi Mickaël,
> 
> > On Jul 3, 2025, at 11:29 AM, Mickaël Salaün <mic@digikod.net> wrote:
> > 
> > On Mon, Jun 16, 2025 at 11:11:13PM -0700, Song Liu wrote:
> >> Use path_walk_parent() to walk a path up to its parent.
> >> 
> >> No functional changes intended.
> > 
> > Using this helper actualy fixes the issue highlighted by Al.  Even if it
> > was reported after the first version of this patch series, the issue
> > should be explained in the commit message and these tags should be
> > added:
> > 
> > Reported-by: Al Viro <viro@zeniv.linux.org.uk>
> > Closes: https://lore.kernel.org/r/20250529231018.GP2023217@ZenIV 
> > Fixes: cb2c7d1a1776 ("landlock: Support filesystem access-control")
> > 
> > I like this new helper but we should have a clear plan to be able to
> > call such helper in a RCU read-side critical section before we merge
> > this series.  We're still waiting for Christian.
> > 
> > I sent a patch to fix the handling of disconnected directories for
> > Landlock, and it will need to be backported:
> > https://lore.kernel.org/all/20250701183812.3201231-1-mic@digikod.net/
> > Unfortunately a rebase would be needed for the path_walk_parent patch,
> > but I can take it in my tree if everyone is OK.
> 
> The fix above also touches VFS code (makes path_connected available 
> out of namei.c. It probably should also go through VFS tree? 
> 
> Maybe you can send 1/5 and 2/5 of this set (with necessary changes) 
> and your fix together to VFS tree. Then, I will see how to route the
> BPF side patches. 

That could work, but because it would be much more Landlock-specific
code than VFS-specific code, and there will probably be a few versions
of my fixes, I'd prefer to keep this into my tree if VFS folks are OK.
BTW, my fixes already touch the VFS subsystem a bit.

However, as pointed out in my previous email, the disconnected directory
case should be carefully considered for the path_walk_parent() users to
avoid BPF LSM programs having the same issue I'm fixing for Landlock.
The safe approaches I can think of to avoid this issue for BPF programs
while making the interface efficient (by not calling path_connected()
after each path_walk_parent() call) is to either have some kind of
iterator as Tingmao proposed, or a callback function as Neil proposed.
The callback approach looks simpler and more future-proof, but I guess
you'll have to make it compatible with the eBPF runtime.  I think the
best approach would be to have a VFS API with a callback, and a BPF
helper (leveraging this VFS API) with an iterator state.

I'm aware that this disconnected directory fix might delay your patch
series, but the good news is that it's an opportunity for eBPF programs
to not have the issue I'm fixing for Landlock.

> 
> Thanks,
> Song
> 
> 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v5 bpf-next 1/5] namei: Introduce new helper function path_walk_parent()
  2025-06-17  6:11 ` [PATCH v5 bpf-next 1/5] namei: Introduce new helper function path_walk_parent() Song Liu
  2025-06-18  1:02   ` kernel test robot
  2025-06-24 12:18   ` Jan Kara
@ 2025-07-04 17:40   ` Yonghong Song
  2025-07-06 23:54     ` Song Liu
  2 siblings, 1 reply; 50+ messages in thread
From: Yonghong Song @ 2025-07-04 17:40 UTC (permalink / raw)
  To: Song Liu, bpf, linux-fsdevel, linux-kernel, linux-security-module
  Cc: kernel-team, andrii, eddyz87, ast, daniel, martin.lau, viro,
	brauner, jack, kpsingh, mattbobrowski, m, neil



On 6/16/25 11:11 PM, Song Liu wrote:
> This helper walks an input path to its parent. Logic are added to handle
> walking across mount tree.
>
> This will be used by landlock, and BPF LSM.
>
> Suggested-by: Neil Brown <neil@brown.name>
> Signed-off-by: Song Liu <song@kernel.org>
> ---
>   fs/namei.c            | 95 +++++++++++++++++++++++++++++++++++--------
>   include/linux/namei.h |  2 +
>   2 files changed, 79 insertions(+), 18 deletions(-)
>
> diff --git a/fs/namei.c b/fs/namei.c
> index 4bb889fc980b..d0557c0b5cc8 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -2048,36 +2048,95 @@ static struct dentry *follow_dotdot_rcu(struct nameidata *nd)
>   	return nd->path.dentry;
>   }
>   
> -static struct dentry *follow_dotdot(struct nameidata *nd)
> +/**
> + * __path_walk_parent - Find the parent of the given struct path
> + * @path  - The struct path to start from
> + * @root  - A struct path which serves as a boundary not to be crosses.
> + *        - If @root is zero'ed, walk all the way to global root.
> + * @flags - Some LOOKUP_ flags.
> + *
> + * Find and return the dentry for the parent of the given path
> + * (mount/dentry). If the given path is the root of a mounted tree, it
> + * is first updated to the mount point on which that tree is mounted.
> + *
> + * If %LOOKUP_NO_XDEV is given, then *after* the path is updated to a new
> + * mount, the error EXDEV is returned.
> + *
> + * If no parent can be found, either because the tree is not mounted or
> + * because the @path matches the @root, then @path->dentry is returned
> + * unless @flags contains %LOOKUP_BENEATH, in which case -EXDEV is returned.
> + *
> + * Returns: either an ERR_PTR() or the chosen parent which will have had
> + * the refcount incremented.
> + */
> +static struct dentry *__path_walk_parent(struct path *path, const struct path *root, int flags)
>   {
> -	struct dentry *parent;
> -
> -	if (path_equal(&nd->path, &nd->root))
> +	if (path_equal(path, root))
>   		goto in_root;
> -	if (unlikely(nd->path.dentry == nd->path.mnt->mnt_root)) {
> -		struct path path;
> +	if (unlikely(path->dentry == path->mnt->mnt_root)) {
> +		struct path new_path;
>   
> -		if (!choose_mountpoint(real_mount(nd->path.mnt),
> -				       &nd->root, &path))
> +		if (!choose_mountpoint(real_mount(path->mnt),
> +				       root, &new_path))
>   			goto in_root;
> -		path_put(&nd->path);
> -		nd->path = path;
> -		nd->inode = path.dentry->d_inode;
> -		if (unlikely(nd->flags & LOOKUP_NO_XDEV))
> +		path_put(path);
> +		*path = new_path;
> +		if (unlikely(flags & LOOKUP_NO_XDEV))
>   			return ERR_PTR(-EXDEV);
>   	}
>   	/* rare case of legitimate dget_parent()... */
> -	parent = dget_parent(nd->path.dentry);
> +	return dget_parent(path->dentry);

I have some confusion with this patch when crossing mount boundary.

In d_path.c, we have

static int __prepend_path(const struct dentry *dentry, const struct mount *mnt,
                           const struct path *root, struct prepend_buffer *p)
{
         while (dentry != root->dentry || &mnt->mnt != root->mnt) {
                 const struct dentry *parent = READ_ONCE(dentry->d_parent);

                 if (dentry == mnt->mnt.mnt_root) {
                         struct mount *m = READ_ONCE(mnt->mnt_parent);
                         struct mnt_namespace *mnt_ns;

                         if (likely(mnt != m)) {
                                 dentry = READ_ONCE(mnt->mnt_mountpoint);
                                 mnt = m;
                                 continue;
                         }
                         /* Global root */
                         mnt_ns = READ_ONCE(mnt->mnt_ns);
                         /* open-coded is_mounted() to use local mnt_ns */
                         if (!IS_ERR_OR_NULL(mnt_ns) && !is_anon_ns(mnt_ns))
                                 return 1;       // absolute root
                         else
                                 return 2;       // detached or not attached yet
                 }

                 if (unlikely(dentry == parent))
                         /* Escaped? */
                         return 3;

                 prefetch(parent);
                 if (!prepend_name(p, &dentry->d_name))
                         break;
                 dentry = parent;
         }
         return 0;
}

At the mount boundary and not at root mount, the code has
	dentry = READ_ONCE(mnt->mnt_mountpoint);
	mnt = m; /* 'mnt' will be parent mount */
	continue;

After that, we have
	const struct dentry *parent = READ_ONCE(dentry->d_parent);
	if (dentry == mnt->mnt.mnt_root) {
		/* assume this is false */
	}
	...
	prefetch(parent);
         if (!prepend_name(p, &dentry->d_name))
                 break;
         dentry = parent;

So the prepend_name(p, &dentry->d_name) is actually from mnt->mnt_mountpoint.

In your above code, maybe we should return path->dentry in the below if statement?

         if (unlikely(path->dentry == path->mnt->mnt_root)) {
                 struct path new_path;

                 if (!choose_mountpoint(real_mount(path->mnt),
                                        root, &new_path))
                         goto in_root;
                 path_put(path);
                 *path = new_path;
                 if (unlikely(flags & LOOKUP_NO_XDEV))
                         return ERR_PTR(-EXDEV);
+		return path->dentry;
         }
         /* rare case of legitimate dget_parent()... */
         return dget_parent(path->dentry);

Also, could you add some selftests cross mount points? This will
have more coverages with __path_walk_parent().

> +
> +in_root:
> +	if (unlikely(flags & LOOKUP_BENEATH))
> +		return ERR_PTR(-EXDEV);
> +	return dget(path->dentry);
> +}
> +
> +/**
> + * path_walk_parent - Walk to the parent of path
> + * @path: input and output path.
> + * @root: root of the path walk, do not go beyond this root. If @root is
> + *        zero'ed, walk all the way to real root.
> + *
> + * Given a path, find the parent path. Replace @path with the parent path.
> + * If we were already at the real root or a disconnected root, @path is
> + * not changed.
> + *
> + * Returns:
> + *  0  - if @path is updated to its parent.
> + *  <0 - if @path is already the root (real root or @root).
> + */
> +int path_walk_parent(struct path *path, const struct path *root)
> +{
> +	struct dentry *parent;
> +
> +	parent = __path_walk_parent(path, root, LOOKUP_BENEATH);
> +
> +	if (IS_ERR(parent))
> +		return PTR_ERR(parent);
> +
> +	if (parent == path->dentry) {
> +		dput(parent);
> +		return -ENOENT;
> +	}
> +	dput(path->dentry);
> +	path->dentry = parent;
> +	return 0;
> +}
> +

[...]


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v5 bpf-next 2/5] landlock: Use path_walk_parent()
  2025-07-04  9:00       ` Mickaël Salaün
@ 2025-07-06 22:29         ` Song Liu
  2025-07-07 10:28         ` Christian Brauner
  1 sibling, 0 replies; 50+ messages in thread
From: Song Liu @ 2025-07-06 22:29 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Song Liu, brauner@kernel.org, bpf@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, jack@suse.cz, kpsingh@kernel.org,
	mattbobrowski@google.com, m@maowtm.org, neil@brown.name,
	Günther Noack, Jann Horn



> On Jul 4, 2025, at 2:00 AM, Mickaël Salaün <mic@digikod.net> wrote:
> 
> On Thu, Jul 03, 2025 at 10:27:02PM +0000, Song Liu wrote:
>> Hi Mickaël,
>> 
>>> On Jul 3, 2025, at 11:29 AM, Mickaël Salaün <mic@digikod.net> wrote:
>>> 
>>> On Mon, Jun 16, 2025 at 11:11:13PM -0700, Song Liu wrote:
>>>> Use path_walk_parent() to walk a path up to its parent.
>>>> 
>>>> No functional changes intended.
>>> 
>>> Using this helper actualy fixes the issue highlighted by Al.  Even if it
>>> was reported after the first version of this patch series, the issue
>>> should be explained in the commit message and these tags should be
>>> added:
>>> 
>>> Reported-by: Al Viro <viro@zeniv.linux.org.uk>
>>> Closes: https://lore.kernel.org/r/20250529231018.GP2023217@ZenIV  
>>> Fixes: cb2c7d1a1776 ("landlock: Support filesystem access-control")
>>> 
>>> I like this new helper but we should have a clear plan to be able to
>>> call such helper in a RCU read-side critical section before we merge
>>> this series.  We're still waiting for Christian.
>>> 
>>> I sent a patch to fix the handling of disconnected directories for
>>> Landlock, and it will need to be backported:
>>> https://lore.kernel.org/all/20250701183812.3201231-1-mic@digikod.net/ 
>>> Unfortunately a rebase would be needed for the path_walk_parent patch,
>>> but I can take it in my tree if everyone is OK.
>> 
>> The fix above also touches VFS code (makes path_connected available 
>> out of namei.c. It probably should also go through VFS tree? 
>> 
>> Maybe you can send 1/5 and 2/5 of this set (with necessary changes) 
>> and your fix together to VFS tree. Then, I will see how to route the
>> BPF side patches.
> 
> That could work, but because it would be much more Landlock-specific
> code than VFS-specific code, and there will probably be a few versions
> of my fixes, I'd prefer to keep this into my tree if VFS folks are OK.
> BTW, my fixes already touch the VFS subsystem a bit.
> 
> However, as pointed out in my previous email, the disconnected directory
> case should be carefully considered for the path_walk_parent() users to
> avoid BPF LSM programs having the same issue I'm fixing for Landlock.
> The safe approaches I can think of to avoid this issue for BPF programs
> while making the interface efficient (by not calling path_connected()
> after each path_walk_parent() call) is to either have some kind of
> iterator as Tingmao proposed, or a callback function as Neil proposed.
> The callback approach looks simpler and more future-proof, but I guess
> you'll have to make it compatible with the eBPF runtime.  I think the
> best approach would be to have a VFS API with a callback, and a BPF
> helper (leveraging this VFS API) with an iterator state.

Since we are proposing an open-coded BPF iterator. Having a real 
callback, which is no longer an open coded iterator, requires more
work. At the moment, it is easier to just add a path_connected call
in bpf_iter_path_next(). 

Thanks,
Song


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v5 bpf-next 1/5] namei: Introduce new helper function path_walk_parent()
  2025-07-04 17:40   ` Yonghong Song
@ 2025-07-06 23:54     ` Song Liu
  2025-07-07 17:53       ` Yonghong Song
  0 siblings, 1 reply; 50+ messages in thread
From: Song Liu @ 2025-07-06 23:54 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Song Liu, bpf@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz,
	kpsingh@kernel.org, mattbobrowski@google.com, m@maowtm.org,
	neil@brown.name



> On Jul 4, 2025, at 10:40 AM, Yonghong Song <yonghong.song@linux.dev> wrote:
[...]
>> +static struct dentry *__path_walk_parent(struct path *path, const struct path *root, int flags)
>>  {
>> - struct dentry *parent;
>> -
>> - if (path_equal(&nd->path, &nd->root))
>> + if (path_equal(path, root))
>>   goto in_root;
>> - if (unlikely(nd->path.dentry == nd->path.mnt->mnt_root)) {
>> - struct path path;
>> + if (unlikely(path->dentry == path->mnt->mnt_root)) {
>> + struct path new_path;
>>  - if (!choose_mountpoint(real_mount(nd->path.mnt),
>> -       &nd->root, &path))
>> + if (!choose_mountpoint(real_mount(path->mnt),
>> +       root, &new_path))
>>   goto in_root;
>> - path_put(&nd->path);
>> - nd->path = path;
>> - nd->inode = path.dentry->d_inode;
>> - if (unlikely(nd->flags & LOOKUP_NO_XDEV))
>> + path_put(path);
>> + *path = new_path;
>> + if (unlikely(flags & LOOKUP_NO_XDEV))
>>   return ERR_PTR(-EXDEV);
>>   }
>>   /* rare case of legitimate dget_parent()... */
>> - parent = dget_parent(nd->path.dentry);
>> + return dget_parent(path->dentry);
> 
> I have some confusion with this patch when crossing mount boundary.
> 
> In d_path.c, we have
> 
> static int __prepend_path(const struct dentry *dentry, const struct mount *mnt,
>                          const struct path *root, struct prepend_buffer *p)
> {
>        while (dentry != root->dentry || &mnt->mnt != root->mnt) {
>                const struct dentry *parent = READ_ONCE(dentry->d_parent);
> 
>                if (dentry == mnt->mnt.mnt_root) {
>                        struct mount *m = READ_ONCE(mnt->mnt_parent);
>                        struct mnt_namespace *mnt_ns;
> 
>                        if (likely(mnt != m)) {
>                                dentry = READ_ONCE(mnt->mnt_mountpoint);
>                                mnt = m;
>                                continue;
>                        }
>                        /* Global root */
>                        mnt_ns = READ_ONCE(mnt->mnt_ns);
>                        /* open-coded is_mounted() to use local mnt_ns */
>                        if (!IS_ERR_OR_NULL(mnt_ns) && !is_anon_ns(mnt_ns))
>                                return 1;       // absolute root
>                        else
>                                return 2;       // detached or not attached yet
>                }
> 
>                if (unlikely(dentry == parent))
>                        /* Escaped? */
>                        return 3;
> 
>                prefetch(parent);
>                if (!prepend_name(p, &dentry->d_name))
>                        break;
>                dentry = parent;
>        }
>        return 0;
> }
> 
> At the mount boundary and not at root mount, the code has
> dentry = READ_ONCE(mnt->mnt_mountpoint);
> mnt = m; /* 'mnt' will be parent mount */
> continue;
> 
> After that, we have
> const struct dentry *parent = READ_ONCE(dentry->d_parent);
> if (dentry == mnt->mnt.mnt_root) {
> /* assume this is false */
> }
> ...
> prefetch(parent);
>        if (!prepend_name(p, &dentry->d_name))
>                break;
>        dentry = parent;
> 
> So the prepend_name(p, &dentry->d_name) is actually from mnt->mnt_mountpoint.

I am not quite following the question. In the code below:

               if (dentry == mnt->mnt.mnt_root) {
                       struct mount *m = READ_ONCE(mnt->mnt_parent);
                       struct mnt_namespace *mnt_ns;

                       if (likely(mnt != m)) {
                               dentry = READ_ONCE(mnt->mnt_mountpoint);
                               mnt = m;
                               continue;
/* We either continue, here */

                       }
                       /* Global root */
                       mnt_ns = READ_ONCE(mnt->mnt_ns);
                       /* open-coded is_mounted() to use local mnt_ns */
                       if (!IS_ERR_OR_NULL(mnt_ns) && !is_anon_ns(mnt_ns))
                               return 1;       // absolute root
                       else
                               return 2;       // detached or not attached yet
/* Or return here */
               }

So we will not hit prepend_name(). Does this answer the 
question?

> 
> In your above code, maybe we should return path->dentry in the below if statement?
> 
>        if (unlikely(path->dentry == path->mnt->mnt_root)) {
>                struct path new_path;
> 
>                if (!choose_mountpoint(real_mount(path->mnt),
>                                       root, &new_path))
>                        goto in_root;
>                path_put(path);
>                *path = new_path;
>                if (unlikely(flags & LOOKUP_NO_XDEV))
>                        return ERR_PTR(-EXDEV);
> + return path->dentry;
>        }
>        /* rare case of legitimate dget_parent()... */
>        return dget_parent(path->dentry);
> 
> Also, could you add some selftests cross mount points? This will
> have more coverages with __path_walk_parent().

Yeah, I will try to add more tests in the next revision. 

Thanks,
Song


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v5 bpf-next 2/5] landlock: Use path_walk_parent()
  2025-07-04  9:00       ` Mickaël Salaün
  2025-07-06 22:29         ` Song Liu
@ 2025-07-07 10:28         ` Christian Brauner
  1 sibling, 0 replies; 50+ messages in thread
From: Christian Brauner @ 2025-07-07 10:28 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Song Liu, Song Liu, bpf@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, jack@suse.cz, kpsingh@kernel.org,
	mattbobrowski@google.com, m@maowtm.org, neil@brown.name,
	Günther Noack, Jann Horn

On Fri, Jul 04, 2025 at 11:00:37AM +0200, Mickaël Salaün wrote:
> On Thu, Jul 03, 2025 at 10:27:02PM +0000, Song Liu wrote:
> > Hi Mickaël,
> > 
> > > On Jul 3, 2025, at 11:29 AM, Mickaël Salaün <mic@digikod.net> wrote:
> > > 
> > > On Mon, Jun 16, 2025 at 11:11:13PM -0700, Song Liu wrote:
> > >> Use path_walk_parent() to walk a path up to its parent.
> > >> 
> > >> No functional changes intended.
> > > 
> > > Using this helper actualy fixes the issue highlighted by Al.  Even if it
> > > was reported after the first version of this patch series, the issue
> > > should be explained in the commit message and these tags should be
> > > added:
> > > 
> > > Reported-by: Al Viro <viro@zeniv.linux.org.uk>
> > > Closes: https://lore.kernel.org/r/20250529231018.GP2023217@ZenIV 
> > > Fixes: cb2c7d1a1776 ("landlock: Support filesystem access-control")
> > > 
> > > I like this new helper but we should have a clear plan to be able to
> > > call such helper in a RCU read-side critical section before we merge
> > > this series.  We're still waiting for Christian.
> > > 
> > > I sent a patch to fix the handling of disconnected directories for
> > > Landlock, and it will need to be backported:
> > > https://lore.kernel.org/all/20250701183812.3201231-1-mic@digikod.net/
> > > Unfortunately a rebase would be needed for the path_walk_parent patch,
> > > but I can take it in my tree if everyone is OK.
> > 
> > The fix above also touches VFS code (makes path_connected available 
> > out of namei.c. It probably should also go through VFS tree? 
> > 
> > Maybe you can send 1/5 and 2/5 of this set (with necessary changes) 
> > and your fix together to VFS tree. Then, I will see how to route the
> > BPF side patches. 
> 
> That could work, but because it would be much more Landlock-specific
> code than VFS-specific code, and there will probably be a few versions
> of my fixes, I'd prefer to keep this into my tree if VFS folks are OK.
> BTW, my fixes already touch the VFS subsystem a bit.

Under specific circumstances we will accept very minor changes to VFS
code to go through selected other trees depending on the amount of trust
between the respective trees. Afaict, your series just exports a
function. I'll take a look at it.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v5 bpf-next 0/5] bpf path iterator
  2025-06-26  1:05             ` NeilBrown
  2025-06-26  5:52               ` Song Liu
@ 2025-07-07 10:43               ` Christian Brauner
  1 sibling, 0 replies; 50+ messages in thread
From: Christian Brauner @ 2025-07-07 10:43 UTC (permalink / raw)
  To: NeilBrown
  Cc: Tingmao Wang, Mickaël Salaün, Song Liu, bpf,
	linux-fsdevel, linux-kernel, linux-security-module, kernel-team,
	andrii, eddyz87, ast, daniel, martin.lau, viro, jack, kpsingh,
	mattbobrowski, Günther Noack

> Those are implementation details internal to namei.c.  Certainly this
> function wouldn't use all of the fields in nameidata, but it doesn't
> hurt to have a few fields in a struct on the stack which don't get used.
> Keeping the code simple and uniform is much more important.  Using

Exactly.

> Certainly vfs_walk_ancestors() would fallback to ref-walk if rcu-walk
> resulted in -ECHILD - just like all other path walking code in namei.c.
> This would be largely transparent to the caller - the caller would only
> see that the callback received a NULL path indicating a restart.  It
> wouldn't need to know why.

Yes, that's also what I mentioned in an earlier mail.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v5 bpf-next 0/5] bpf path iterator
  2025-06-26 22:51                     ` NeilBrown
  2025-06-27  0:21                       ` Song Liu
@ 2025-07-07 10:46                       ` Christian Brauner
  2025-07-07 11:17                         ` Christian Brauner
  1 sibling, 1 reply; 50+ messages in thread
From: Christian Brauner @ 2025-07-07 10:46 UTC (permalink / raw)
  To: NeilBrown
  Cc: Song Liu, Tingmao Wang, Mickaël Salaün, Song Liu,
	bpf@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, jack@suse.cz, kpsingh@kernel.org,
	mattbobrowski@google.com, Günther Noack

On Fri, Jun 27, 2025 at 08:51:21AM +1000, NeilBrown wrote:
> On Fri, 27 Jun 2025, Song Liu wrote:
> > 
> > 
> > > On Jun 26, 2025, at 3:22 AM, NeilBrown <neil@brown.name> wrote:
> > 
> > [...]
> > 
> > >> I guess I misunderstood the proposal of vfs_walk_ancestors() 
> > >> initially, so some clarification:
> > >> 
> > >> I think vfs_walk_ancestors() is good for the rcu-walk, and some 
> > >> rcu-then-ref-walk. However, I don’t think it fits all use cases. 
> > >> A reliable step-by-step ref-walk, like this set, works well with 
> > >> BPF, and we want to keep it.
> > > 
> > > The distinction between rcu-walk and ref-walk is an internal
> > > implementation detail.  You as a caller shouldn't need to think about
> > > the difference.  You just want to walk.  Note that LOOKUP_RCU is
> > > documented in namei.h as "semi-internal".  The only uses outside of
> > > core-VFS code is in individual filesystem's d_revalidate handler - they
> > > are checking if they are allowed to sleep or not.  You should never
> > > expect to pass LOOKUP_RCU to an VFS API - no other code does.
> > > 
> > > It might be reasonable for you as a caller to have some control over
> > > whether the call can sleep or not.  LOOKUP_CACHED is a bit like that.
> > > But for dotdot lookup the code will never sleep - so that is not
> > > relevant.
> > 
> > Unfortunately, the BPF use case is more complicated. In some cases, 
> > the callback function cannot be call in rcu critical sections. For 
> > example, the callback may need to read xatter. For these cases, we
> > we cannot use RCU walk at all. 
> 
> I really think you should stop using the terms RCU walk and ref-walk.  I
> think they might be focusing your thinking in an unhelpful direction.

Thank you! I really appreciate you helping to shape this API and it
aligns a lot with my thinking.

> The key issue about reading xattrs is that it might need to sleep.
> Focusing on what might need to sleep and what will never need to sleep
> is a useful approach - the distinction is wide spread in the kernel and
> several function take a flag indicating if they are permitted to sleep,
> or if failure when sleeping would be required.
> 
> So your above observation is better described as 
> 
>    The vfs_walk_ancestors() API has an (implicit) requirement that the
>    callback mustn't sleep.  This is a problem for some use-cases
>    where the call back might need to sleep - e.g. for accessing xattrs.
> 
> That is a good and useful observation.  I can see three possibly
> responses:
> 
> 1/ Add a vfs_walk_ancestors_maysleep() API for which the callback is
>    always allowed to sleep.  I don't particularly like this approach.

Agreed.

> 
> 2/ Use repeated calls to vfs_walk_parent() when the handling of each
>    ancestor might need to sleep.  I see no problem with supporting both
>    vfs_walk_ancestors() and vfs_walk_parent().  There is plenty of
>    precedent for having different  interfaces for different use cases.

Meh.

> 
> 3/ Extend vfs_walk_ancestors() to pass a "may sleep" flag to the callback.

I think that's fine.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v5 bpf-next 0/5] bpf path iterator
  2025-07-07 10:46                       ` Christian Brauner
@ 2025-07-07 11:17                         ` Christian Brauner
  2025-07-07 18:50                           ` Song Liu
  0 siblings, 1 reply; 50+ messages in thread
From: Christian Brauner @ 2025-07-07 11:17 UTC (permalink / raw)
  To: NeilBrown
  Cc: Song Liu, Tingmao Wang, Mickaël Salaün, Song Liu,
	bpf@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, jack@suse.cz, kpsingh@kernel.org,
	mattbobrowski@google.com, Günther Noack

On Mon, Jul 07, 2025 at 12:46:41PM +0200, Christian Brauner wrote:
> On Fri, Jun 27, 2025 at 08:51:21AM +1000, NeilBrown wrote:
> > On Fri, 27 Jun 2025, Song Liu wrote:
> > > 
> > > 
> > > > On Jun 26, 2025, at 3:22 AM, NeilBrown <neil@brown.name> wrote:
> > > 
> > > [...]
> > > 
> > > >> I guess I misunderstood the proposal of vfs_walk_ancestors() 
> > > >> initially, so some clarification:
> > > >> 
> > > >> I think vfs_walk_ancestors() is good for the rcu-walk, and some 
> > > >> rcu-then-ref-walk. However, I don’t think it fits all use cases. 
> > > >> A reliable step-by-step ref-walk, like this set, works well with 
> > > >> BPF, and we want to keep it.
> > > > 
> > > > The distinction between rcu-walk and ref-walk is an internal
> > > > implementation detail.  You as a caller shouldn't need to think about
> > > > the difference.  You just want to walk.  Note that LOOKUP_RCU is
> > > > documented in namei.h as "semi-internal".  The only uses outside of
> > > > core-VFS code is in individual filesystem's d_revalidate handler - they
> > > > are checking if they are allowed to sleep or not.  You should never
> > > > expect to pass LOOKUP_RCU to an VFS API - no other code does.
> > > > 
> > > > It might be reasonable for you as a caller to have some control over
> > > > whether the call can sleep or not.  LOOKUP_CACHED is a bit like that.
> > > > But for dotdot lookup the code will never sleep - so that is not
> > > > relevant.
> > > 
> > > Unfortunately, the BPF use case is more complicated. In some cases, 
> > > the callback function cannot be call in rcu critical sections. For 
> > > example, the callback may need to read xatter. For these cases, we
> > > we cannot use RCU walk at all. 
> > 
> > I really think you should stop using the terms RCU walk and ref-walk.  I
> > think they might be focusing your thinking in an unhelpful direction.
> 
> Thank you! I really appreciate you helping to shape this API and it
> aligns a lot with my thinking.
> 
> > The key issue about reading xattrs is that it might need to sleep.
> > Focusing on what might need to sleep and what will never need to sleep
> > is a useful approach - the distinction is wide spread in the kernel and
> > several function take a flag indicating if they are permitted to sleep,
> > or if failure when sleeping would be required.
> > 
> > So your above observation is better described as 
> > 
> >    The vfs_walk_ancestors() API has an (implicit) requirement that the
> >    callback mustn't sleep.  This is a problem for some use-cases
> >    where the call back might need to sleep - e.g. for accessing xattrs.
> > 
> > That is a good and useful observation.  I can see three possibly
> > responses:
> > 
> > 1/ Add a vfs_walk_ancestors_maysleep() API for which the callback is
> >    always allowed to sleep.  I don't particularly like this approach.
> 
> Agreed.
> 
> > 
> > 2/ Use repeated calls to vfs_walk_parent() when the handling of each
> >    ancestor might need to sleep.  I see no problem with supporting both
> >    vfs_walk_ancestors() and vfs_walk_parent().  There is plenty of
> >    precedent for having different  interfaces for different use cases.
> 
> Meh.
> 
> > 
> > 3/ Extend vfs_walk_ancestors() to pass a "may sleep" flag to the callback.
> 
> I think that's fine.

Ok, sorry for the delay but there's a lot of different things going on
right now and this one isn't exactly an easy thing to solve.

I mentioned this before and so did Neil: the lookup implementation
supports two modes sleeping and non-sleeping. That api is abstracted
away as heavily as possible by the VFS so that non-core code will not be
exposed to it other than in exceptional circumstances and doesn't have
to care about it.

It is a conceptual dead-end to expose these two modes via separate APIs
and leak this implementation detail into non-core code. It will not
happen as far as I'm concerned.

I very much understand the urge to get the refcount step-by-step thing
merged asap. Everyone wants their APIs merged fast. And if it's
reasonable to move fast we will (see the kernfs xattr thing).

But here are two use-cases that ask for the same thing with different
constraints that closely mirror our unified approach. Merging one
quickly just to have something and then later bolting the other one on
top, augmenting, or replacing, possible having to deprecate the old API
is just objectively nuts. That's how we end up with a spaghetthi helper
collection. We want as little helper fragmentation as possible.

We need a unified API that serves both use-cases. I dislike
callback-based APIs generally but we have precedent in the VFS for this
for cases where the internal state handling is delicate enough that it
should not be exposed (see __iterate_supers() which does exactly work
like Neil suggested down to the flag argument itself I added).

So I'm open to the callback solution.

(Note for really absurd perf requirements you could even make it work
with static calls I'm pretty sure.)

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v5 bpf-next 1/5] namei: Introduce new helper function path_walk_parent()
  2025-07-06 23:54     ` Song Liu
@ 2025-07-07 17:53       ` Yonghong Song
  0 siblings, 0 replies; 50+ messages in thread
From: Yonghong Song @ 2025-07-07 17:53 UTC (permalink / raw)
  To: Song Liu
  Cc: Song Liu, bpf@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz,
	kpsingh@kernel.org, mattbobrowski@google.com, m@maowtm.org,
	neil@brown.name



On 7/6/25 4:54 PM, Song Liu wrote:
>
>> On Jul 4, 2025, at 10:40 AM, Yonghong Song <yonghong.song@linux.dev> wrote:
> [...]
>>> +static struct dentry *__path_walk_parent(struct path *path, const struct path *root, int flags)
>>>   {
>>> - struct dentry *parent;
>>> -
>>> - if (path_equal(&nd->path, &nd->root))
>>> + if (path_equal(path, root))
>>>    goto in_root;
>>> - if (unlikely(nd->path.dentry == nd->path.mnt->mnt_root)) {
>>> - struct path path;
>>> + if (unlikely(path->dentry == path->mnt->mnt_root)) {
>>> + struct path new_path;
>>>   - if (!choose_mountpoint(real_mount(nd->path.mnt),
>>> -       &nd->root, &path))
>>> + if (!choose_mountpoint(real_mount(path->mnt),
>>> +       root, &new_path))
>>>    goto in_root;
>>> - path_put(&nd->path);
>>> - nd->path = path;
>>> - nd->inode = path.dentry->d_inode;
>>> - if (unlikely(nd->flags & LOOKUP_NO_XDEV))
>>> + path_put(path);
>>> + *path = new_path;
>>> + if (unlikely(flags & LOOKUP_NO_XDEV))
>>>    return ERR_PTR(-EXDEV);
>>>    }
>>>    /* rare case of legitimate dget_parent()... */
>>> - parent = dget_parent(nd->path.dentry);
>>> + return dget_parent(path->dentry);
>> I have some confusion with this patch when crossing mount boundary.
>>
>> In d_path.c, we have
>>
>> static int __prepend_path(const struct dentry *dentry, const struct mount *mnt,
>>                           const struct path *root, struct prepend_buffer *p)
>> {
>>         while (dentry != root->dentry || &mnt->mnt != root->mnt) {
>>                 const struct dentry *parent = READ_ONCE(dentry->d_parent);
>>
>>                 if (dentry == mnt->mnt.mnt_root) {
>>                         struct mount *m = READ_ONCE(mnt->mnt_parent);
>>                         struct mnt_namespace *mnt_ns;
>>
>>                         if (likely(mnt != m)) {
>>                                 dentry = READ_ONCE(mnt->mnt_mountpoint);
>>                                 mnt = m;
>>                                 continue;
>>                         }
>>                         /* Global root */
>>                         mnt_ns = READ_ONCE(mnt->mnt_ns);
>>                         /* open-coded is_mounted() to use local mnt_ns */
>>                         if (!IS_ERR_OR_NULL(mnt_ns) && !is_anon_ns(mnt_ns))
>>                                 return 1;       // absolute root
>>                         else
>>                                 return 2;       // detached or not attached yet
>>                 }
>>
>>                 if (unlikely(dentry == parent))
>>                         /* Escaped? */
>>                         return 3;
>>
>>                 prefetch(parent);
>>                 if (!prepend_name(p, &dentry->d_name))
>>                         break;
>>                 dentry = parent;
>>         }
>>         return 0;
>> }
>>
>> At the mount boundary and not at root mount, the code has
>> dentry = READ_ONCE(mnt->mnt_mountpoint);
>> mnt = m; /* 'mnt' will be parent mount */
>> continue;
>>
>> After that, we have
>> const struct dentry *parent = READ_ONCE(dentry->d_parent);
>> if (dentry == mnt->mnt.mnt_root) {
>> /* assume this is false */
>> }
>> ...
>> prefetch(parent);
>>         if (!prepend_name(p, &dentry->d_name))
>>                 break;
>>         dentry = parent;
>>
>> So the prepend_name(p, &dentry->d_name) is actually from mnt->mnt_mountpoint.
> I am not quite following the question. In the code below:
>
>                 if (dentry == mnt->mnt.mnt_root) {
>                         struct mount *m = READ_ONCE(mnt->mnt_parent);
>                         struct mnt_namespace *mnt_ns;
>
>                         if (likely(mnt != m)) {
>                                 dentry = READ_ONCE(mnt->mnt_mountpoint);
>                                 mnt = m;
>                                 continue;
> /* We either continue, here */
>
>                         }
>                         /* Global root */
>                         mnt_ns = READ_ONCE(mnt->mnt_ns);
>                         /* open-coded is_mounted() to use local mnt_ns */
>                         if (!IS_ERR_OR_NULL(mnt_ns) && !is_anon_ns(mnt_ns))
>                                 return 1;       // absolute root
>                         else
>                                 return 2;       // detached or not attached yet
> /* Or return here */
>                 }
>
> So we will not hit prepend_name(). Does this answer the
> question?
>
>> In your above code, maybe we should return path->dentry in the below if statement?
>>
>>         if (unlikely(path->dentry == path->mnt->mnt_root)) {
>>                 struct path new_path;
>>
>>                 if (!choose_mountpoint(real_mount(path->mnt),
>>                                        root, &new_path))
>>                         goto in_root;
>>                 path_put(path);
>>                 *path = new_path;
>>                 if (unlikely(flags & LOOKUP_NO_XDEV))
>>                         return ERR_PTR(-EXDEV);
>> + return path->dentry;
>>         }
>>         /* rare case of legitimate dget_parent()... */
>>         return dget_parent(path->dentry);
>>
>> Also, could you add some selftests cross mount points? This will
>> have more coverages with __path_walk_parent().

Looks like __path_walk_parent() works for the root of mounted fs.
If this is the case, the implementation is correct. It could be
good to add some comments to clarify.

> Yeah, I will try to add more tests in the next revision.
>
> Thanks,
> Song
>


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v5 bpf-next 0/5] bpf path iterator
  2025-07-07 11:17                         ` Christian Brauner
@ 2025-07-07 18:50                           ` Song Liu
  2025-07-09 16:06                             ` Mickaël Salaün
  2025-07-09 22:14                             ` NeilBrown
  0 siblings, 2 replies; 50+ messages in thread
From: Song Liu @ 2025-07-07 18:50 UTC (permalink / raw)
  To: Christian Brauner
  Cc: NeilBrown, Tingmao Wang, Mickaël Salaün, Song Liu,
	bpf@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, jack@suse.cz, kpsingh@kernel.org,
	mattbobrowski@google.com, Günther Noack

Hi Christian, 

Thanks for your comments! 

> On Jul 7, 2025, at 4:17 AM, Christian Brauner <brauner@kernel.org> wrote:

[...]

>>> 3/ Extend vfs_walk_ancestors() to pass a "may sleep" flag to the callback.
>> 
>> I think that's fine.
> 
> Ok, sorry for the delay but there's a lot of different things going on
> right now and this one isn't exactly an easy thing to solve.
> 
> I mentioned this before and so did Neil: the lookup implementation
> supports two modes sleeping and non-sleeping. That api is abstracted
> away as heavily as possible by the VFS so that non-core code will not be
> exposed to it other than in exceptional circumstances and doesn't have
> to care about it.
> 
> It is a conceptual dead-end to expose these two modes via separate APIs
> and leak this implementation detail into non-core code. It will not
> happen as far as I'm concerned.
> 
> I very much understand the urge to get the refcount step-by-step thing
> merged asap. Everyone wants their APIs merged fast. And if it's
> reasonable to move fast we will (see the kernfs xattr thing).
> 
> But here are two use-cases that ask for the same thing with different
> constraints that closely mirror our unified approach. Merging one
> quickly just to have something and then later bolting the other one on
> top, augmenting, or replacing, possible having to deprecate the old API
> is just objectively nuts. That's how we end up with a spaghetthi helper
> collection. We want as little helper fragmentation as possible.
> 
> We need a unified API that serves both use-cases. I dislike
> callback-based APIs generally but we have precedent in the VFS for this
> for cases where the internal state handling is delicate enough that it
> should not be exposed (see __iterate_supers() which does exactly work
> like Neil suggested down to the flag argument itself I added).
> 
> So I'm open to the callback solution.
> 
> (Note for really absurd perf requirements you could even make it work
> with static calls I'm pretty sure.)

I guess we will go with Mickaël’s idea:

> int vfs_walk_ancestors(struct path *path,
>                       bool (*walk_cb)(const struct path *ancestor, void *data),
>                       void *data, int flags)
> 
> The walk continue while walk_cb() returns true.  walk_cb() can then
> check if @ancestor is equal to a @root, or other properties.  The
> walk_cb() return value (if not bool) should not be returned by
> vfs_walk_ancestors() because a walk stop doesn't mean an error.

If necessary, we hide “root" inside @data. This is good. 

> @path would be updated with latest ancestor path (e.g. @root).

Update @path to the last ancestor and hold proper references. 
I missed this part earlier. With this feature, vfs_walk_ancestors 
should work usable with open-codeed bpf path iterator. 

I have a question about this behavior with RCU walk. IIUC, RCU 
walk does not hold reference to @ancestor when calling walk_cb().
If walk_cb() returns false, shall vfs_walk_ancestors() then
grab a reference on @ancestor? This feels a bit weird to me. 
Maybe “updating @path to the last ancestor” should only apply to
LOOKUP_RCU==false case? 

> @flags could contain LOOKUP_RCU or not, which enables us to have
> walk_cb() not-RCU compatible.
> 
> When passing LOOKUP_RCU, if the first call to vfs_walk_ancestors()
> failed with -ECHILD, the caller can restart the walk by calling
> vfs_walk_ancestors() again but without LOOKUP_RCU.


Given we want callers to handle -ECHILD and call vfs_walk_ancestors
again without LOOKUP_RCU, I think we should keep @path not changed
With LOOKUP_RCU==true, and only update it to the last ancestor 
when LOOKUP_RCU==false. 

With this behavior, landlock code will be like:


/* Assume we hold reference on “path”. 
 * With LOOKUP_RCU, path will not change, we don’t need 
 * extra reference on “path”.
 */
err = vfs_walk_ancestors(path, ll_cb, data, LOOKUP_RCU);
/* 
 * At this point, whether err is 0 or not, path is not 
 * changed.
 */

if (err == -ECHILD) {
	struct path walk_path = *path;

	/* reset any data changed by the walk */
	reset_data(data);

	/* get a reference on walk_path. */
	path_get(&walk_path);

	err = vfs_walk_ancestors(&walk_path, ll_cb, data, 0);
	/* Now, walk_path might be updated */

	/* Always release reference on walk_path */
	path_put(&walk_path);
}


BPF path iterator sode will look like:

static bool bpf_cb(const struct path *ancestor, void *data)
{
	return false;
}

struct path *bpf_iter_path_next(struct bpf_iter_path *it)
{
	struct bpf_iter_path_kern *kit = (void *)it;

	if (vfs_walk_ancestors(&kit->path, bpf_cb, NULL))
		return NULL;
	return &kit->path;
}


Does this sound reasonable to every body?

Thanks,
Song


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v5 bpf-next 0/5] bpf path iterator
  2025-07-07 18:50                           ` Song Liu
@ 2025-07-09 16:06                             ` Mickaël Salaün
  2025-07-09 17:31                               ` Song Liu
  2025-07-09 22:14                             ` NeilBrown
  1 sibling, 1 reply; 50+ messages in thread
From: Mickaël Salaün @ 2025-07-09 16:06 UTC (permalink / raw)
  To: Song Liu
  Cc: Christian Brauner, NeilBrown, Tingmao Wang, Song Liu,
	bpf@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, jack@suse.cz, kpsingh@kernel.org,
	mattbobrowski@google.com, Günther Noack, Jann Horn

On Mon, Jul 07, 2025 at 06:50:12PM +0000, Song Liu wrote:
> Hi Christian, 
> 
> Thanks for your comments! 
> 
> > On Jul 7, 2025, at 4:17 AM, Christian Brauner <brauner@kernel.org> wrote:
> 
> [...]
> 
> >>> 3/ Extend vfs_walk_ancestors() to pass a "may sleep" flag to the callback.
> >> 
> >> I think that's fine.
> > 
> > Ok, sorry for the delay but there's a lot of different things going on
> > right now and this one isn't exactly an easy thing to solve.
> > 
> > I mentioned this before and so did Neil: the lookup implementation
> > supports two modes sleeping and non-sleeping. That api is abstracted
> > away as heavily as possible by the VFS so that non-core code will not be
> > exposed to it other than in exceptional circumstances and doesn't have
> > to care about it.
> > 
> > It is a conceptual dead-end to expose these two modes via separate APIs
> > and leak this implementation detail into non-core code. It will not
> > happen as far as I'm concerned.
> > 
> > I very much understand the urge to get the refcount step-by-step thing
> > merged asap. Everyone wants their APIs merged fast. And if it's
> > reasonable to move fast we will (see the kernfs xattr thing).
> > 
> > But here are two use-cases that ask for the same thing with different
> > constraints that closely mirror our unified approach. Merging one
> > quickly just to have something and then later bolting the other one on
> > top, augmenting, or replacing, possible having to deprecate the old API
> > is just objectively nuts. That's how we end up with a spaghetthi helper
> > collection. We want as little helper fragmentation as possible.
> > 
> > We need a unified API that serves both use-cases. I dislike
> > callback-based APIs generally but we have precedent in the VFS for this
> > for cases where the internal state handling is delicate enough that it
> > should not be exposed (see __iterate_supers() which does exactly work
> > like Neil suggested down to the flag argument itself I added).
> > 
> > So I'm open to the callback solution.
> > 
> > (Note for really absurd perf requirements you could even make it work
> > with static calls I'm pretty sure.)
> 
> I guess we will go with Mickaël’s idea:
> 
> > int vfs_walk_ancestors(struct path *path,
> >                       bool (*walk_cb)(const struct path *ancestor, void *data),

> >                       void *data, int flags)
> > 
> > The walk continue while walk_cb() returns true.  walk_cb() can then
> > check if @ancestor is equal to a @root, or other properties.  The
> > walk_cb() return value (if not bool) should not be returned by
> > vfs_walk_ancestors() because a walk stop doesn't mean an error.
> 
> If necessary, we hide “root" inside @data. This is good. 
> 
> > @path would be updated with latest ancestor path (e.g. @root).
> 
> Update @path to the last ancestor and hold proper references. 
> I missed this part earlier. With this feature, vfs_walk_ancestors 
> should work usable with open-codeed bpf path iterator. 
> 
> I have a question about this behavior with RCU walk. IIUC, RCU 
> walk does not hold reference to @ancestor when calling walk_cb().

I think a reference to the mount should be held, but not necessarily to
the dentry if we are still in the same mount as the original path.

> If walk_cb() returns false, shall vfs_walk_ancestors() then
> grab a reference on @ancestor? This feels a bit weird to me. 

If walk_cb() checks for a root, it will return false when the path will
match, and the caller would expect to get this root path, right?

In general, it's safer to always have the same behavior when holding or
releasing a reference.  I think the caller should then always call
path_put() after vfs_walk_ancestors() whatever the return code is.

> Maybe “updating @path to the last ancestor” should only apply to
> LOOKUP_RCU==false case? 
> 
> > @flags could contain LOOKUP_RCU or not, which enables us to have
> > walk_cb() not-RCU compatible.
> > 
> > When passing LOOKUP_RCU, if the first call to vfs_walk_ancestors()
> > failed with -ECHILD, the caller can restart the walk by calling
> > vfs_walk_ancestors() again but without LOOKUP_RCU.
> 
> 
> Given we want callers to handle -ECHILD and call vfs_walk_ancestors
> again without LOOKUP_RCU, I think we should keep @path not changed
> With LOOKUP_RCU==true, and only update it to the last ancestor 
> when LOOKUP_RCU==false. 

As Neil said, we don't want to explicitly pass LOOKUP_RCU as a public
flag.  Instead, walk_cb() should never sleep (and then potentially be
called under RCU by the vfs_walk_ancestors() implementation).

> 
> With this behavior, landlock code will be like:
> 
> 
> /* Assume we hold reference on “path”. 
>  * With LOOKUP_RCU, path will not change, we don’t need 
>  * extra reference on “path”.
>  */
> err = vfs_walk_ancestors(path, ll_cb, data, LOOKUP_RCU);
> /* 
>  * At this point, whether err is 0 or not, path is not 
>  * changed.
>  */
> 
> if (err == -ECHILD) {
> 	struct path walk_path = *path;
> 
> 	/* reset any data changed by the walk */
> 	reset_data(data);
> 
> 	/* get a reference on walk_path. */
> 	path_get(&walk_path);
> 
> 	err = vfs_walk_ancestors(&walk_path, ll_cb, data, 0);
> 	/* Now, walk_path might be updated */
> 
> 	/* Always release reference on walk_path */
> 	path_put(&walk_path);
> }
> 
> 
> BPF path iterator sode will look like:
> 
> static bool bpf_cb(const struct path *ancestor, void *data)
> {
> 	return false;
> }

Instead of this callback, we could just always return if walk_cb is
NULL.

> 
> struct path *bpf_iter_path_next(struct bpf_iter_path *it)
> {
> 	struct bpf_iter_path_kern *kit = (void *)it;
> 
> 	if (vfs_walk_ancestors(&kit->path, bpf_cb, NULL))
> 		return NULL;
> 	return &kit->path;
> }
> 
> 
> Does this sound reasonable to every body?
> 
> Thanks,
> Song
> 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v5 bpf-next 0/5] bpf path iterator
  2025-07-09 16:06                             ` Mickaël Salaün
@ 2025-07-09 17:31                               ` Song Liu
  2025-07-09 22:24                                 ` NeilBrown
  0 siblings, 1 reply; 50+ messages in thread
From: Song Liu @ 2025-07-09 17:31 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Christian Brauner, NeilBrown, Tingmao Wang, Song Liu,
	bpf@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, jack@suse.cz, kpsingh@kernel.org,
	mattbobrowski@google.com, Günther Noack, Jann Horn



> On Jul 9, 2025, at 9:06 AM, Mickaël Salaün <mic@digikod.net> wrote:\

[...]

>> If necessary, we hide “root" inside @data. This is good. 
>> 
>>> @path would be updated with latest ancestor path (e.g. @root).
>> 
>> Update @path to the last ancestor and hold proper references. 
>> I missed this part earlier. With this feature, vfs_walk_ancestors 
>> should work usable with open-codeed bpf path iterator. 
>> 
>> I have a question about this behavior with RCU walk. IIUC, RCU 
>> walk does not hold reference to @ancestor when calling walk_cb().
> 
> I think a reference to the mount should be held, but not necessarily to
> the dentry if we are still in the same mount as the original path.

If we update @path and do path_put() after the walk, we have to hold 
reference to both the mnt and the dentry, no? 

> 
>> If walk_cb() returns false, shall vfs_walk_ancestors() then
>> grab a reference on @ancestor? This feels a bit weird to me.
> 
> If walk_cb() checks for a root, it will return false when the path will
> match, and the caller would expect to get this root path, right?

If the user want to walk to the global root, walk_cb() may not 
return false at all, IIUC. walk_cb() may also return false on 
other conditions. 

> 
> In general, it's safer to always have the same behavior when holding or
> releasing a reference.  I think the caller should then always call
> path_put() after vfs_walk_ancestors() whatever the return code is.
> 
>> Maybe “updating @path to the last ancestor” should only apply to
>> LOOKUP_RCU==false case? 
>> 
>>> @flags could contain LOOKUP_RCU or not, which enables us to have
>>> walk_cb() not-RCU compatible.
>>> 
>>> When passing LOOKUP_RCU, if the first call to vfs_walk_ancestors()
>>> failed with -ECHILD, the caller can restart the walk by calling
>>> vfs_walk_ancestors() again but without LOOKUP_RCU.
>> 
>> 
>> Given we want callers to handle -ECHILD and call vfs_walk_ancestors
>> again without LOOKUP_RCU, I think we should keep @path not changed
>> With LOOKUP_RCU==true, and only update it to the last ancestor 
>> when LOOKUP_RCU==false.
> 
> As Neil said, we don't want to explicitly pass LOOKUP_RCU as a public
> flag.  Instead, walk_cb() should never sleep (and then potentially be
> called under RCU by the vfs_walk_ancestors() implementation).

How should the user handle -ECHILD without LOOKUP_RCU flag? Say the
following code in landlocked:

/* Try RCU walk first */
err = vfs_walk_ancestors(path, ll_cb, data, LOOKUP_RCU);

if (err == -ECHILD) {
	struct path walk_path = *path;

	/* reset any data changed by the walk */
	reset_data(data);
	
	/* now do ref-walk */
	err = vfs_walk_ancestors(&walk_path, ll_cb, data, 0);
}

Or do you mean vfs_walk_ancestors will never return -ECHILD?
Then we need vfs_walk_ancestors to call reset_data logic, right?

Thanks,
Song



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v5 bpf-next 0/5] bpf path iterator
  2025-07-07 18:50                           ` Song Liu
  2025-07-09 16:06                             ` Mickaël Salaün
@ 2025-07-09 22:14                             ` NeilBrown
  2025-07-09 22:41                               ` Song Liu
  1 sibling, 1 reply; 50+ messages in thread
From: NeilBrown @ 2025-07-09 22:14 UTC (permalink / raw)
  To: Song Liu
  Cc: Christian Brauner, Tingmao Wang, Mickaël Salaün,
	Song Liu, bpf@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, jack@suse.cz, kpsingh@kernel.org,
	mattbobrowski@google.com, Günther Noack

On Tue, 08 Jul 2025, Song Liu wrote:
> Hi Christian, 
> 
> Thanks for your comments! 
> 
> > On Jul 7, 2025, at 4:17 AM, Christian Brauner <brauner@kernel.org> wrote:
> 
> [...]
> 
> >>> 3/ Extend vfs_walk_ancestors() to pass a "may sleep" flag to the callback.
> >> 
> >> I think that's fine.
> > 
> > Ok, sorry for the delay but there's a lot of different things going on
> > right now and this one isn't exactly an easy thing to solve.
> > 
> > I mentioned this before and so did Neil: the lookup implementation
> > supports two modes sleeping and non-sleeping. That api is abstracted
> > away as heavily as possible by the VFS so that non-core code will not be
> > exposed to it other than in exceptional circumstances and doesn't have
> > to care about it.
> > 
> > It is a conceptual dead-end to expose these two modes via separate APIs
> > and leak this implementation detail into non-core code. It will not
> > happen as far as I'm concerned.
> > 
> > I very much understand the urge to get the refcount step-by-step thing
> > merged asap. Everyone wants their APIs merged fast. And if it's
> > reasonable to move fast we will (see the kernfs xattr thing).
> > 
> > But here are two use-cases that ask for the same thing with different
> > constraints that closely mirror our unified approach. Merging one
> > quickly just to have something and then later bolting the other one on
> > top, augmenting, or replacing, possible having to deprecate the old API
> > is just objectively nuts. That's how we end up with a spaghetthi helper
> > collection. We want as little helper fragmentation as possible.
> > 
> > We need a unified API that serves both use-cases. I dislike
> > callback-based APIs generally but we have precedent in the VFS for this
> > for cases where the internal state handling is delicate enough that it
> > should not be exposed (see __iterate_supers() which does exactly work
> > like Neil suggested down to the flag argument itself I added).
> > 
> > So I'm open to the callback solution.
> > 
> > (Note for really absurd perf requirements you could even make it work
> > with static calls I'm pretty sure.)
> 
> I guess we will go with Mickaël’s idea:
> 
> > int vfs_walk_ancestors(struct path *path,
> >                       bool (*walk_cb)(const struct path *ancestor, void *data),
> >                       void *data, int flags)
> > 
> > The walk continue while walk_cb() returns true.  walk_cb() can then
> > check if @ancestor is equal to a @root, or other properties.  The
> > walk_cb() return value (if not bool) should not be returned by
> > vfs_walk_ancestors() because a walk stop doesn't mean an error.
> 
> If necessary, we hide “root" inside @data. This is good. 
> 
> > @path would be updated with latest ancestor path (e.g. @root).
> 
> Update @path to the last ancestor and hold proper references. 
> I missed this part earlier. With this feature, vfs_walk_ancestors 
> should work usable with open-codeed bpf path iterator. 

I don't think path should be updated.  That adds complexity which might
not be needed.  The original (landlock) requirements were only to look
at each ancestor, not to take a reference to any of them.

If the caller needs a reference to any of the ancestors I think that
walk_cb() needs to take that reference and store it in data.
Note that attempting to take the reference might fail.  See
legitimize_path() in fs/namei.c.

It isn't yet clear to me what would be a good API for requesting the
reference.
One option would be for vfs_walk_ancestors() to pass another void* to
walk_cb(), and it passed it on to vfs_legitimize_path() which extracts
the seq numbers from there.
Another might be that the path passed to walk_cb is always
nameidata.path, and so when that is passed to vfs_legitimize_path() path
it can use container_of() to find the seq numbers.

If vfs_legitimize_path() fail, walk_cb() might want to ask for the walk
to be restarted.

> 
> I have a question about this behavior with RCU walk. IIUC, RCU 
> walk does not hold reference to @ancestor when calling walk_cb().
> If walk_cb() returns false, shall vfs_walk_ancestors() then
> grab a reference on @ancestor? This feels a bit weird to me. 
> Maybe “updating @path to the last ancestor” should only apply to
> LOOKUP_RCU==false case? 
> 
> > @flags could contain LOOKUP_RCU or not, which enables us to have
> > walk_cb() not-RCU compatible.
> > 
> > When passing LOOKUP_RCU, if the first call to vfs_walk_ancestors()
> > failed with -ECHILD, the caller can restart the walk by calling
> > vfs_walk_ancestors() again but without LOOKUP_RCU.
> 
> 
> Given we want callers to handle -ECHILD and call vfs_walk_ancestors
> again without LOOKUP_RCU, I think we should keep @path not changed
> With LOOKUP_RCU==true, and only update it to the last ancestor 
> when LOOKUP_RCU==false. 

No, we really don't want to pass a LOOKUP_RCU() flag to
vfs_walk_ancestors().
vfs_walk_ancestors() might choose to pass that flag to walk_cb().

NeilBrown

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v5 bpf-next 0/5] bpf path iterator
  2025-07-09 17:31                               ` Song Liu
@ 2025-07-09 22:24                                 ` NeilBrown
  2025-07-09 22:50                                   ` Song Liu
  0 siblings, 1 reply; 50+ messages in thread
From: NeilBrown @ 2025-07-09 22:24 UTC (permalink / raw)
  To: Song Liu
  Cc: Mickaël Salaün, Christian Brauner, Tingmao Wang,
	Song Liu, bpf@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, jack@suse.cz, kpsingh@kernel.org,
	mattbobrowski@google.com, Günther Noack, Jann Horn

On Thu, 10 Jul 2025, Song Liu wrote:
> 
> 
> > On Jul 9, 2025, at 9:06 AM, Mickaël Salaün <mic@digikod.net> wrote:\
> 
> [...]
> 
> >> If necessary, we hide “root" inside @data. This is good. 
> >> 
> >>> @path would be updated with latest ancestor path (e.g. @root).
> >> 
> >> Update @path to the last ancestor and hold proper references. 
> >> I missed this part earlier. With this feature, vfs_walk_ancestors 
> >> should work usable with open-codeed bpf path iterator. 
> >> 
> >> I have a question about this behavior with RCU walk. IIUC, RCU 
> >> walk does not hold reference to @ancestor when calling walk_cb().
> > 
> > I think a reference to the mount should be held, but not necessarily to
> > the dentry if we are still in the same mount as the original path.
> 
> If we update @path and do path_put() after the walk, we have to hold 
> reference to both the mnt and the dentry, no? 
> 
> > 
> >> If walk_cb() returns false, shall vfs_walk_ancestors() then
> >> grab a reference on @ancestor? This feels a bit weird to me.
> > 
> > If walk_cb() checks for a root, it will return false when the path will
> > match, and the caller would expect to get this root path, right?
> 
> If the user want to walk to the global root, walk_cb() may not 
> return false at all, IIUC. walk_cb() may also return false on 
> other conditions. 
> 
> > 
> > In general, it's safer to always have the same behavior when holding or
> > releasing a reference.  I think the caller should then always call
> > path_put() after vfs_walk_ancestors() whatever the return code is.
> > 
> >> Maybe “updating @path to the last ancestor” should only apply to
> >> LOOKUP_RCU==false case? 
> >> 
> >>> @flags could contain LOOKUP_RCU or not, which enables us to have
> >>> walk_cb() not-RCU compatible.
> >>> 
> >>> When passing LOOKUP_RCU, if the first call to vfs_walk_ancestors()
> >>> failed with -ECHILD, the caller can restart the walk by calling
> >>> vfs_walk_ancestors() again but without LOOKUP_RCU.
> >> 
> >> 
> >> Given we want callers to handle -ECHILD and call vfs_walk_ancestors
> >> again without LOOKUP_RCU, I think we should keep @path not changed
> >> With LOOKUP_RCU==true, and only update it to the last ancestor 
> >> when LOOKUP_RCU==false.
> > 
> > As Neil said, we don't want to explicitly pass LOOKUP_RCU as a public
> > flag.  Instead, walk_cb() should never sleep (and then potentially be
> > called under RCU by the vfs_walk_ancestors() implementation).
> 
> How should the user handle -ECHILD without LOOKUP_RCU flag? Say the
> following code in landlocked:
> 
> /* Try RCU walk first */
> err = vfs_walk_ancestors(path, ll_cb, data, LOOKUP_RCU);
> 
> if (err == -ECHILD) {
> 	struct path walk_path = *path;
> 
> 	/* reset any data changed by the walk */
> 	reset_data(data);
> 	
> 	/* now do ref-walk */
> 	err = vfs_walk_ancestors(&walk_path, ll_cb, data, 0);
> }
> 
> Or do you mean vfs_walk_ancestors will never return -ECHILD?
> Then we need vfs_walk_ancestors to call reset_data logic, right?

It isn't clear to me that vfs_walk_ancestors() needs to return anything.
All the communication happens through walk_cb()

walk_cb() is called with a path, the data, and a "may_sleep" flag.
If it needs to sleep but may_sleep is not set, it returns "-ECHILD"
which causes the walk to restart and use refcounts.
If it wants to stop, it returns 0.
If it wants to continue, it returns 1.
If it wants a reference to the path then it can use (new)
vfs_legitimize_path() which might fail.
If it wants a reference to the path and may_sleep is true, it can use
path_get() which won't fail.

When returning -ECHILD (either because of a need to sleep or because
vfs_legitimize_path() fails), walk_cb() would reset_data().

NeilBrown

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v5 bpf-next 0/5] bpf path iterator
  2025-07-09 22:14                             ` NeilBrown
@ 2025-07-09 22:41                               ` Song Liu
  2025-07-10  0:58                                 ` NeilBrown
  0 siblings, 1 reply; 50+ messages in thread
From: Song Liu @ 2025-07-09 22:41 UTC (permalink / raw)
  To: NeilBrown
  Cc: Christian Brauner, Tingmao Wang, Mickaël Salaün,
	Song Liu, bpf@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, jack@suse.cz, kpsingh@kernel.org,
	mattbobrowski@google.com, Günther Noack



> On Jul 9, 2025, at 3:14 PM, NeilBrown <neil@brown.name> wrote:
> 
> On Tue, 08 Jul 2025, Song Liu wrote:
>> Hi Christian, 
>> 
>> Thanks for your comments! 
>> 
>>> On Jul 7, 2025, at 4:17 AM, Christian Brauner <brauner@kernel.org> wrote:
>> 
>> [...]
>> 
>>>>> 3/ Extend vfs_walk_ancestors() to pass a "may sleep" flag to the callback.
>>>> 
>>>> I think that's fine.
>>> 
>>> Ok, sorry for the delay but there's a lot of different things going on
>>> right now and this one isn't exactly an easy thing to solve.
>>> 
>>> I mentioned this before and so did Neil: the lookup implementation
>>> supports two modes sleeping and non-sleeping. That api is abstracted
>>> away as heavily as possible by the VFS so that non-core code will not be
>>> exposed to it other than in exceptional circumstances and doesn't have
>>> to care about it.
>>> 
>>> It is a conceptual dead-end to expose these two modes via separate APIs
>>> and leak this implementation detail into non-core code. It will not
>>> happen as far as I'm concerned.
>>> 
>>> I very much understand the urge to get the refcount step-by-step thing
>>> merged asap. Everyone wants their APIs merged fast. And if it's
>>> reasonable to move fast we will (see the kernfs xattr thing).
>>> 
>>> But here are two use-cases that ask for the same thing with different
>>> constraints that closely mirror our unified approach. Merging one
>>> quickly just to have something and then later bolting the other one on
>>> top, augmenting, or replacing, possible having to deprecate the old API
>>> is just objectively nuts. That's how we end up with a spaghetthi helper
>>> collection. We want as little helper fragmentation as possible.
>>> 
>>> We need a unified API that serves both use-cases. I dislike
>>> callback-based APIs generally but we have precedent in the VFS for this
>>> for cases where the internal state handling is delicate enough that it
>>> should not be exposed (see __iterate_supers() which does exactly work
>>> like Neil suggested down to the flag argument itself I added).
>>> 
>>> So I'm open to the callback solution.
>>> 
>>> (Note for really absurd perf requirements you could even make it work
>>> with static calls I'm pretty sure.)
>> 
>> I guess we will go with Mickaël’s idea:
>> 
>>> int vfs_walk_ancestors(struct path *path,
>>>                      bool (*walk_cb)(const struct path *ancestor, void *data),
>>>                      void *data, int flags)
>>> 
>>> The walk continue while walk_cb() returns true.  walk_cb() can then
>>> check if @ancestor is equal to a @root, or other properties.  The
>>> walk_cb() return value (if not bool) should not be returned by
>>> vfs_walk_ancestors() because a walk stop doesn't mean an error.
>> 
>> If necessary, we hide “root" inside @data. This is good. 
>> 
>>> @path would be updated with latest ancestor path (e.g. @root).
>> 
>> Update @path to the last ancestor and hold proper references. 
>> I missed this part earlier. With this feature, vfs_walk_ancestors 
>> should work usable with open-codeed bpf path iterator.
> 
> I don't think path should be updated.  That adds complexity which might
> not be needed.  The original (landlock) requirements were only to look
> at each ancestor, not to take a reference to any of them.

I think this is the ideal case that landlock wants in the long term. 
But we may need to take references when the attempt fails. Also, 
current landlock code takes reference at each step. 

> If the caller needs a reference to any of the ancestors I think that
> walk_cb() needs to take that reference and store it in data.
> Note that attempting to take the reference might fail.  See
> legitimize_path() in fs/namei.c.
> 
> It isn't yet clear to me what would be a good API for requesting the
> reference.
> One option would be for vfs_walk_ancestors() to pass another void* to
> walk_cb(), and it passed it on to vfs_legitimize_path() which extracts
> the seq numbers from there.
> Another might be that the path passed to walk_cb is always
> nameidata.path, and so when that is passed to vfs_legitimize_path() path
> it can use container_of() to find the seq numbers.

Letting walk_cb() call vfs_legitimize_path() seems suboptimal to me. 
I think the original goal is to have vfs_walk_ancestors() to:
  1. Try to walk the ancestors without taking any references;
  2. Detect when the not-taking-reference walk is not reliable;
  3. Maybe, retry the walk from beginning, but takes references on 
     each step. 

With walk_cb() calling vfs_legitimize_path(), we are moving #2 above 
to walk_cb(). I think this is not what we want? 

> If vfs_legitimize_path() fail, walk_cb() might want to ask for the walk
> to be restarted.
> 
>> 
>> I have a question about this behavior with RCU walk. IIUC, RCU 
>> walk does not hold reference to @ancestor when calling walk_cb().
>> If walk_cb() returns false, shall vfs_walk_ancestors() then
>> grab a reference on @ancestor? This feels a bit weird to me. 
>> Maybe “updating @path to the last ancestor” should only apply to
>> LOOKUP_RCU==false case? 
>> 
>>> @flags could contain LOOKUP_RCU or not, which enables us to have
>>> walk_cb() not-RCU compatible.
>>> 
>>> When passing LOOKUP_RCU, if the first call to vfs_walk_ancestors()
>>> failed with -ECHILD, the caller can restart the walk by calling
>>> vfs_walk_ancestors() again but without LOOKUP_RCU.
>> 
>> 
>> Given we want callers to handle -ECHILD and call vfs_walk_ancestors
>> again without LOOKUP_RCU, I think we should keep @path not changed
>> With LOOKUP_RCU==true, and only update it to the last ancestor 
>> when LOOKUP_RCU==false.
> 
> No, we really don't want to pass a LOOKUP_RCU() flag to
> vfs_walk_ancestors().
> vfs_walk_ancestors() might choose to pass that flag to walk_cb().

In this case, we will need vfs_walk_ancestors to handle #3 above. 

Thanks,
Song


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v5 bpf-next 0/5] bpf path iterator
  2025-07-09 22:24                                 ` NeilBrown
@ 2025-07-09 22:50                                   ` Song Liu
  2025-07-10  0:58                                     ` NeilBrown
  0 siblings, 1 reply; 50+ messages in thread
From: Song Liu @ 2025-07-09 22:50 UTC (permalink / raw)
  To: NeilBrown
  Cc: Mickaël Salaün, Christian Brauner, Tingmao Wang,
	Song Liu, bpf@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, jack@suse.cz, kpsingh@kernel.org,
	mattbobrowski@google.com, Günther Noack, Jann Horn



> On Jul 9, 2025, at 3:24 PM, NeilBrown <neil@brown.name> wrote:
[...]
>> 
>> How should the user handle -ECHILD without LOOKUP_RCU flag? Say the
>> following code in landlocked:
>> 
>> /* Try RCU walk first */
>> err = vfs_walk_ancestors(path, ll_cb, data, LOOKUP_RCU);
>> 
>> if (err == -ECHILD) {
>> struct path walk_path = *path;
>> 
>> /* reset any data changed by the walk */
>> reset_data(data);
>> 
>> /* now do ref-walk */
>> err = vfs_walk_ancestors(&walk_path, ll_cb, data, 0);
>> }
>> 
>> Or do you mean vfs_walk_ancestors will never return -ECHILD?
>> Then we need vfs_walk_ancestors to call reset_data logic, right?
> 
> It isn't clear to me that vfs_walk_ancestors() needs to return anything.
> All the communication happens through walk_cb()
> 
> walk_cb() is called with a path, the data, and a "may_sleep" flag.
> If it needs to sleep but may_sleep is not set, it returns "-ECHILD"
> which causes the walk to restart and use refcounts.
> If it wants to stop, it returns 0.
> If it wants to continue, it returns 1.
> If it wants a reference to the path then it can use (new)
> vfs_legitimize_path() which might fail.
> If it wants a reference to the path and may_sleep is true, it can use
> path_get() which won't fail.
> 
> When returning -ECHILD (either because of a need to sleep or because
> vfs_legitimize_path() fails), walk_cb() would reset_data().

This might actually work. 

My only concern is with vfs_legitimize_path. It is probably safer if 
we only allow taking references with may_sleep==true, so that path_get
won’t fail. In this case, we will not need walk_cb() to call 
vfs_legitimize_path. If the user want a reference, the walk_cb will 
first return -ECHILD, and call path_get when may_sleep is true. 

Does this make sense? Did I miss any cases? 

Thanks,
Song


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v5 bpf-next 0/5] bpf path iterator
  2025-07-09 22:41                               ` Song Liu
@ 2025-07-10  0:58                                 ` NeilBrown
  0 siblings, 0 replies; 50+ messages in thread
From: NeilBrown @ 2025-07-10  0:58 UTC (permalink / raw)
  To: Song Liu
  Cc: Christian Brauner, Tingmao Wang, Mickaël Salaün,
	Song Liu, bpf@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, jack@suse.cz, kpsingh@kernel.org,
	mattbobrowski@google.com, Günther Noack

On Thu, 10 Jul 2025, Song Liu wrote:
> 
> 
> > On Jul 9, 2025, at 3:14 PM, NeilBrown <neil@brown.name> wrote:
> > 
> > On Tue, 08 Jul 2025, Song Liu wrote:
> >> Hi Christian, 
> >> 
> >> Thanks for your comments! 
> >> 
> >>> On Jul 7, 2025, at 4:17 AM, Christian Brauner <brauner@kernel.org> wrote:
> >> 
> >> [...]
> >> 
> >>>>> 3/ Extend vfs_walk_ancestors() to pass a "may sleep" flag to the callback.
> >>>> 
> >>>> I think that's fine.
> >>> 
> >>> Ok, sorry for the delay but there's a lot of different things going on
> >>> right now and this one isn't exactly an easy thing to solve.
> >>> 
> >>> I mentioned this before and so did Neil: the lookup implementation
> >>> supports two modes sleeping and non-sleeping. That api is abstracted
> >>> away as heavily as possible by the VFS so that non-core code will not be
> >>> exposed to it other than in exceptional circumstances and doesn't have
> >>> to care about it.
> >>> 
> >>> It is a conceptual dead-end to expose these two modes via separate APIs
> >>> and leak this implementation detail into non-core code. It will not
> >>> happen as far as I'm concerned.
> >>> 
> >>> I very much understand the urge to get the refcount step-by-step thing
> >>> merged asap. Everyone wants their APIs merged fast. And if it's
> >>> reasonable to move fast we will (see the kernfs xattr thing).
> >>> 
> >>> But here are two use-cases that ask for the same thing with different
> >>> constraints that closely mirror our unified approach. Merging one
> >>> quickly just to have something and then later bolting the other one on
> >>> top, augmenting, or replacing, possible having to deprecate the old API
> >>> is just objectively nuts. That's how we end up with a spaghetthi helper
> >>> collection. We want as little helper fragmentation as possible.
> >>> 
> >>> We need a unified API that serves both use-cases. I dislike
> >>> callback-based APIs generally but we have precedent in the VFS for this
> >>> for cases where the internal state handling is delicate enough that it
> >>> should not be exposed (see __iterate_supers() which does exactly work
> >>> like Neil suggested down to the flag argument itself I added).
> >>> 
> >>> So I'm open to the callback solution.
> >>> 
> >>> (Note for really absurd perf requirements you could even make it work
> >>> with static calls I'm pretty sure.)
> >> 
> >> I guess we will go with Mickaël’s idea:
> >> 
> >>> int vfs_walk_ancestors(struct path *path,
> >>>                      bool (*walk_cb)(const struct path *ancestor, void *data),
> >>>                      void *data, int flags)
> >>> 
> >>> The walk continue while walk_cb() returns true.  walk_cb() can then
> >>> check if @ancestor is equal to a @root, or other properties.  The
> >>> walk_cb() return value (if not bool) should not be returned by
> >>> vfs_walk_ancestors() because a walk stop doesn't mean an error.
> >> 
> >> If necessary, we hide “root" inside @data. This is good. 
> >> 
> >>> @path would be updated with latest ancestor path (e.g. @root).
> >> 
> >> Update @path to the last ancestor and hold proper references. 
> >> I missed this part earlier. With this feature, vfs_walk_ancestors 
> >> should work usable with open-codeed bpf path iterator.
> > 
> > I don't think path should be updated.  That adds complexity which might
> > not be needed.  The original (landlock) requirements were only to look
> > at each ancestor, not to take a reference to any of them.
> 
> I think this is the ideal case that landlock wants in the long term. 
> But we may need to take references when the attempt fails. Also, 
> current landlock code takes reference at each step. 

Why may be need to?
Yes, current landlock code takes references, but I don't think that is
because it needs references, only because the API requires it to take
references. 

> 
> > If the caller needs a reference to any of the ancestors I think that
> > walk_cb() needs to take that reference and store it in data.
> > Note that attempting to take the reference might fail.  See
> > legitimize_path() in fs/namei.c.
> > 
> > It isn't yet clear to me what would be a good API for requesting the
> > reference.
> > One option would be for vfs_walk_ancestors() to pass another void* to
> > walk_cb(), and it passed it on to vfs_legitimize_path() which extracts
> > the seq numbers from there.
> > Another might be that the path passed to walk_cb is always
> > nameidata.path, and so when that is passed to vfs_legitimize_path() path
> > it can use container_of() to find the seq numbers.
> 
> Letting walk_cb() call vfs_legitimize_path() seems suboptimal to me. 
> I think the original goal is to have vfs_walk_ancestors() to:
>   1. Try to walk the ancestors without taking any references;
>   2. Detect when the not-taking-reference walk is not reliable;
>   3. Maybe, retry the walk from beginning, but takes references on 
>      each step. 
> 
> With walk_cb() calling vfs_legitimize_path(), we are moving #2 above 
> to walk_cb(). I think this is not what we want? 

I think you are looking at this the wrong way around.  Focus on the
needs for the caller, not on how you think it might be implemented.

If the caller needs a reference, there should be a way for it to get a
reference.  This is quite separate from the choices vfs_walk_ancestors()
makes about how it is going to walk the list of dentries.

NeilBrown

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v5 bpf-next 0/5] bpf path iterator
  2025-07-09 22:50                                   ` Song Liu
@ 2025-07-10  0:58                                     ` NeilBrown
  2025-07-10  6:28                                       ` Song Liu
  0 siblings, 1 reply; 50+ messages in thread
From: NeilBrown @ 2025-07-10  0:58 UTC (permalink / raw)
  To: Song Liu
  Cc: Mickaël Salaün, Christian Brauner, Tingmao Wang,
	Song Liu, bpf@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, jack@suse.cz, kpsingh@kernel.org,
	mattbobrowski@google.com, Günther Noack, Jann Horn

On Thu, 10 Jul 2025, Song Liu wrote:
> 
> 
> > On Jul 9, 2025, at 3:24 PM, NeilBrown <neil@brown.name> wrote:
> [...]
> >> 
> >> How should the user handle -ECHILD without LOOKUP_RCU flag? Say the
> >> following code in landlocked:
> >> 
> >> /* Try RCU walk first */
> >> err = vfs_walk_ancestors(path, ll_cb, data, LOOKUP_RCU);
> >> 
> >> if (err == -ECHILD) {
> >> struct path walk_path = *path;
> >> 
> >> /* reset any data changed by the walk */
> >> reset_data(data);
> >> 
> >> /* now do ref-walk */
> >> err = vfs_walk_ancestors(&walk_path, ll_cb, data, 0);
> >> }
> >> 
> >> Or do you mean vfs_walk_ancestors will never return -ECHILD?
> >> Then we need vfs_walk_ancestors to call reset_data logic, right?
> > 
> > It isn't clear to me that vfs_walk_ancestors() needs to return anything.
> > All the communication happens through walk_cb()
> > 
> > walk_cb() is called with a path, the data, and a "may_sleep" flag.
> > If it needs to sleep but may_sleep is not set, it returns "-ECHILD"
> > which causes the walk to restart and use refcounts.
> > If it wants to stop, it returns 0.
> > If it wants to continue, it returns 1.
> > If it wants a reference to the path then it can use (new)
> > vfs_legitimize_path() which might fail.
> > If it wants a reference to the path and may_sleep is true, it can use
> > path_get() which won't fail.
> > 
> > When returning -ECHILD (either because of a need to sleep or because
> > vfs_legitimize_path() fails), walk_cb() would reset_data().
> 
> This might actually work. 
> 
> My only concern is with vfs_legitimize_path. It is probably safer if 
> we only allow taking references with may_sleep==true, so that path_get
> won’t fail. In this case, we will not need walk_cb() to call 
> vfs_legitimize_path. If the user want a reference, the walk_cb will 
> first return -ECHILD, and call path_get when may_sleep is true. 

What is your concern with vfs_legitimize_path() ??

I've since realised that always restarting in response to -ECHILD isn't
necessary and isn't how normal path-walk works.  Restarting might be
needed, but the first response to -ECHILD is to try legitimize_path().
If that succeeds, then it is safe to sleep.
So returning -ECHILD might just result in vfs_walk_ancestors() calling
legitimize_path() and then calling walk_cb() again.  Why not have
walk_cb() do the vfs_legitimize_path() call (which will almost always
succeed in practice).

NeilBrown


> 
> Does this make sense? Did I miss any cases? 
> 
> Thanks,
> Song
> 
> 


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v5 bpf-next 0/5] bpf path iterator
  2025-07-10  0:58                                     ` NeilBrown
@ 2025-07-10  6:28                                       ` Song Liu
  2025-07-14 21:09                                         ` Song Liu
  0 siblings, 1 reply; 50+ messages in thread
From: Song Liu @ 2025-07-10  6:28 UTC (permalink / raw)
  To: NeilBrown
  Cc: Mickaël Salaün, Christian Brauner, Tingmao Wang,
	Song Liu, bpf@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, jack@suse.cz, kpsingh@kernel.org,
	mattbobrowski@google.com, Günther Noack, Jann Horn



> On Jul 9, 2025, at 5:58 PM, NeilBrown <neil@brown.name> wrote:
> 
> On Thu, 10 Jul 2025, Song Liu wrote:
>> 
>> 
>>> On Jul 9, 2025, at 3:24 PM, NeilBrown <neil@brown.name> wrote:
>> [...]
>>>> 
>>>> How should the user handle -ECHILD without LOOKUP_RCU flag? Say the
>>>> following code in landlocked:
>>>> 
>>>> /* Try RCU walk first */
>>>> err = vfs_walk_ancestors(path, ll_cb, data, LOOKUP_RCU);
>>>> 
>>>> if (err == -ECHILD) {
>>>> struct path walk_path = *path;
>>>> 
>>>> /* reset any data changed by the walk */
>>>> reset_data(data);
>>>> 
>>>> /* now do ref-walk */
>>>> err = vfs_walk_ancestors(&walk_path, ll_cb, data, 0);
>>>> }
>>>> 
>>>> Or do you mean vfs_walk_ancestors will never return -ECHILD?
>>>> Then we need vfs_walk_ancestors to call reset_data logic, right?
>>> 
>>> It isn't clear to me that vfs_walk_ancestors() needs to return anything.
>>> All the communication happens through walk_cb()
>>> 
>>> walk_cb() is called with a path, the data, and a "may_sleep" flag.
>>> If it needs to sleep but may_sleep is not set, it returns "-ECHILD"
>>> which causes the walk to restart and use refcounts.
>>> If it wants to stop, it returns 0.
>>> If it wants to continue, it returns 1.
>>> If it wants a reference to the path then it can use (new)
>>> vfs_legitimize_path() which might fail.
>>> If it wants a reference to the path and may_sleep is true, it can use
>>> path_get() which won't fail.
>>> 
>>> When returning -ECHILD (either because of a need to sleep or because
>>> vfs_legitimize_path() fails), walk_cb() would reset_data().
>> 
>> This might actually work. 
>> 
>> My only concern is with vfs_legitimize_path. It is probably safer if 
>> we only allow taking references with may_sleep==true, so that path_get
>> won’t fail. In this case, we will not need walk_cb() to call 
>> vfs_legitimize_path. If the user want a reference, the walk_cb will 
>> first return -ECHILD, and call path_get when may_sleep is true.
> 
> What is your concern with vfs_legitimize_path() ??
> 
> I've since realised that always restarting in response to -ECHILD isn't
> necessary and isn't how normal path-walk works.  Restarting might be
> needed, but the first response to -ECHILD is to try legitimize_path().
> If that succeeds, then it is safe to sleep.
> So returning -ECHILD might just result in vfs_walk_ancestors() calling
> legitimize_path() and then calling walk_cb() again.  Why not have
> walk_cb() do the vfs_legitimize_path() call (which will almost always
> succeed in practice).

After reading the emails and the code more, I think I misunderstood 
why we need to call vfs_legitimize_path(). The goal of “legitimize” 
is to get a reference on @path, so a reference-less walk may not
need legitimize_path() at all. Do I get this right this time? 

However, I still have some concern with legitimize_path: it requires
m_seq and r_seq recorded at the beginning of the walk, do we want
to pass those to walk_cb()? IIUC, one of the reason we prefer a 
callback based solution is that it doesn’t expose nameidata (or a
subset of it). Letting walk_cb to call legitimize_path appears to 
defeat this benefit, no? 


A separate question below. 

I still have some question about how vfs_walk_ancestors() and the 
walk_cb() interact. Let’s look at the landlock use case: the user 
(landlock) just want to look at each ancestor, but doesn’t need to 
take any references. walk_cb() will check @path against @root, and 
return 0 when @path is the same as @root. 

IIUC, in this case, we will record m_seq and r_seq at the beginning
of vfs_walk_ancestors(), and check them against mount_lock and 
rename_lock at the end of the walk. (Maybe we also need to check 
them at some points before the end of the walk?) If either seq
changed during the walk, we need to restart the walk, and take
reference on each step. Did I get this right so far? 

If the above is right, here are my questions about the 
reference-less walk above: 

1. Which function (vfs_walk_ancestors or walk_cb) will check m_seq 
   and r_seq? I think vfs_walk_ancestors should check them. 
2. When either seq changes, which function will call reset_data?
   I think there are 3 options here:
  2.a: vfs_walk_ancestors calls reset_data, which will be another
       callback function the caller passes to vfs_walk_ancestors. 
  2.b: walk_cb will call reset_data(), but we need a mechanism to
       tell walk_cb to do it, maybe a “restart” flag?
  2.c: Caller of vfs_walk_ancestors will call reset_data(). In 
       this case, vfs_walk_ancestors will return -ECHILD to its
       caller. But I think this option is NACKed. 

I think the right solution is to have vfs_walk_ancestors check
m_seq and r_seq, and have walk_cb call reset_data. But this is
Different to the proposal above. 

Do my questions above make any sense? Or maybe I totally 
misunderstood something?

Thanks,
Song


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v5 bpf-next 0/5] bpf path iterator
  2025-07-10  6:28                                       ` Song Liu
@ 2025-07-14 21:09                                         ` Song Liu
  2025-07-24 17:35                                           ` Mickaël Salaün
  0 siblings, 1 reply; 50+ messages in thread
From: Song Liu @ 2025-07-14 21:09 UTC (permalink / raw)
  To: NeilBrown
  Cc: Mickaël Salaün, Christian Brauner, Tingmao Wang,
	Song Liu, bpf@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, jack@suse.cz, kpsingh@kernel.org,
	mattbobrowski@google.com, Günther Noack, Jann Horn


> On Jul 9, 2025, at 11:28 PM, Song Liu <songliubraving@meta.com> wrote:

[...]

>>>> It isn't clear to me that vfs_walk_ancestors() needs to return anything.
>>>> All the communication happens through walk_cb()
>>>> 
>>>> walk_cb() is called with a path, the data, and a "may_sleep" flag.
>>>> If it needs to sleep but may_sleep is not set, it returns "-ECHILD"
>>>> which causes the walk to restart and use refcounts.
>>>> If it wants to stop, it returns 0.
>>>> If it wants to continue, it returns 1.
>>>> If it wants a reference to the path then it can use (new)
>>>> vfs_legitimize_path() which might fail.
>>>> If it wants a reference to the path and may_sleep is true, it can use
>>>> path_get() which won't fail.
>>>> 
>>>> When returning -ECHILD (either because of a need to sleep or because
>>>> vfs_legitimize_path() fails), walk_cb() would reset_data().
>>> 
>>> This might actually work. 
>>> 
>>> My only concern is with vfs_legitimize_path. It is probably safer if 
>>> we only allow taking references with may_sleep==true, so that path_get
>>> won’t fail. In this case, we will not need walk_cb() to call 
>>> vfs_legitimize_path. If the user want a reference, the walk_cb will 
>>> first return -ECHILD, and call path_get when may_sleep is true.
>> 
>> What is your concern with vfs_legitimize_path() ??
>> 
>> I've since realised that always restarting in response to -ECHILD isn't
>> necessary and isn't how normal path-walk works.  Restarting might be
>> needed, but the first response to -ECHILD is to try legitimize_path().
>> If that succeeds, then it is safe to sleep.
>> So returning -ECHILD might just result in vfs_walk_ancestors() calling
>> legitimize_path() and then calling walk_cb() again.  Why not have
>> walk_cb() do the vfs_legitimize_path() call (which will almost always
>> succeed in practice).
> 
> After reading the emails and the code more, I think I misunderstood 
> why we need to call vfs_legitimize_path(). The goal of “legitimize” 
> is to get a reference on @path, so a reference-less walk may not
> need legitimize_path() at all. Do I get this right this time? 
> 
> However, I still have some concern with legitimize_path: it requires
> m_seq and r_seq recorded at the beginning of the walk, do we want
> to pass those to walk_cb()? IIUC, one of the reason we prefer a 
> callback based solution is that it doesn’t expose nameidata (or a
> subset of it). Letting walk_cb to call legitimize_path appears to 
> defeat this benefit, no? 
> 
> 
> A separate question below. 
> 
> I still have some question about how vfs_walk_ancestors() and the 
> walk_cb() interact. Let’s look at the landlock use case: the user 
> (landlock) just want to look at each ancestor, but doesn’t need to 
> take any references. walk_cb() will check @path against @root, and 
> return 0 when @path is the same as @root. 
> 
> IIUC, in this case, we will record m_seq and r_seq at the beginning
> of vfs_walk_ancestors(), and check them against mount_lock and 
> rename_lock at the end of the walk. (Maybe we also need to check 
> them at some points before the end of the walk?) If either seq
> changed during the walk, we need to restart the walk, and take
> reference on each step. Did I get this right so far? 
> 
> If the above is right, here are my questions about the 
> reference-less walk above: 
> 
> 1. Which function (vfs_walk_ancestors or walk_cb) will check m_seq 
>   and r_seq? I think vfs_walk_ancestors should check them. 
> 2. When either seq changes, which function will call reset_data?
>   I think there are 3 options here:
>  2.a: vfs_walk_ancestors calls reset_data, which will be another
>       callback function the caller passes to vfs_walk_ancestors. 
>  2.b: walk_cb will call reset_data(), but we need a mechanism to
>       tell walk_cb to do it, maybe a “restart” flag?
>  2.c: Caller of vfs_walk_ancestors will call reset_data(). In 
>       this case, vfs_walk_ancestors will return -ECHILD to its
>       caller. But I think this option is NACKed. 
> 
> I think the right solution is to have vfs_walk_ancestors check
> m_seq and r_seq, and have walk_cb call reset_data. But this is
> Different to the proposal above. 
> 
> Do my questions above make any sense? Or maybe I totally 
> misunderstood something?

Hi Neil, 

Did my questions/comments above make sense? I am hoping we can 
agree on some design soon. 

Christian and Mickaël, 

Could you please also share your thoughts on this?

Current requirements from BPF side is straightforward: we just
need a mechanism to “walk up one level and hold reference”. So
most of the requirement comes from LandLock side. 

Thanks,
Song


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v5 bpf-next 0/5] bpf path iterator
  2025-07-14 21:09                                         ` Song Liu
@ 2025-07-24 17:35                                           ` Mickaël Salaün
  2025-07-26  9:52                                             ` Song Liu
  0 siblings, 1 reply; 50+ messages in thread
From: Mickaël Salaün @ 2025-07-24 17:35 UTC (permalink / raw)
  To: Song Liu
  Cc: NeilBrown, Christian Brauner, Tingmao Wang, Song Liu,
	bpf@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, jack@suse.cz, kpsingh@kernel.org,
	mattbobrowski@google.com, Günther Noack, Jann Horn

On Mon, Jul 14, 2025 at 09:09:42PM +0000, Song Liu wrote:
> 
> > On Jul 9, 2025, at 11:28 PM, Song Liu <songliubraving@meta.com> wrote:
> 
> [...]
> 
> >>>> It isn't clear to me that vfs_walk_ancestors() needs to return anything.
> >>>> All the communication happens through walk_cb()
> >>>> 
> >>>> walk_cb() is called with a path, the data, and a "may_sleep" flag.
> >>>> If it needs to sleep but may_sleep is not set, it returns "-ECHILD"
> >>>> which causes the walk to restart and use refcounts.
> >>>> If it wants to stop, it returns 0.
> >>>> If it wants to continue, it returns 1.
> >>>> If it wants a reference to the path then it can use (new)
> >>>> vfs_legitimize_path() which might fail.
> >>>> If it wants a reference to the path and may_sleep is true, it can use
> >>>> path_get() which won't fail.
> >>>> 
> >>>> When returning -ECHILD (either because of a need to sleep or because
> >>>> vfs_legitimize_path() fails), walk_cb() would reset_data().
> >>> 
> >>> This might actually work. 
> >>> 
> >>> My only concern is with vfs_legitimize_path. It is probably safer if 
> >>> we only allow taking references with may_sleep==true, so that path_get
> >>> won’t fail. In this case, we will not need walk_cb() to call 
> >>> vfs_legitimize_path. If the user want a reference, the walk_cb will 
> >>> first return -ECHILD, and call path_get when may_sleep is true.
> >> 
> >> What is your concern with vfs_legitimize_path() ??
> >> 
> >> I've since realised that always restarting in response to -ECHILD isn't
> >> necessary and isn't how normal path-walk works.  Restarting might be
> >> needed, but the first response to -ECHILD is to try legitimize_path().
> >> If that succeeds, then it is safe to sleep.
> >> So returning -ECHILD might just result in vfs_walk_ancestors() calling
> >> legitimize_path() and then calling walk_cb() again.  Why not have
> >> walk_cb() do the vfs_legitimize_path() call (which will almost always
> >> succeed in practice).
> > 
> > After reading the emails and the code more, I think I misunderstood 
> > why we need to call vfs_legitimize_path(). The goal of “legitimize” 
> > is to get a reference on @path, so a reference-less walk may not
> > need legitimize_path() at all. Do I get this right this time? 
> > 
> > However, I still have some concern with legitimize_path: it requires
> > m_seq and r_seq recorded at the beginning of the walk, do we want
> > to pass those to walk_cb()? IIUC, one of the reason we prefer a 
> > callback based solution is that it doesn’t expose nameidata (or a
> > subset of it). Letting walk_cb to call legitimize_path appears to 
> > defeat this benefit, no? 

Yes, walk_cb() should be very light and non-blocking/non-sleepable.  If
the caller cannot give these guarantees, then it can just pass NULL
instead of a valid walk_cb(), and continue the walk (if needed) by
calling the vfs_walk_ancentors() helper again, which would not benefit
from the RCU optimization in this case.

Before this patch series land, handling of disconnected directories
should be well defined, or at least let the caller deal with it.  How do
you plan to handle disconnected directories for the eBPF use case?  See
https://lore.kernel.org/all/20250719104204.545188-1-mic@digikod.net/
Unfortunately, this issue is not solved for Landlock yet.

> > 
> > 
> > A separate question below. 
> > 
> > I still have some question about how vfs_walk_ancestors() and the 
> > walk_cb() interact. Let’s look at the landlock use case: the user 
> > (landlock) just want to look at each ancestor, but doesn’t need to 
> > take any references. walk_cb() will check @path against @root, and 
> > return 0 when @path is the same as @root. 
> > 
> > IIUC, in this case, we will record m_seq and r_seq at the beginning
> > of vfs_walk_ancestors(), and check them against mount_lock and 
> > rename_lock at the end of the walk. (Maybe we also need to check 
> > them at some points before the end of the walk?) If either seq
> > changed during the walk, we need to restart the walk, and take
> > reference on each step. Did I get this right so far? 

I think so.  You should get some inspiration from prepend_path().

> > 
> > If the above is right, here are my questions about the 
> > reference-less walk above: 
> > 
> > 1. Which function (vfs_walk_ancestors or walk_cb) will check m_seq 
> >   and r_seq? I think vfs_walk_ancestors should check them. 

Yes, walk_cb() should be as simple as possible: the simpler version
should just return a constant.

> > 2. When either seq changes, which function will call reset_data?
> >   I think there are 3 options here:
> >  2.a: vfs_walk_ancestors calls reset_data, which will be another
> >       callback function the caller passes to vfs_walk_ancestors. 
> >  2.b: walk_cb will call reset_data(), but we need a mechanism to
> >       tell walk_cb to do it, maybe a “restart” flag?
> >  2.c: Caller of vfs_walk_ancestors will call reset_data(). In 
> >       this case, vfs_walk_ancestors will return -ECHILD to its
> >       caller. But I think this option is NACKed. 
> > 
> > I think the right solution is to have vfs_walk_ancestors check
> > m_seq and r_seq, and have walk_cb call reset_data. But this is
> > Different to the proposal above. 

I'm not sure a reset_data() would be useful if walk_cb() never sleep.

If we really need such reset_data(), a fourth option would be for
walk_cb() to return a specific value (an enum instead of a bool) to
trigger the reset.

> > 
> > Do my questions above make any sense? Or maybe I totally 
> > misunderstood something?
> 
> Hi Neil, 
> 
> Did my questions/comments above make sense? I am hoping we can 
> agree on some design soon. 
> 
> Christian and Mickaël, 
> 
> Could you please also share your thoughts on this?
> 
> Current requirements from BPF side is straightforward: we just
> need a mechanism to “walk up one level and hold reference”. So
> most of the requirement comes from LandLock side. 

Have you thought about how to handle disconnected directories?

> 
> Thanks,
> Song
> 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v5 bpf-next 0/5] bpf path iterator
  2025-07-24 17:35                                           ` Mickaël Salaün
@ 2025-07-26  9:52                                             ` Song Liu
  0 siblings, 0 replies; 50+ messages in thread
From: Song Liu @ 2025-07-26  9:52 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: NeilBrown, Christian Brauner, Tingmao Wang, Song Liu,
	bpf@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, Kernel Team,
	andrii@kernel.org, eddyz87@gmail.com, ast@kernel.org,
	daniel@iogearbox.net, martin.lau@linux.dev,
	viro@zeniv.linux.org.uk, jack@suse.cz, kpsingh@kernel.org,
	mattbobrowski@google.com, Günther Noack, Jann Horn



> On Jul 25, 2025, at 1:35 AM, Mickaël Salaün <mic@digikod.net> wrote:
[...]
>>> 
>>> Do my questions above make any sense? Or maybe I totally 
>>> misunderstood something?
>> 
>> Hi Neil, 
>> 
>> Did my questions/comments above make sense? I am hoping we can 
>> agree on some design soon. 
>> 
>> Christian and Mickaël, 
>> 
>> Could you please also share your thoughts on this?
>> 
>> Current requirements from BPF side is straightforward: we just
>> need a mechanism to “walk up one level and hold reference”. So
>> most of the requirement comes from LandLock side.
> 
> Have you thought about how to handle disconnected directories?

In the case of open-coded path iterator, the iterator will 
return a special value for disconnected roots and disconnected 
directories. Then the BPF program need to handle them based on 
the policy. 

Thanks,
Song



^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2025-07-26  9:52 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-17  6:11 [PATCH v5 bpf-next 0/5] bpf path iterator Song Liu
2025-06-17  6:11 ` [PATCH v5 bpf-next 1/5] namei: Introduce new helper function path_walk_parent() Song Liu
2025-06-18  1:02   ` kernel test robot
2025-06-24 12:18   ` Jan Kara
2025-06-24 17:37     ` Song Liu
2025-06-25 10:30       ` Jan Kara
2025-07-04 17:40   ` Yonghong Song
2025-07-06 23:54     ` Song Liu
2025-07-07 17:53       ` Yonghong Song
2025-06-17  6:11 ` [PATCH v5 bpf-next 2/5] landlock: Use path_walk_parent() Song Liu
2025-07-03 18:29   ` Mickaël Salaün
2025-07-03 22:27     ` Song Liu
2025-07-04  9:00       ` Mickaël Salaün
2025-07-06 22:29         ` Song Liu
2025-07-07 10:28         ` Christian Brauner
2025-06-17  6:11 ` [PATCH v5 bpf-next 3/5] bpf: Introduce path iterator Song Liu
2025-06-17  6:11 ` [PATCH v5 bpf-next 4/5] selftests/bpf: Add tests for bpf " Song Liu
2025-06-17  6:11 ` [PATCH v5 bpf-next 5/5] selftests/bpf: Path walk test Song Liu
2025-06-20 21:59 ` [PATCH v5 bpf-next 0/5] bpf path iterator Song Liu
2025-06-24 18:45   ` Mickaël Salaün
2025-06-24 21:38     ` NeilBrown
2025-06-25 13:14       ` Mickaël Salaün
2025-06-25 23:04         ` NeilBrown
2025-06-25 23:17           ` Song Liu
2025-06-26  0:07           ` Tingmao Wang
2025-06-26  1:05             ` NeilBrown
2025-06-26  5:52               ` Song Liu
2025-06-26  9:43                 ` Mickaël Salaün
2025-06-26 14:49                   ` Song Liu
2025-06-26 10:22                 ` NeilBrown
2025-06-26 14:28                   ` Song Liu
2025-06-26 22:51                     ` NeilBrown
2025-06-27  0:21                       ` Song Liu
2025-07-07 10:46                       ` Christian Brauner
2025-07-07 11:17                         ` Christian Brauner
2025-07-07 18:50                           ` Song Liu
2025-07-09 16:06                             ` Mickaël Salaün
2025-07-09 17:31                               ` Song Liu
2025-07-09 22:24                                 ` NeilBrown
2025-07-09 22:50                                   ` Song Liu
2025-07-10  0:58                                     ` NeilBrown
2025-07-10  6:28                                       ` Song Liu
2025-07-14 21:09                                         ` Song Liu
2025-07-24 17:35                                           ` Mickaël Salaün
2025-07-26  9:52                                             ` Song Liu
2025-07-09 22:14                             ` NeilBrown
2025-07-09 22:41                               ` Song Liu
2025-07-10  0:58                                 ` NeilBrown
2025-07-07 10:43               ` Christian Brauner
2025-07-03  5:04     ` Song Liu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).