linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH 0/3] landlock: walk parent dir with RCU, without taking references
@ 2025-06-04  0:45 Tingmao Wang
  2025-06-04  0:45 ` [RFC PATCH 1/3] landlock: walk parent dir " Tingmao Wang
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Tingmao Wang @ 2025-06-04  0:45 UTC (permalink / raw)
  To: Mickaël Salaün, Song Liu, Al Viro
  Cc: Tingmao Wang, Günther Noack, Jan Kara, Alexei Starovoitov,
	Christian Brauner, linux-security-module, linux-fsdevel

Hi Mickaël, Song, Al and other reviewers,

In this series I present some experiments and investigations I did on a
potential optimization for Landlock, which is to avoid taking references
during the walk from the target being accessed toward the global root.  I
initially made this change and tested it, but did not send for review as I
was slightly uncertain about the correctness here, but I did saw
significant performance improvements.  However, earlier I saw Al Viro mention
in another discussion [1]:

> Note, BTW, that it might be better off by doing that similar to
> d_path.c - without arseloads of dget_parent/dput et.al.; not sure
> how feasible it is, but if everything in it can be done under
> rcu_read_lock(), that's something to look into.

which prompted me to pick this up again (even though I realize in the
reply to [1] Song wasn't sure this would work).  After some thinking, I
can think of one particular case where a simple chase of `d_parent`s would
be incorrect - that is, when unlinks are involved and dentry can turn
negative.  I _think_ that if we can check the global rename seqcount and
restart the pathwalk with the old approach if it changes, we might avoid
these sort of issues.  However I'm new to this and very welcome to be
proven wrong here :)

I realize that Song has already submitted a patch [2] to extract out the
pathwalk logic here and improve the mountpoint handling, and I don't want
this to block that in any way, but I think this is worth looking into
separate from that, based on the benchmarking results below:

Earlier when testing other Landlock performance improvement ideas I came
up with what I think is a "typical" workload [3] involving two layers and
a few rules to places like $HOME, /tmp, /proc etc. (Note that this is just
something I came up with, and Mickaël might not agree on the "typicalness"
here.)  With that we have:

Comparing:                    original pathwalk-noref
  // this is the % of time spent in Landlock within an openat syscall
  landlock_overhead:    avg = 33       27      (-6)      (%)
                     median = 34       28      (-6)      (%)
  // absolute time spent in Landlock
  landlock_hook:        avg = 852      627     (-26.4%)  (ns)
                     median = 827      611     (-26.1%)  (ns)
  // duration of the entire openat syscall
  open_syscall:         avg = 2507     2260    (-9.9%)   (ns)
                     median = 2451     2222    (-9.3%)   (ns)

I also ran the benchmarks Mickaël created [4], which tests openat
performance at various depths in the file hierarchy, all with 2 rules, and
the results are basically:

depth inc. /    % change in median time spent in Landlock
                          % change in median openat duration
                                   % in Landlock, old -> new
           1    -21.9%    +3.4%    9 -> 7
          10    -45.3%    -2.4%    16 -> 9
          20    -57.2%    -7.7%    21 -> 10
          30    -61.1%    -75.9%   22 -> 11
                        // ^^ On the unchanged kernel, this seems to be a bimodal
                        // distribution when landlock is involved.  Outliers happens
                        // much less on the kernel with this patch applied.

Raw results including histograms at https://fileshare.maowtm.org/landlock/20250603/index.html

[1]: https://lore.kernel.org/all/20250529231018.GP2023217@ZenIV/
[2]: https://lore.kernel.org/all/20250603065920.3404510-1-song@kernel.org/
[3]: https://github.com/torvalds/linux/commit/f1865ce970af97ac3b6f4edf580529b8cdc66371
[4]: https://github.com/landlock-lsm/landlock-test-tools/pull/16

Tingmao Wang (3):
  landlock: walk parent dir without taking references
  selftests/landlock: Add fs_race_test
  Restart pathwalk on rename seqcount change

 security/landlock/fs.c                        | 133 ++++-
 .../testing/selftests/landlock/fs_race_test.c | 505 ++++++++++++++++++
 2 files changed, 617 insertions(+), 21 deletions(-)
 create mode 100644 tools/testing/selftests/landlock/fs_race_test.c


base-commit: a5806cd506af5a7c19bcd596e4708b5c464bfd21
--
2.49.0

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [RFC PATCH 1/3] landlock: walk parent dir without taking references
  2025-06-04  0:45 [RFC PATCH 0/3] landlock: walk parent dir with RCU, without taking references Tingmao Wang
@ 2025-06-04  0:45 ` Tingmao Wang
  2025-06-04 17:15   ` Mickaël Salaün
  2025-06-04  0:45 ` [RFC PATCH 2/3] selftests/landlock: Add fs_race_test Tingmao Wang
  2025-06-04  0:45 ` [RFC PATCH 3/3] Restart pathwalk on rename seqcount change Tingmao Wang
  2 siblings, 1 reply; 13+ messages in thread
From: Tingmao Wang @ 2025-06-04  0:45 UTC (permalink / raw)
  To: Mickaël Salaün, Song Liu, Al Viro
  Cc: Tingmao Wang, Günther Noack, Jan Kara, Alexei Starovoitov,
	Christian Brauner, linux-security-module, linux-fsdevel

This commit replaces dget_parent with a direct read of d_parent. By
holding rcu read lock we should be safe in terms of not reading freed
memory, but this is still problematic due to move+unlink, as will be shown
with the test in the next commit.

Note that follow_up is still used when walking up a mountpoint.

Signed-off-by: Tingmao Wang <m@maowtm.org>
---
 security/landlock/fs.c | 40 ++++++++++++++++++++++------------------
 1 file changed, 22 insertions(+), 18 deletions(-)

diff --git a/security/landlock/fs.c b/security/landlock/fs.c
index 6fee7c20f64d..923737412cfa 100644
--- a/security/landlock/fs.c
+++ b/security/landlock/fs.c
@@ -361,7 +361,7 @@ int landlock_append_fs_rule(struct landlock_ruleset *const ruleset,
  * Returns NULL if no rule is found or if @dentry is negative.
  */
 static const struct landlock_rule *
-find_rule(const struct landlock_ruleset *const domain,
+find_rule_rcu(const struct landlock_ruleset *const domain,
 	  const struct dentry *const dentry)
 {
 	const struct landlock_rule *rule;
@@ -375,10 +375,10 @@ find_rule(const struct landlock_ruleset *const domain,
 		return NULL;
 
 	inode = d_backing_inode(dentry);
-	rcu_read_lock();
+	if (unlikely(!inode))
+		return NULL;
 	id.key.object = rcu_dereference(landlock_inode(inode)->object);
 	rule = landlock_find_rule(domain, id);
-	rcu_read_unlock();
 	return rule;
 }
 
@@ -809,9 +809,11 @@ static bool is_access_to_paths_allowed(
 		is_dom_check = false;
 	}
 
+	rcu_read_lock();
+
 	if (unlikely(dentry_child1)) {
 		landlock_unmask_layers(
-			find_rule(domain, dentry_child1),
+			find_rule_rcu(domain, dentry_child1),
 			landlock_init_layer_masks(
 				domain, LANDLOCK_MASK_ACCESS_FS,
 				&_layer_masks_child1, LANDLOCK_KEY_INODE),
@@ -821,7 +823,7 @@ static bool is_access_to_paths_allowed(
 	}
 	if (unlikely(dentry_child2)) {
 		landlock_unmask_layers(
-			find_rule(domain, dentry_child2),
+			find_rule_rcu(domain, dentry_child2),
 			landlock_init_layer_masks(
 				domain, LANDLOCK_MASK_ACCESS_FS,
 				&_layer_masks_child2, LANDLOCK_KEY_INODE),
@@ -831,7 +833,6 @@ static bool is_access_to_paths_allowed(
 	}
 
 	walker_path = *path;
-	path_get(&walker_path);
 	/*
 	 * We need to walk through all the hierarchy to not miss any relevant
 	 * restriction.
@@ -880,7 +881,7 @@ static bool is_access_to_paths_allowed(
 				break;
 		}
 
-		rule = find_rule(domain, walker_path.dentry);
+		rule = find_rule_rcu(domain, walker_path.dentry);
 		allowed_parent1 = allowed_parent1 ||
 				  landlock_unmask_layers(
 					  rule, access_masked_parent1,
@@ -897,10 +898,14 @@ static bool is_access_to_paths_allowed(
 			break;
 jump_up:
 		if (walker_path.dentry == walker_path.mnt->mnt_root) {
+			/* follow_up gets the parent and puts the passed in path */
+			path_get(&walker_path);
 			if (follow_up(&walker_path)) {
+				path_put(&walker_path);
 				/* Ignores hidden mount points. */
 				goto jump_up;
 			} else {
+				path_put(&walker_path);
 				/*
 				 * Stops at the real root.  Denies access
 				 * because not all layers have granted access.
@@ -920,11 +925,11 @@ static bool is_access_to_paths_allowed(
 			}
 			break;
 		}
-		parent_dentry = dget_parent(walker_path.dentry);
-		dput(walker_path.dentry);
+		parent_dentry = walker_path.dentry->d_parent;
 		walker_path.dentry = parent_dentry;
 	}
-	path_put(&walker_path);
+
+	rcu_read_unlock();
 
 	if (!allowed_parent1) {
 		log_request_parent1->type = LANDLOCK_REQUEST_FS_ACCESS;
@@ -1045,12 +1050,11 @@ static bool collect_domain_accesses(
 					       layer_masks_dom,
 					       LANDLOCK_KEY_INODE);
 
-	dget(dir);
-	while (true) {
-		struct dentry *parent_dentry;
+	rcu_read_lock();
 
+	while (true) {
 		/* Gets all layers allowing all domain accesses. */
-		if (landlock_unmask_layers(find_rule(domain, dir), access_dom,
+		if (landlock_unmask_layers(find_rule_rcu(domain, dir), access_dom,
 					   layer_masks_dom,
 					   ARRAY_SIZE(*layer_masks_dom))) {
 			/*
@@ -1065,11 +1069,11 @@ static bool collect_domain_accesses(
 		if (dir == mnt_root || WARN_ON_ONCE(IS_ROOT(dir)))
 			break;
 
-		parent_dentry = dget_parent(dir);
-		dput(dir);
-		dir = parent_dentry;
+		dir = dir->d_parent;
 	}
-	dput(dir);
+
+	rcu_read_unlock();
+
 	return ret;
 }
 
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC PATCH 2/3] selftests/landlock: Add fs_race_test
  2025-06-04  0:45 [RFC PATCH 0/3] landlock: walk parent dir with RCU, without taking references Tingmao Wang
  2025-06-04  0:45 ` [RFC PATCH 1/3] landlock: walk parent dir " Tingmao Wang
@ 2025-06-04  0:45 ` Tingmao Wang
  2025-06-04  0:45 ` [RFC PATCH 3/3] Restart pathwalk on rename seqcount change Tingmao Wang
  2 siblings, 0 replies; 13+ messages in thread
From: Tingmao Wang @ 2025-06-04  0:45 UTC (permalink / raw)
  To: Mickaël Salaün, Song Liu, Al Viro
  Cc: Tingmao Wang, Günther Noack, Jan Kara, Alexei Starovoitov,
	Christian Brauner, linux-security-module, linux-fsdevel

By not taking references to parent directories when walking, the dentry
can become negative under us, if the target file is moved then the parent
quickly deleted.  This is problematic because it means that we lose access
to any landlock rules attached to that parent, and thus we won't be able
to grant the correct access even when we're allowed to pretend the move
hasn't happened yet.

This commit tests a slightly more complicated scenario, where after moving
the file's parent directory away, the next two intermediate directories
are quickly removed.  This demonstrates that in this situation the only
choice for Landlock is to restart the walk.  Without doing that, even if
we were to re-check d_parent, if we were "cut off" when we've already
walked away from the original leaf, we would still not be able to recover.

As an illustration:

mkdir /d1 /d1/d2 /b
create landlock rule on /d1
create landlock rule on /b
touch /d1/d2/file

thread 1                              thread 2
                                      cd /d1/d2
                                      cat file
                                        landlock walks to /d1/d2 without ref,
                                        checked rule on /d1/d2 (nothing),
                                        about to walk up to /d1
mv /d1/d2/file /b
rmdir /d1/d2 /d1
(/d1/d2 and /d1 both becomes negative)
                                        notices stuff changed
                                        at this point, we're looking at /d1/d2
                                        and trying to walk to its parent.
                                        however, both /d1/d2 and /d1 are negative
                                        now.
                                        Our only choice is to restart the walk
                                        altogether from the original file's dentry,
                                        which will now have d_parent /b.

The test is probablistic as it tests for a race condition, but I found
that in my environment, with the previous patch, it pretty much always
reliably fails within 10 seconds.  I've set the timeout to 30 seconds, and
the test will pass if no permission errors (or other errors) detected.  In
those 30 seconds it will keep recreating the above directory structure
(except with a lot more sibling directories so it can run for some time
before it "exhausts" all the directories and has to recreate the whole
thing).

Signed-off-by: Tingmao Wang <m@maowtm.org>
---
 .../testing/selftests/landlock/fs_race_test.c | 505 ++++++++++++++++++
 1 file changed, 505 insertions(+)
 create mode 100644 tools/testing/selftests/landlock/fs_race_test.c

diff --git a/tools/testing/selftests/landlock/fs_race_test.c b/tools/testing/selftests/landlock/fs_race_test.c
new file mode 100644
index 000000000000..16a70ff90532
--- /dev/null
+++ b/tools/testing/selftests/landlock/fs_race_test.c
@@ -0,0 +1,505 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Landlock tests - Pathwalk race conditions
+ *
+ * Copyright © 2025 Tingmao Wang <m@maowtm.org>
+ */
+
+#define _GNU_SOURCE
+#include <unistd.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <time.h>
+#include <linux/fs.h>
+#include <linux/mount.h>
+
+#include "common.h"
+
+#define NUM_SUBDIRS 1000
+#define TEST_DIR TMP_DIR "/fs_race_test"
+#define SUBDIR_NAME_FORMAT "s%dd1"
+#define SUBDIR2_NAME_FORMAT "s%dd2"
+#define SUBDIR3_NAME "d3"
+#define TEST_FILE_NAME "file"
+#define TEST_TIME 30
+#define RANDOM_DELAY_AFTER_MOVE false
+
+/* layout hierarchy:
+ * tmp
+ * └── fs_race_test
+ *     ├── s0d1
+ *     │   └── s0d2
+ *     │       └── d3
+ *     │           └── file
+ *     |── s1d1
+ *     │   └── s1d2
+ *     └── ...
+ */
+
+FIXTURE(layout)
+{
+	int base_dir_fd;
+	bool need_subdir_cleanup;
+	int subdir_fds[NUM_SUBDIRS];
+	int subdir2_fds[NUM_SUBDIRS];
+	int subdir3_fd;
+	int subdir3_at;
+	int ruleset_fd;
+};
+
+static void create_subdirs(struct __test_metadata *const _metadata,
+			   struct _test_data_layout *const self)
+{
+	int i, err;
+	char subdir[20], subdir2[20];
+
+	for (i = 0; i < NUM_SUBDIRS; i++) {
+		snprintf(subdir, sizeof(subdir), SUBDIR_NAME_FORMAT, i);
+		err = mkdirat(self->base_dir_fd, subdir, 0755);
+
+		ASSERT_TRUE(err == 0 || errno == EEXIST)
+		{
+			TH_LOG("Failed to create " TEST_DIR "/%s: %s", subdir,
+			       strerror(errno));
+		}
+		self->subdir_fds[i] = openat(self->base_dir_fd, subdir, O_PATH);
+		ASSERT_NE(self->subdir_fds[i], -1)
+		{
+			TH_LOG("Failed to open " TEST_DIR "/%s: %s", subdir,
+			       strerror(errno));
+		}
+
+		snprintf(subdir2, sizeof(subdir2), SUBDIR2_NAME_FORMAT, i);
+		err = mkdirat(self->subdir_fds[i], subdir2, 0755);
+		ASSERT_TRUE(err == 0 || errno == EEXIST)
+		{
+			TH_LOG("Failed to create " TEST_DIR "/%s/%s: %s",
+			       subdir, subdir2, strerror(errno));
+		}
+		self->subdir2_fds[i] =
+			openat(self->subdir_fds[i], subdir2, O_PATH);
+		ASSERT_NE(self->subdir2_fds[i], -1)
+		{
+			TH_LOG("Failed to open " TEST_DIR "/%s/%s: %s", subdir,
+			       subdir2, strerror(errno));
+		}
+	}
+
+	self->subdir3_at = 0;
+	err = mkdirat(self->subdir2_fds[self->subdir3_at], SUBDIR3_NAME, 0755);
+	ASSERT_TRUE(err == 0)
+	{
+		TH_LOG("Failed to create " TEST_DIR "/" SUBDIR_NAME_FORMAT
+		       "/" SUBDIR2_NAME_FORMAT "/" SUBDIR3_NAME ": %s",
+		       self->subdir3_at, self->subdir3_at, strerror(errno));
+	}
+	self->subdir3_fd = openat(self->subdir2_fds[self->subdir3_at],
+				  SUBDIR3_NAME, O_PATH);
+	ASSERT_NE(self->subdir3_fd, -1)
+	{
+		TH_LOG("Failed to open " TEST_DIR "/" SUBDIR_NAME_FORMAT
+		       "/" SUBDIR2_NAME_FORMAT "/" SUBDIR3_NAME ": %s",
+		       self->subdir3_at, self->subdir3_at, strerror(errno));
+	}
+
+	self->need_subdir_cleanup = true;
+}
+
+static void cleanup_subdirs(struct __test_metadata *const _metadata,
+			    struct _test_data_layout *const self)
+{
+	int i, err;
+	char subdir[20], subdir2[20];
+
+	if (!self->need_subdir_cleanup)
+		return;
+
+	self->need_subdir_cleanup = false;
+
+	if (self->subdir3_fd != -1) {
+		err = unlinkat(self->subdir3_fd, TEST_FILE_NAME, 0);
+		ASSERT_TRUE(err == 0 || errno == ENOENT)
+		{
+			TH_LOG("Failed to remove " TEST_DIR
+			       "/" SUBDIR_NAME_FORMAT "/" SUBDIR2_NAME_FORMAT
+			       "/" SUBDIR3_NAME "/" TEST_FILE_NAME ": %s",
+			       self->subdir3_at, self->subdir3_at,
+			       strerror(errno));
+		}
+		close(self->subdir3_fd);
+		self->subdir3_fd = -1;
+
+		err = unlinkat(self->subdir2_fds[self->subdir3_at],
+			       SUBDIR3_NAME, AT_REMOVEDIR);
+		ASSERT_TRUE(err == 0 || errno == ENOENT)
+		{
+			TH_LOG("Failed to remove " TEST_DIR
+			       "/" SUBDIR_NAME_FORMAT "/" SUBDIR2_NAME_FORMAT
+			       "/" SUBDIR3_NAME ": %s",
+			       self->subdir3_at, self->subdir3_at,
+			       strerror(errno));
+		}
+		self->subdir3_at = -1;
+	}
+
+	for (i = 0; i < NUM_SUBDIRS; i++) {
+		if (self->subdir2_fds[i] != -1) {
+			close(self->subdir2_fds[i]);
+			self->subdir2_fds[i] = -1;
+
+			snprintf(subdir2, sizeof(subdir2), SUBDIR2_NAME_FORMAT,
+				 i);
+			err = unlinkat(self->subdir_fds[i], subdir2,
+				       AT_REMOVEDIR);
+			ASSERT_TRUE(err == 0 || errno == ENOENT)
+			{
+				TH_LOG("Failed to remove " TEST_DIR
+				       "/" SUBDIR_NAME_FORMAT "/%s: %s",
+				       i, subdir2, strerror(errno));
+			}
+		}
+
+		if (self->subdir_fds[i] == -1)
+			continue;
+
+		close(self->subdir_fds[i]);
+		self->subdir_fds[i] = -1;
+
+		snprintf(subdir, sizeof(subdir), SUBDIR_NAME_FORMAT, i);
+		err = unlinkat(self->base_dir_fd, subdir, AT_REMOVEDIR);
+		ASSERT_TRUE(err == 0 || errno == ENOENT)
+		{
+			TH_LOG("Failed to remove " TEST_DIR "/%s: %s", subdir,
+			       strerror(errno));
+		}
+	}
+}
+
+static void create_test_dir(struct __test_metadata *const _metadata,
+			    struct _test_data_layout *const self)
+{
+	int err;
+
+	err = mkdir(TMP_DIR, 0755);
+	ASSERT_TRUE(err == 0 || errno == EEXIST)
+	{
+		TH_LOG("Failed to create ./" TMP_DIR ": %s", strerror(errno));
+		return;
+	}
+
+	err = mkdir(TEST_DIR, 0755);
+	ASSERT_TRUE(err == 0 || errno == EEXIST)
+	{
+		TH_LOG("Failed to create " TEST_DIR ": %s", strerror(errno));
+		return;
+	}
+
+	self->base_dir_fd = open(TEST_DIR, O_PATH);
+	ASSERT_NE(self->base_dir_fd, -1)
+	{
+		TH_LOG("Failed to open " TEST_DIR ": %s", strerror(errno));
+		return;
+	}
+}
+
+static void cleanup_test_dir(struct __test_metadata *const _metadata,
+			     struct _test_data_layout *const self)
+{
+	int err;
+
+	close(self->base_dir_fd);
+	err = rmdir(TEST_DIR);
+	ASSERT_EQ(0, err)
+	{
+		TH_LOG("Failed to remove " TEST_DIR ": %s", strerror(errno));
+	}
+	err = rmdir(TMP_DIR);
+	ASSERT_EQ(0, err)
+	{
+		TH_LOG("Failed to remove ./" TMP_DIR ": %s", strerror(errno));
+	}
+}
+
+static void create_test_file(struct __test_metadata *const _metadata,
+			     struct _test_data_layout *const self)
+{
+	int dfd;
+	int fd;
+
+	ASSERT_NE(-1, self->subdir3_at);
+	dfd = self->subdir3_fd;
+	ASSERT_NE(-1, dfd);
+
+	fd = openat(dfd, TEST_FILE_NAME, O_CREAT | O_RDWR, 0644);
+	ASSERT_NE(-1, fd)
+	{
+		TH_LOG("Failed to create " TEST_DIR "/" SUBDIR_NAME_FORMAT
+		       "/" SUBDIR2_NAME_FORMAT "/" SUBDIR3_NAME
+		       "/" TEST_FILE_NAME ": %s",
+		       self->subdir3_at, self->subdir3_at, strerror(errno));
+		return;
+	}
+	close(fd);
+}
+
+struct shared_region {
+	bool stop;
+};
+
+static void move_subdir3_and_rmdir(struct __test_metadata *const _metadata,
+				   struct _test_data_layout *const self, int to)
+{
+	int from, to_fd, err;
+	char pathbuf1[255], pathbuf2[255], pathbuf3[255], pathbuf4[255];
+
+	ASSERT_NE(to, self->subdir3_at);
+
+	from = self->subdir3_at;
+	ASSERT_NE(-1, from);
+	to_fd = self->subdir2_fds[to];
+	ASSERT_NE(-1, to_fd);
+
+	snprintf(pathbuf1, sizeof(pathbuf1), SUBDIR_NAME_FORMAT, from);
+	snprintf(pathbuf2, sizeof(pathbuf2),
+		 SUBDIR_NAME_FORMAT "/" SUBDIR2_NAME_FORMAT, from, from);
+	snprintf(pathbuf3, sizeof(pathbuf3),
+		 SUBDIR_NAME_FORMAT "/" SUBDIR2_NAME_FORMAT "/" SUBDIR3_NAME,
+		 from, from);
+	snprintf(pathbuf4, sizeof(pathbuf4),
+		 SUBDIR_NAME_FORMAT "/" SUBDIR2_NAME_FORMAT "/" SUBDIR3_NAME,
+		 to, to);
+
+	close(self->subdir2_fds[from]);
+	close(self->subdir_fds[from]);
+
+	/*
+	 * rename and the 2 following unlinkat must be executed as close as
+	 * possible
+	 */
+
+	err = renameat(self->base_dir_fd, pathbuf3, self->base_dir_fd,
+		       pathbuf4);
+	ASSERT_EQ(0, err)
+	{
+		TH_LOG("Failed to move " SUBDIR3_NAME
+		       " from " SUBDIR_NAME_FORMAT "/" SUBDIR2_NAME_FORMAT
+		       " to " SUBDIR_NAME_FORMAT "/" SUBDIR2_NAME_FORMAT ": %s",
+		       from, from, to, to, strerror(errno));
+	}
+
+	err = unlinkat(self->base_dir_fd, pathbuf2, AT_REMOVEDIR);
+	ASSERT_NE(-1, err)
+	{
+		TH_LOG("Failed to remove %s: %s", pathbuf2, strerror(errno));
+	}
+
+	err = unlinkat(self->base_dir_fd, pathbuf1, AT_REMOVEDIR);
+	ASSERT_NE(-1, err)
+	{
+		TH_LOG("Failed to remove " TEST_DIR "/%s: %s", pathbuf1,
+		       strerror(errno));
+	}
+
+	self->subdir_fds[from] = -1;
+	self->subdir2_fds[from] = -1;
+	self->subdir3_at = to;
+}
+
+static void create_ruleset(struct __test_metadata *const _metadata,
+			   struct _test_data_layout *const self)
+{
+	struct landlock_ruleset_attr ruleset_attr = {
+		.handled_access_fs = LANDLOCK_ACCESS_FS_READ_FILE |
+				     LANDLOCK_ACCESS_FS_READ_DIR |
+				     LANDLOCK_ACCESS_FS_WRITE_FILE |
+				     LANDLOCK_ACCESS_FS_REMOVE_FILE |
+				     LANDLOCK_ACCESS_FS_MAKE_REG |
+				     LANDLOCK_ACCESS_FS_MAKE_DIR |
+				     LANDLOCK_ACCESS_FS_REMOVE_DIR |
+				     LANDLOCK_ACCESS_FS_REFER,
+		.handled_access_net = 0,
+		.scoped = 0,
+	};
+	struct landlock_path_beneath_attr rule_attr = {
+		.parent_fd = -1,
+		.allowed_access = LANDLOCK_ACCESS_FS_READ_FILE |
+				  LANDLOCK_ACCESS_FS_READ_DIR,
+	};
+	int ruleset_fd, err, dfd;
+
+	ruleset_fd =
+		landlock_create_ruleset(&ruleset_attr, sizeof(ruleset_attr), 0);
+	ASSERT_GE(ruleset_fd, 0)
+	{
+		TH_LOG("Failed to create ruleset: %s", strerror(errno));
+	}
+
+	for (int i = 0; i < NUM_SUBDIRS; i++) {
+		/* We want the rule to be on s*d1 */
+		dfd = self->subdir_fds[i];
+		ASSERT_NE(-1, dfd);
+		rule_attr.parent_fd = dfd;
+		err = landlock_add_rule(ruleset_fd, LANDLOCK_RULE_PATH_BENEATH,
+					&rule_attr, 0);
+		ASSERT_EQ(0, err)
+		{
+			TH_LOG("Failed to add rule for " TEST_DIR
+			       "/" SUBDIR_NAME_FORMAT ": %s",
+			       i, strerror(errno));
+		}
+	}
+
+	self->ruleset_fd = ruleset_fd;
+}
+
+static int child_restrict_self(int ruleset_fd)
+{
+	int err, n;
+	char errstr[512];
+
+	err = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
+	if (err != 0) {
+		err = errno;
+		n = snprintf(errstr, sizeof(errstr),
+			     "child process prctl(PR_SET_NO_NEW_PRIVS): %s\n",
+			     strerror(err));
+		write(STDERR_FILENO, errstr, n + 1);
+		return err;
+	}
+
+	err = landlock_restrict_self(ruleset_fd, 0);
+	if (err != 0) {
+		err = errno;
+		n = snprintf(errstr, sizeof(errstr),
+			     "child process landlock_restrict_self: %s\n",
+			     strerror(err));
+		write(STDERR_FILENO, errstr, n + 1);
+		return err;
+	}
+
+	return 0;
+}
+
+static int child_process(int subdir3_fd, int ruleset_fd,
+			 volatile bool *stop_sign)
+{
+	int err;
+
+	err = child_restrict_self(ruleset_fd);
+	if (err != 0)
+		return err;
+
+	while (!*stop_sign) {
+		err = openat(subdir3_fd, TEST_FILE_NAME, O_RDONLY);
+		char errstr[512];
+		int n;
+
+		if (err < 0) {
+			err = errno;
+			n = snprintf(errstr, sizeof(errstr),
+				     "openat(%d -> " SUBDIR3_NAME
+				     ", " TEST_FILE_NAME "): %s\n",
+				     subdir3_fd, strerror(err));
+			write(STDERR_FILENO, errstr, n + 1);
+			return err;
+		}
+		close(err);
+	}
+	return 0;
+}
+
+static void do_test(struct __test_metadata *const _metadata,
+		    struct _test_data_layout *const self)
+{
+	struct shared_region *shr;
+	int child_pid, status, err;
+
+	create_test_file(_metadata, self);
+
+	ASSERT_LE(sizeof(struct shared_region), 4096);
+	shr = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
+		   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
+	ASSERT_NE(shr, MAP_FAILED)
+	{
+		TH_LOG("Failed to create shared memory region with mmap: %s",
+		       strerror(errno));
+		return;
+	}
+
+	*(volatile bool *)(&shr->stop) = false;
+
+	child_pid = fork();
+	if (child_pid == 0) {
+		for (int i = 0; i < NUM_SUBDIRS; i++) {
+			if (self->subdir_fds[i] != -1)
+				close(self->subdir_fds[i]);
+			if (self->subdir2_fds[i] != -1)
+				close(self->subdir2_fds[i]);
+		}
+		close(self->base_dir_fd);
+		exit(child_process(self->subdir3_fd, self->ruleset_fd,
+				    &shr->stop));
+		return;
+	}
+
+	ASSERT_NE(-1, child_pid)
+	{
+		TH_LOG("Failed to fork child process: %s", strerror(errno));
+	}
+
+	close(self->ruleset_fd);
+	self->ruleset_fd = -1;
+
+	for (int i = 1; i < NUM_SUBDIRS; i++) {
+		move_subdir3_and_rmdir(_metadata, self, i);
+		if (RANDOM_DELAY_AFTER_MOVE) {
+			struct timespec ts = { .tv_sec = 0,
+					       .tv_nsec = rand() % 400001 };
+			nanosleep(&ts, NULL);
+		}
+	}
+
+	*(volatile bool *)(&shr->stop) = true;
+	err = waitpid(child_pid, &status, 0);
+	ASSERT_NE(-1, err)
+	{
+		TH_LOG("Failed to wait for child process: %s", strerror(errno));
+	}
+	ASSERT_EQ(child_pid, err);
+	status = WEXITSTATUS(status);
+	ASSERT_EQ(0, status)
+	{
+		TH_LOG("Child process terminated with exit code %d", status);
+	}
+}
+
+FIXTURE_SETUP(layout)
+{
+	create_test_dir(_metadata, self);
+	self->subdir3_at = -1;
+	self->subdir3_fd = -1;
+	self->ruleset_fd = -1;
+	for (int i = 0; i < NUM_SUBDIRS; i++) {
+		self->subdir_fds[i] = -1;
+		self->subdir2_fds[i] = -1;
+	}
+};
+
+FIXTURE_TEARDOWN(layout)
+{
+	cleanup_test_dir(_metadata, self);
+}
+
+TEST_F_TIMEOUT(layout, pathwalk_race_test, TEST_TIME + 10)
+{
+	int start_time = time(NULL);
+
+	while (time(NULL) - start_time < TEST_TIME) {
+		create_subdirs(_metadata, self);
+		create_ruleset(_metadata, self);
+		do_test(_metadata, self);
+		cleanup_subdirs(_metadata, self);
+	}
+}
+
+TEST_HARNESS_MAIN
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC PATCH 3/3] Restart pathwalk on rename seqcount change
  2025-06-04  0:45 [RFC PATCH 0/3] landlock: walk parent dir with RCU, without taking references Tingmao Wang
  2025-06-04  0:45 ` [RFC PATCH 1/3] landlock: walk parent dir " Tingmao Wang
  2025-06-04  0:45 ` [RFC PATCH 2/3] selftests/landlock: Add fs_race_test Tingmao Wang
@ 2025-06-04  0:45 ` Tingmao Wang
  2025-06-04  0:55   ` Al Viro
  2025-06-04  1:09   ` Tingmao Wang
  2 siblings, 2 replies; 13+ messages in thread
From: Tingmao Wang @ 2025-06-04  0:45 UTC (permalink / raw)
  To: Mickaël Salaün, Song Liu, Al Viro
  Cc: Tingmao Wang, Günther Noack, Jan Kara, Alexei Starovoitov,
	Christian Brauner, linux-security-module, linux-fsdevel

This fixes the issue mentioned in the previous patch, by essentially
having two "modes" for the pathwalk code - in the pathwalk_ref == false
case we don't take references and just inspect `d_parent` (unless we have
to `follow_up`).  In the pathwalk_ref == true case, this is the same as
before.

When we detect any renames during a pathwalk_ref == false walk, we restart
with pathwalk_ref == true, re-initializing the layer masks.  I'm not sure
if this is completely correct in regards to is_dom_check - but seems to
work for now.  I can revisit this later.

Signed-off-by: Tingmao Wang <m@maowtm.org>
---
 security/landlock/fs.c | 109 ++++++++++++++++++++++++++++++++++++-----
 1 file changed, 98 insertions(+), 11 deletions(-)

diff --git a/security/landlock/fs.c b/security/landlock/fs.c
index 923737412cfa..6dff5fb6b181 100644
--- a/security/landlock/fs.c
+++ b/security/landlock/fs.c
@@ -771,6 +771,9 @@ static bool is_access_to_paths_allowed(
 		_layer_masks_child2[LANDLOCK_NUM_ACCESS_FS];
 	layer_mask_t(*layer_masks_child1)[LANDLOCK_NUM_ACCESS_FS] = NULL,
 	(*layer_masks_child2)[LANDLOCK_NUM_ACCESS_FS] = NULL;
+	unsigned int rename_seqcount;
+	bool pathwalk_ref = false;
+	const struct landlock_rule *rule;
 
 	if (!access_request_parent1 && !access_request_parent2)
 		return true;
@@ -811,6 +814,7 @@ static bool is_access_to_paths_allowed(
 
 	rcu_read_lock();
 
+restart_pathwalk:
 	if (unlikely(dentry_child1)) {
 		landlock_unmask_layers(
 			find_rule_rcu(domain, dentry_child1),
@@ -833,13 +837,32 @@ static bool is_access_to_paths_allowed(
 	}
 
 	walker_path = *path;
+
+	/*
+	 * Attempt to do a pathwalk without taking dentry references first,
+	 * but if any rename happens while we are doing this, give up and do a
+	 * walk with dget_parent instead.  See comments in
+	 * collect_domain_accesses().
+	 */
+
+	if (!pathwalk_ref) {
+		rename_seqcount = read_seqbegin(&rename_lock);
+		if (rename_seqcount % 2 == 1) {
+			pathwalk_ref = true;
+			path_get(&walker_path);
+		}
+	} else {
+		path_get(&walker_path);
+	}
+
+	rule = find_rule_rcu(domain, walker_path.dentry);
+
 	/*
 	 * We need to walk through all the hierarchy to not miss any relevant
 	 * restriction.
 	 */
 	while (true) {
 		struct dentry *parent_dentry;
-		const struct landlock_rule *rule;
 
 		/*
 		 * If at least all accesses allowed on the destination are
@@ -881,7 +904,6 @@ static bool is_access_to_paths_allowed(
 				break;
 		}
 
-		rule = find_rule_rcu(domain, walker_path.dentry);
 		allowed_parent1 = allowed_parent1 ||
 				  landlock_unmask_layers(
 					  rule, access_masked_parent1,
@@ -899,13 +921,16 @@ static bool is_access_to_paths_allowed(
 jump_up:
 		if (walker_path.dentry == walker_path.mnt->mnt_root) {
 			/* follow_up gets the parent and puts the passed in path */
-			path_get(&walker_path);
+			if (!pathwalk_ref)
+				path_get(&walker_path);
 			if (follow_up(&walker_path)) {
-				path_put(&walker_path);
+				if (!pathwalk_ref)
+					path_put(&walker_path);
 				/* Ignores hidden mount points. */
 				goto jump_up;
 			} else {
-				path_put(&walker_path);
+				if (!pathwalk_ref)
+					path_put(&walker_path);
 				/*
 				 * Stops at the real root.  Denies access
 				 * because not all layers have granted access.
@@ -925,10 +950,27 @@ static bool is_access_to_paths_allowed(
 			}
 			break;
 		}
-		parent_dentry = walker_path.dentry->d_parent;
-		walker_path.dentry = parent_dentry;
+		if (!pathwalk_ref) {
+			parent_dentry = walker_path.dentry->d_parent;
+
+			rule = find_rule_rcu(domain, parent_dentry);
+			if (read_seqretry(&rename_lock, rename_seqcount)) {
+				pathwalk_ref = true;
+				goto restart_pathwalk;
+			} else {
+				walker_path.dentry = parent_dentry;
+			}
+		} else {
+			parent_dentry = dget_parent(walker_path.dentry);
+			dput(walker_path.dentry);
+			walker_path.dentry = parent_dentry;
+			rule = find_rule_rcu(domain, walker_path.dentry);
+		}
 	}
 
+	if (pathwalk_ref)
+		path_put(&walker_path);
+
 	rcu_read_unlock();
 
 	if (!allowed_parent1) {
@@ -1040,22 +1082,55 @@ static bool collect_domain_accesses(
 {
 	unsigned long access_dom;
 	bool ret = false;
+	bool pathwalk_ref = false;
+	unsigned int rename_seqcount;
+	const struct landlock_rule *rule;
+	struct dentry *parent_dentry;
 
 	if (WARN_ON_ONCE(!domain || !mnt_root || !dir || !layer_masks_dom))
 		return true;
 	if (is_nouser_or_private(dir))
 		return true;
 
+	rcu_read_lock();
+
+restart_pathwalk:
 	access_dom = landlock_init_layer_masks(domain, LANDLOCK_MASK_ACCESS_FS,
 					       layer_masks_dom,
 					       LANDLOCK_KEY_INODE);
 
-	rcu_read_lock();
+	/*
+	 * Attempt to do a pathwalk without taking dentry references first, but
+	 * if any rename happens while we are doing this, give up and do a walk
+	 * with dget_parent instead.  This prevents wrong denials in the
+	 * presence of a move followed by an immediate rmdir of the old parent,
+	 * where even when both the original and the new parent has allow
+	 * rules, we might still hit a negative dentry (the deleted old parent)
+	 * and being unable to find either rules.
+	 */
+
+	if (!pathwalk_ref) {
+		rename_seqcount = read_seqbegin(&rename_lock);
+		if (rename_seqcount % 2 == 1) {
+			pathwalk_ref = true;
+			dget(dir);
+		}
+	} else {
+		dget(dir);
+	}
+	rule = find_rule_rcu(domain, dir);
+	/*
+	 * We don't need to check rename_seqcount here because we haven't
+	 * followed any d_parent yet, and the d_inode of the path being
+	 * accessed can't change under us as we have ref on path.dentry.  But
+	 * once we start walking up the path, we need to check the seqcount to
+	 * make sure the rule we got isn't based on a wrong/changing/negative
+	 * dentry.
+	 */
 
 	while (true) {
 		/* Gets all layers allowing all domain accesses. */
-		if (landlock_unmask_layers(find_rule_rcu(domain, dir), access_dom,
-					   layer_masks_dom,
+		if (landlock_unmask_layers(rule, access_dom, layer_masks_dom,
 					   ARRAY_SIZE(*layer_masks_dom))) {
 			/*
 			 * Stops when all handled accesses are allowed by at
@@ -1069,9 +1144,21 @@ static bool collect_domain_accesses(
 		if (dir == mnt_root || WARN_ON_ONCE(IS_ROOT(dir)))
 			break;
 
-		dir = dir->d_parent;
+		if (!pathwalk_ref) {
+			parent_dentry = dir->d_parent;
+			rule = find_rule_rcu(domain, dir);
+			if (read_seqretry(&rename_lock, rename_seqcount)) {
+				pathwalk_ref = true;
+				goto restart_pathwalk;
+			} else {
+				dir = parent_dentry;
+			}
+		}
 	}
 
+	if (pathwalk_ref)
+		dput(dir);
+
 	rcu_read_unlock();
 
 	return ret;
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 3/3] Restart pathwalk on rename seqcount change
  2025-06-04  0:45 ` [RFC PATCH 3/3] Restart pathwalk on rename seqcount change Tingmao Wang
@ 2025-06-04  0:55   ` Al Viro
  2025-06-04  1:12     ` Tingmao Wang
  2025-06-04  1:09   ` Tingmao Wang
  1 sibling, 1 reply; 13+ messages in thread
From: Al Viro @ 2025-06-04  0:55 UTC (permalink / raw)
  To: Tingmao Wang
  Cc: Mickaël Salaün, Song Liu, Günther Noack, Jan Kara,
	Alexei Starovoitov, Christian Brauner, linux-security-module,
	linux-fsdevel

On Wed, Jun 04, 2025 at 01:45:45AM +0100, Tingmao Wang wrote:
> +		rename_seqcount = read_seqbegin(&rename_lock);
> +		if (rename_seqcount % 2 == 1) {

Please, describe the condition when that can happen, preferably
along with a reproducer.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 3/3] Restart pathwalk on rename seqcount change
  2025-06-04  0:45 ` [RFC PATCH 3/3] Restart pathwalk on rename seqcount change Tingmao Wang
  2025-06-04  0:55   ` Al Viro
@ 2025-06-04  1:09   ` Tingmao Wang
  1 sibling, 0 replies; 13+ messages in thread
From: Tingmao Wang @ 2025-06-04  1:09 UTC (permalink / raw)
  To: Mickaël Salaün, Song Liu, Al Viro
  Cc: Günther Noack, Jan Kara, Alexei Starovoitov,
	Christian Brauner, linux-security-module, linux-fsdevel

On 6/4/25 01:45, Tingmao Wang wrote:
> This fixes the issue mentioned in the previous patch, by essentially
> having two "modes" for the pathwalk code - in the pathwalk_ref == false
> case we don't take references and just inspect `d_parent` (unless we have
> to `follow_up`).  In the pathwalk_ref == true case, this is the same as
> before.
> 
> When we detect any renames during a pathwalk_ref == false walk, we restart
> with pathwalk_ref == true, re-initializing the layer masks.  I'm not sure
> if this is completely correct in regards to is_dom_check - but seems to
> work for now.  I can revisit this later.
> 
> Signed-off-by: Tingmao Wang <m@maowtm.org>
> ---
>  security/landlock/fs.c | 109 ++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 98 insertions(+), 11 deletions(-)
> 
> diff --git a/security/landlock/fs.c b/security/landlock/fs.c
> index 923737412cfa..6dff5fb6b181 100644
> --- a/security/landlock/fs.c
> +++ b/security/landlock/fs.c
> @@ -771,6 +771,9 @@ static bool is_access_to_paths_allowed(
>  		_layer_masks_child2[LANDLOCK_NUM_ACCESS_FS];
>  	layer_mask_t(*layer_masks_child1)[LANDLOCK_NUM_ACCESS_FS] = NULL,
>  	(*layer_masks_child2)[LANDLOCK_NUM_ACCESS_FS] = NULL;
> +	unsigned int rename_seqcount;
> +	bool pathwalk_ref = false;
> +	const struct landlock_rule *rule;
>  
>  	if (!access_request_parent1 && !access_request_parent2)
>  		return true;
> @@ -811,6 +814,7 @@ static bool is_access_to_paths_allowed(
>  
>  	rcu_read_lock();
>  
> +restart_pathwalk:
>  	if (unlikely(dentry_child1)) {
>  		landlock_unmask_layers(
>  			find_rule_rcu(domain, dentry_child1),
> @@ -833,13 +837,32 @@ static bool is_access_to_paths_allowed(
>  	}
>  
>  	walker_path = *path;
> +
> +	/*
> +	 * Attempt to do a pathwalk without taking dentry references first,
> +	 * but if any rename happens while we are doing this, give up and do a
> +	 * walk with dget_parent instead.  See comments in
> +	 * collect_domain_accesses().
> +	 */
> +
> +	if (!pathwalk_ref) {
> +		rename_seqcount = read_seqbegin(&rename_lock);
> +		if (rename_seqcount % 2 == 1) {
> +			pathwalk_ref = true;
> +			path_get(&walker_path);
> +		}
> +	} else {
> +		path_get(&walker_path);
> +	}
> +
> +	rule = find_rule_rcu(domain, walker_path.dentry);
> +
>  	/*
>  	 * We need to walk through all the hierarchy to not miss any relevant
>  	 * restriction.
>  	 */
>  	while (true) {
>  		struct dentry *parent_dentry;
> -		const struct landlock_rule *rule;
>  
>  		/*
>  		 * If at least all accesses allowed on the destination are
> @@ -881,7 +904,6 @@ static bool is_access_to_paths_allowed(
>  				break;
>  		}
>  
> -		rule = find_rule_rcu(domain, walker_path.dentry);
>  		allowed_parent1 = allowed_parent1 ||
>  				  landlock_unmask_layers(
>  					  rule, access_masked_parent1,
> @@ -899,13 +921,16 @@ static bool is_access_to_paths_allowed(
>  jump_up:
>  		if (walker_path.dentry == walker_path.mnt->mnt_root) {
>  			/* follow_up gets the parent and puts the passed in path */
> -			path_get(&walker_path);
> +			if (!pathwalk_ref)
> +				path_get(&walker_path);
>  			if (follow_up(&walker_path)) {
> -				path_put(&walker_path);
> +				if (!pathwalk_ref)
> +					path_put(&walker_path);
>  				/* Ignores hidden mount points. */
>  				goto jump_up;
>  			} else {
> -				path_put(&walker_path);
> +				if (!pathwalk_ref)
> +					path_put(&walker_path);
>  				/*
>  				 * Stops at the real root.  Denies access
>  				 * because not all layers have granted access.
> @@ -925,10 +950,27 @@ static bool is_access_to_paths_allowed(
>  			}
>  			break;
>  		}
> -		parent_dentry = walker_path.dentry->d_parent;
> -		walker_path.dentry = parent_dentry;
> +		if (!pathwalk_ref) {
> +			parent_dentry = walker_path.dentry->d_parent;
> +
> +			rule = find_rule_rcu(domain, parent_dentry);
> +			if (read_seqretry(&rename_lock, rename_seqcount)) {
> +				pathwalk_ref = true;
> +				goto restart_pathwalk;
> +			} else {
> +				walker_path.dentry = parent_dentry;
> +			}
> +		} else {
> +			parent_dentry = dget_parent(walker_path.dentry);
> +			dput(walker_path.dentry);
> +			walker_path.dentry = parent_dentry;
> +			rule = find_rule_rcu(domain, walker_path.dentry);
> +		}
>  	}
>  
> +	if (pathwalk_ref)
> +		path_put(&walker_path);
> +
>  	rcu_read_unlock();
>  
>  	if (!allowed_parent1) {
> @@ -1040,22 +1082,55 @@ static bool collect_domain_accesses(
>  {
>  	unsigned long access_dom;
>  	bool ret = false;
> +	bool pathwalk_ref = false;
> +	unsigned int rename_seqcount;
> +	const struct landlock_rule *rule;
> +	struct dentry *parent_dentry;
>  
>  	if (WARN_ON_ONCE(!domain || !mnt_root || !dir || !layer_masks_dom))
>  		return true;
>  	if (is_nouser_or_private(dir))
>  		return true;
>  
> +	rcu_read_lock();
> +
> +restart_pathwalk:
>  	access_dom = landlock_init_layer_masks(domain, LANDLOCK_MASK_ACCESS_FS,
>  					       layer_masks_dom,
>  					       LANDLOCK_KEY_INODE);
>  
> -	rcu_read_lock();
> +	/*
> +	 * Attempt to do a pathwalk without taking dentry references first, but
> +	 * if any rename happens while we are doing this, give up and do a walk
> +	 * with dget_parent instead.  This prevents wrong denials in the
> +	 * presence of a move followed by an immediate rmdir of the old parent,
> +	 * where even when both the original and the new parent has allow
> +	 * rules, we might still hit a negative dentry (the deleted old parent)
> +	 * and being unable to find either rules.
> +	 */
> +
> +	if (!pathwalk_ref) {
> +		rename_seqcount = read_seqbegin(&rename_lock);
> +		if (rename_seqcount % 2 == 1) {
> +			pathwalk_ref = true;
> +			dget(dir);
> +		}
> +	} else {
> +		dget(dir);
> +	}
> +	rule = find_rule_rcu(domain, dir);
> +	/*
> +	 * We don't need to check rename_seqcount here because we haven't
> +	 * followed any d_parent yet, and the d_inode of the path being
> +	 * accessed can't change under us as we have ref on path.dentry.  But
> +	 * once we start walking up the path, we need to check the seqcount to
> +	 * make sure the rule we got isn't based on a wrong/changing/negative
> +	 * dentry.
> +	 */
>  
>  	while (true) {
>  		/* Gets all layers allowing all domain accesses. */
> -		if (landlock_unmask_layers(find_rule_rcu(domain, dir), access_dom,
> -					   layer_masks_dom,
> +		if (landlock_unmask_layers(rule, access_dom, layer_masks_dom,
>  					   ARRAY_SIZE(*layer_masks_dom))) {
>  			/*
>  			 * Stops when all handled accesses are allowed by at
> @@ -1069,9 +1144,21 @@ static bool collect_domain_accesses(
>  		if (dir == mnt_root || WARN_ON_ONCE(IS_ROOT(dir)))
>  			break;
>  
> -		dir = dir->d_parent;
> +		if (!pathwalk_ref) {
> +			parent_dentry = dir->d_parent;
> +			rule = find_rule_rcu(domain, dir);
> +			if (read_seqretry(&rename_lock, rename_seqcount)) {
> +				pathwalk_ref = true;
> +				goto restart_pathwalk;
> +			} else {
> +				dir = parent_dentry;
> +			}
> +		}

Forgot else branch here

diff --git a/security/landlock/fs.c b/security/landlock/fs.c
index 6dff5fb6b181..885121b1beef 100644
--- a/security/landlock/fs.c
+++ b/security/landlock/fs.c
@@ -1153,6 +1153,11 @@ static bool collect_domain_accesses(
 			} else {
 				dir = parent_dentry;
 			}
+		} else {
+			parent_dentry = dget_parent(dir);
+			dput(dir);
+			dir = parent_dentry;
+			rule = find_rule_rcu(domain, dir);
 		}
 	}
 

>  	}
>  
> +	if (pathwalk_ref)
> +		dput(dir);
> +
>  	rcu_read_unlock();
>  
>  	return ret;


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 3/3] Restart pathwalk on rename seqcount change
  2025-06-04  0:55   ` Al Viro
@ 2025-06-04  1:12     ` Tingmao Wang
  2025-06-04  2:21       ` Al Viro
  0 siblings, 1 reply; 13+ messages in thread
From: Tingmao Wang @ 2025-06-04  1:12 UTC (permalink / raw)
  To: Al Viro
  Cc: Mickaël Salaün, Song Liu, Günther Noack, Jan Kara,
	Alexei Starovoitov, Christian Brauner, linux-security-module,
	linux-fsdevel

On 6/4/25 01:55, Al Viro wrote:
> On Wed, Jun 04, 2025 at 01:45:45AM +0100, Tingmao Wang wrote:
>> +		rename_seqcount = read_seqbegin(&rename_lock);
>> +		if (rename_seqcount % 2 == 1) {
> 
> Please, describe the condition when that can happen, preferably
> along with a reproducer.

My understanding is that when a rename is in progress the seqcount is odd,
is that correct?

If that's the case, then the fs_race_test in patch 2 should act as a
reproducer, since it's constantly moving the directory.

I can add a comment to explain this, thanks for pointing out.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 3/3] Restart pathwalk on rename seqcount change
  2025-06-04  1:12     ` Tingmao Wang
@ 2025-06-04  2:21       ` Al Viro
  2025-06-04 18:56         ` Tingmao Wang
  0 siblings, 1 reply; 13+ messages in thread
From: Al Viro @ 2025-06-04  2:21 UTC (permalink / raw)
  To: Tingmao Wang
  Cc: Mickaël Salaün, Song Liu, Günther Noack, Jan Kara,
	Alexei Starovoitov, Christian Brauner, linux-security-module,
	linux-fsdevel

On Wed, Jun 04, 2025 at 02:12:11AM +0100, Tingmao Wang wrote:
> On 6/4/25 01:55, Al Viro wrote:
> > On Wed, Jun 04, 2025 at 01:45:45AM +0100, Tingmao Wang wrote:
> >> +		rename_seqcount = read_seqbegin(&rename_lock);
> >> +		if (rename_seqcount % 2 == 1) {
> > 
> > Please, describe the condition when that can happen, preferably
> > along with a reproducer.
> 
> My understanding is that when a rename is in progress the seqcount is odd,
> is that correct?
> 
> If that's the case, then the fs_race_test in patch 2 should act as a
> reproducer, since it's constantly moving the directory.
> 
> I can add a comment to explain this, thanks for pointing out.

Please, read through the header declaring those primitives and read the
documentation it refers to - it's useful for background.

What's more, look at the area covered by rename_lock - I seriously suspect
that you are greatly overestimating it.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 1/3] landlock: walk parent dir without taking references
  2025-06-04  0:45 ` [RFC PATCH 1/3] landlock: walk parent dir " Tingmao Wang
@ 2025-06-04 17:15   ` Mickaël Salaün
  2025-06-04 21:05     ` Tingmao Wang
  0 siblings, 1 reply; 13+ messages in thread
From: Mickaël Salaün @ 2025-06-04 17:15 UTC (permalink / raw)
  To: Tingmao Wang
  Cc: Song Liu, Al Viro, Günther Noack, Jan Kara,
	Alexei Starovoitov, Christian Brauner, linux-security-module,
	linux-fsdevel

On Wed, Jun 04, 2025 at 01:45:43AM +0100, Tingmao Wang wrote:
> This commit replaces dget_parent with a direct read of d_parent. By
> holding rcu read lock we should be safe in terms of not reading freed
> memory, but this is still problematic due to move+unlink, as will be shown
> with the test in the next commit.
> 
> Note that follow_up is still used when walking up a mountpoint.
> 
> Signed-off-by: Tingmao Wang <m@maowtm.org>
> ---
>  security/landlock/fs.c | 40 ++++++++++++++++++++++------------------
>  1 file changed, 22 insertions(+), 18 deletions(-)
> 
> diff --git a/security/landlock/fs.c b/security/landlock/fs.c
> index 6fee7c20f64d..923737412cfa 100644
> --- a/security/landlock/fs.c
> +++ b/security/landlock/fs.c
> @@ -361,7 +361,7 @@ int landlock_append_fs_rule(struct landlock_ruleset *const ruleset,
>   * Returns NULL if no rule is found or if @dentry is negative.
>   */
>  static const struct landlock_rule *
> -find_rule(const struct landlock_ruleset *const domain,
> +find_rule_rcu(const struct landlock_ruleset *const domain,
>  	  const struct dentry *const dentry)
>  {
>  	const struct landlock_rule *rule;
> @@ -375,10 +375,10 @@ find_rule(const struct landlock_ruleset *const domain,
>  		return NULL;
>  
>  	inode = d_backing_inode(dentry);
> -	rcu_read_lock();
> +	if (unlikely(!inode))
> +		return NULL;
>  	id.key.object = rcu_dereference(landlock_inode(inode)->object);
>  	rule = landlock_find_rule(domain, id);
> -	rcu_read_unlock();
>  	return rule;
>  }
>  
> @@ -809,9 +809,11 @@ static bool is_access_to_paths_allowed(
>  		is_dom_check = false;
>  	}
>  
> +	rcu_read_lock();
> +
>  	if (unlikely(dentry_child1)) {
>  		landlock_unmask_layers(
> -			find_rule(domain, dentry_child1),
> +			find_rule_rcu(domain, dentry_child1),
>  			landlock_init_layer_masks(
>  				domain, LANDLOCK_MASK_ACCESS_FS,
>  				&_layer_masks_child1, LANDLOCK_KEY_INODE),
> @@ -821,7 +823,7 @@ static bool is_access_to_paths_allowed(
>  	}
>  	if (unlikely(dentry_child2)) {
>  		landlock_unmask_layers(
> -			find_rule(domain, dentry_child2),
> +			find_rule_rcu(domain, dentry_child2),
>  			landlock_init_layer_masks(
>  				domain, LANDLOCK_MASK_ACCESS_FS,
>  				&_layer_masks_child2, LANDLOCK_KEY_INODE),
> @@ -831,7 +833,6 @@ static bool is_access_to_paths_allowed(
>  	}
>  
>  	walker_path = *path;
> -	path_get(&walker_path);
>  	/*
>  	 * We need to walk through all the hierarchy to not miss any relevant
>  	 * restriction.
> @@ -880,7 +881,7 @@ static bool is_access_to_paths_allowed(
>  				break;
>  		}
>  
> -		rule = find_rule(domain, walker_path.dentry);
> +		rule = find_rule_rcu(domain, walker_path.dentry);
>  		allowed_parent1 = allowed_parent1 ||
>  				  landlock_unmask_layers(
>  					  rule, access_masked_parent1,
> @@ -897,10 +898,14 @@ static bool is_access_to_paths_allowed(
>  			break;
>  jump_up:
>  		if (walker_path.dentry == walker_path.mnt->mnt_root) {
> +			/* follow_up gets the parent and puts the passed in path */
> +			path_get(&walker_path);
>  			if (follow_up(&walker_path)) {
> +				path_put(&walker_path);

path_put() cannot be safely called in a RCU read-side critical section
because it can free memory which can sleep, and also because it can wait
for a lock.  However, we can call rcu_read_unlock() before and
rcu_read_lock() after (if we hold a reference).

>  				/* Ignores hidden mount points. */
>  				goto jump_up;
>  			} else {
> +				path_put(&walker_path);
>  				/*
>  				 * Stops at the real root.  Denies access
>  				 * because not all layers have granted access.
> @@ -920,11 +925,11 @@ static bool is_access_to_paths_allowed(
>  			}
>  			break;
>  		}
> -		parent_dentry = dget_parent(walker_path.dentry);
> -		dput(walker_path.dentry);
> +		parent_dentry = walker_path.dentry->d_parent;
>  		walker_path.dentry = parent_dentry;
>  	}
> -	path_put(&walker_path);
> +
> +	rcu_read_unlock();
>  
>  	if (!allowed_parent1) {
>  		log_request_parent1->type = LANDLOCK_REQUEST_FS_ACCESS;
> @@ -1045,12 +1050,11 @@ static bool collect_domain_accesses(
>  					       layer_masks_dom,
>  					       LANDLOCK_KEY_INODE);
>  
> -	dget(dir);
> -	while (true) {
> -		struct dentry *parent_dentry;
> +	rcu_read_lock();
>  
> +	while (true) {
>  		/* Gets all layers allowing all domain accesses. */
> -		if (landlock_unmask_layers(find_rule(domain, dir), access_dom,
> +		if (landlock_unmask_layers(find_rule_rcu(domain, dir), access_dom,
>  					   layer_masks_dom,
>  					   ARRAY_SIZE(*layer_masks_dom))) {
>  			/*
> @@ -1065,11 +1069,11 @@ static bool collect_domain_accesses(
>  		if (dir == mnt_root || WARN_ON_ONCE(IS_ROOT(dir)))
>  			break;
>  
> -		parent_dentry = dget_parent(dir);
> -		dput(dir);
> -		dir = parent_dentry;
> +		dir = dir->d_parent;
>  	}
> -	dput(dir);
> +
> +	rcu_read_unlock();
> +
>  	return ret;
>  }
>  
> -- 
> 2.49.0
> 
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 3/3] Restart pathwalk on rename seqcount change
  2025-06-04  2:21       ` Al Viro
@ 2025-06-04 18:56         ` Tingmao Wang
  2025-06-04 19:17           ` Tingmao Wang
  0 siblings, 1 reply; 13+ messages in thread
From: Tingmao Wang @ 2025-06-04 18:56 UTC (permalink / raw)
  To: Al Viro, Mickaël Salaün
  Cc: Song Liu, Günther Noack, Jan Kara, Alexei Starovoitov,
	Christian Brauner, linux-security-module, linux-fsdevel

On 6/4/25 03:21, Al Viro wrote:
> On Wed, Jun 04, 2025 at 02:12:11AM +0100, Tingmao Wang wrote:
>> On 6/4/25 01:55, Al Viro wrote:
>>> On Wed, Jun 04, 2025 at 01:45:45AM +0100, Tingmao Wang wrote:
>>>> +		rename_seqcount = read_seqbegin(&rename_lock);
>>>> +		if (rename_seqcount % 2 == 1) {
>>>
>>> Please, describe the condition when that can happen, preferably
>>> along with a reproducer.
>>
>> My understanding is that when a rename is in progress the seqcount is odd,
>> is that correct?
>>
>> If that's the case, then the fs_race_test in patch 2 should act as a
>> reproducer, since it's constantly moving the directory.
>>
>> I can add a comment to explain this, thanks for pointing out.
>
> Please, read through the header declaring those primitives and read the
> documentation it refers to - it's useful for background.

Ok, so I didn't realize read_seqbegin actually waits for the seqcount to
turn even.  I did read the header earlier when following dget_parent but
probably misremembered and mixed raw_seqcount_begin with read_seqbegin.

>
> What's more, look at the area covered by rename_lock - I seriously suspect
> that you are greatly overestimating it.

Admittedly "when a rename is in progress" is a vague statement.  Looking
at what takes rename_lock in the code, it's only when we actually do
d_move where we take this lock (plus some other places), and the critical
section isn't very large, and does not contain any waits etc.

If we keep read_seqbegin, then that gives landlock more opportunity to do
a reference-less parent walk, but at the expense that a d_move anywhere,
even if it doesn't affect anything we're currently looking at, will
temporarily block this landlocked application (even if not for very long),
and multiple concurrent renames might cause us to wait for longer (but
probably won't starve us since we just need one "cycle" where rename
seqcount is even).

Since we can still safely do a parent walk, just needing to take dentry
references on our way, we could simply fallback to that in this situation.
i.e.  we can use raw_seqcount_begin and keep the seqcount & 1 check.

Now, there is the argument that if d_move is very quick, then it might be
worth waiting for it to finish, and we will fallback to the original
parent walk if the seqcount changes again.  I'm not sure which is best,
but I'm inclining towards changing this to raw_seqcount_begin, as this is
purely an optimization, and we do not _need_ to avoid concurrent renames.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 3/3] Restart pathwalk on rename seqcount change
  2025-06-04 18:56         ` Tingmao Wang
@ 2025-06-04 19:17           ` Tingmao Wang
  0 siblings, 0 replies; 13+ messages in thread
From: Tingmao Wang @ 2025-06-04 19:17 UTC (permalink / raw)
  To: Al Viro, Mickaël Salaün
  Cc: Song Liu, Günther Noack, Jan Kara, Alexei Starovoitov,
	Christian Brauner, linux-security-module, linux-fsdevel

On 6/4/25 19:56, Tingmao Wang wrote:
> On 6/4/25 03:21, Al Viro wrote:
>> On Wed, Jun 04, 2025 at 02:12:11AM +0100, Tingmao Wang wrote:
>>> On 6/4/25 01:55, Al Viro wrote:
>>>> On Wed, Jun 04, 2025 at 01:45:45AM +0100, Tingmao Wang wrote:
>>>>> +		rename_seqcount = read_seqbegin(&rename_lock);
>>>>> +		if (rename_seqcount % 2 == 1) {
>>>>
>>>> Please, describe the condition when that can happen, preferably
>>>> along with a reproducer.
>>>
>>> My understanding is that when a rename is in progress the seqcount is odd,
>>> is that correct?
>>>
>>> If that's the case, then the fs_race_test in patch 2 should act as a
>>> reproducer, since it's constantly moving the directory.
>>>
>>> I can add a comment to explain this, thanks for pointing out.
>>
>> Please, read through the header declaring those primitives and read the
>> documentation it refers to - it's useful for background.
> 
> Ok, so I didn't realize read_seqbegin actually waits for the seqcount to
> turn even.  I did read the header earlier when following dget_parent but
> probably misremembered and mixed raw_seqcount_begin with read_seqbegin.

Right, after more careful looking I think what I actually want here is
raw_read_seqcount.  My apologies.

> 
>>
>> What's more, look at the area covered by rename_lock - I seriously suspect
>> that you are greatly overestimating it.
> 
> Admittedly "when a rename is in progress" is a vague statement.  Looking
> at what takes rename_lock in the code, it's only when we actually do
> d_move where we take this lock (plus some other places), and the critical
> section isn't very large, and does not contain any waits etc.
> 
> If we keep read_seqbegin, then that gives landlock more opportunity to do
> a reference-less parent walk, but at the expense that a d_move anywhere,
> even if it doesn't affect anything we're currently looking at, will
> temporarily block this landlocked application (even if not for very long),
> and multiple concurrent renames might cause us to wait for longer (but
> probably won't starve us since we just need one "cycle" where rename
> seqcount is even).
> 
> Since we can still safely do a parent walk, just needing to take dentry
> references on our way, we could simply fallback to that in this situation.
> i.e.  we can use raw_seqcount_begin and keep the seqcount & 1 check.

This will be raw_read_seqcount(&rename_lock.seqcount)

> 
> Now, there is the argument that if d_move is very quick, then it might be
> worth waiting for it to finish, and we will fallback to the original
> parent walk if the seqcount changes again.  I'm not sure which is best,
> but I'm inclining towards changing this to raw_seqcount_begin, as this is
> purely an optimization, and we do not _need_ to avoid concurrent renames.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 1/3] landlock: walk parent dir without taking references
  2025-06-04 17:15   ` Mickaël Salaün
@ 2025-06-04 21:05     ` Tingmao Wang
  2025-06-06 10:25       ` Christian Brauner
  0 siblings, 1 reply; 13+ messages in thread
From: Tingmao Wang @ 2025-06-04 21:05 UTC (permalink / raw)
  To: Al Viro, Mickaël Salaün
  Cc: Song Liu, Günther Noack, Jan Kara, Alexei Starovoitov,
	Christian Brauner, linux-security-module, linux-fsdevel

On 6/4/25 18:15, Mickaël Salaün wrote:
> On Wed, Jun 04, 2025 at 01:45:43AM +0100, Tingmao Wang wrote:
>> [..]
>> @@ -897,10 +898,14 @@ static bool is_access_to_paths_allowed(
>>  			break;
>>  jump_up:
>>  		if (walker_path.dentry == walker_path.mnt->mnt_root) {
>> +			/* follow_up gets the parent and puts the passed in path */
>> +			path_get(&walker_path);
>>  			if (follow_up(&walker_path)) {
>> +				path_put(&walker_path);
>
> path_put() cannot be safely called in a RCU read-side critical section
> because it can free memory which can sleep, and also because it can wait
> for a lock.  However, we can call rcu_read_unlock() before and
> rcu_read_lock() after (if we hold a reference).

Thanks for pointing this out.

Actually I think this might be even more tricky.  I'm not sure if we can
always rely on the dentry still being there after rcu_read_unlock(),
regardless of whether we do a path_get() before unlocking...  Even when
we're inside a RCU read-side critical section, my understanding is that if
a dentry reaches zero refcount and is selected to be freed (either
immediately or by LRU) from another CPU, dentry_free will do
call_rcu(&dentry->d_u.d_rcu, __d_free) which will cause the dentry to
immediately be freed after our rcu_read_unlock(), regardless of whether we
had a path_get() before that.

In fact because lockref_mark_dead sets the refcount to negative,
path_get() would simply be wrong.  We could use lockref_get_not_dead()
instead, and only continue if we actually acquired a reference, but then
we have the problem of not being able to dput() the dentry acquired by
follow_up(), without risking it getting killed before we can enter RCU
again (although I do wonder if it's possible for it to be killed, given
that there is an active mountpoint on it that we hold a reference for?).

While we could probably do something like "defer the dput() until we next
reach a mountpoint and can rcu_read_unlock()", or use lockref_put_return()
and assert that the dentry must still have refcount > 0 since it's an
in-use mountpoint, after a lot of thinking it seems to me the only clean
solution is to have a mechanism of walking up mounts completely
reference-free.  Maybe what we actually need is choose_mountpoint_rcu().

That function is private, so I guess a question for Al and other VFS
people here is, can we potentially expose an equivalent publicly?
(Perhaps it would only do effectively what __prepend_path in d_path.c
does, and we can track the mount_lock seqcount outside.  Also the fact
that throughout all this we have a valid reference to the leaf dentry we
started from, to me should mean that the mount can't disappear under us
anyway)

>
>>  				/* Ignores hidden mount points. */
>>  				goto jump_up;
>>  			} else {
>> +				path_put(&walker_path);
>>  				/*
>>  				 * Stops at the real root.  Denies access
>>  				 * because not all layers have granted access.
>> @@ -920,11 +925,11 @@ static bool is_access_to_paths_allowed(
>>  			}
>>  			break;
>>  		}
>> -		parent_dentry = dget_parent(walker_path.dentry);
>> -		dput(walker_path.dentry);
>> +		parent_dentry = walker_path.dentry->d_parent;
>>  		walker_path.dentry = parent_dentry;
>>  	}
>> -	path_put(&walker_path);
>> +
>> +	rcu_read_unlock();
>>
>>  	if (!allowed_parent1) {
>>  		log_request_parent1->type = LANDLOCK_REQUEST_FS_ACCESS;
>> @@ -1045,12 +1050,11 @@ static bool collect_domain_accesses(
>>  					       layer_masks_dom,
>>  					       LANDLOCK_KEY_INODE);
>>
>> -	dget(dir);
>> -	while (true) {
>> -		struct dentry *parent_dentry;
>> +	rcu_read_lock();
>>
>> +	while (true) {
>>  		/* Gets all layers allowing all domain accesses. */
>> -		if (landlock_unmask_layers(find_rule(domain, dir), access_dom,
>> +		if (landlock_unmask_layers(find_rule_rcu(domain, dir), access_dom,
>>  					   layer_masks_dom,
>>  					   ARRAY_SIZE(*layer_masks_dom))) {
>>  			/*
>> @@ -1065,11 +1069,11 @@ static bool collect_domain_accesses(
>>  		if (dir == mnt_root || WARN_ON_ONCE(IS_ROOT(dir)))
>>  			break;
>>
>> -		parent_dentry = dget_parent(dir);
>> -		dput(dir);
>> -		dir = parent_dentry;
>> +		dir = dir->d_parent;
>>  	}
>> -	dput(dir);
>> +
>> +	rcu_read_unlock();
>> +
>>  	return ret;
>>  }
>>
>> --
>> 2.49.0
>>
>>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 1/3] landlock: walk parent dir without taking references
  2025-06-04 21:05     ` Tingmao Wang
@ 2025-06-06 10:25       ` Christian Brauner
  0 siblings, 0 replies; 13+ messages in thread
From: Christian Brauner @ 2025-06-06 10:25 UTC (permalink / raw)
  To: Tingmao Wang
  Cc: Al Viro, Mickaël Salaün, Song Liu, Günther Noack,
	Jan Kara, Alexei Starovoitov, linux-security-module,
	linux-fsdevel

On Wed, Jun 04, 2025 at 10:05:01PM +0100, Tingmao Wang wrote:
> On 6/4/25 18:15, Mickaël Salaün wrote:
> > On Wed, Jun 04, 2025 at 01:45:43AM +0100, Tingmao Wang wrote:
> >> [..]
> >> @@ -897,10 +898,14 @@ static bool is_access_to_paths_allowed(
> >>  			break;
> >>  jump_up:
> >>  		if (walker_path.dentry == walker_path.mnt->mnt_root) {
> >> +			/* follow_up gets the parent and puts the passed in path */
> >> +			path_get(&walker_path);
> >>  			if (follow_up(&walker_path)) {
> >> +				path_put(&walker_path);
> >
> > path_put() cannot be safely called in a RCU read-side critical section
> > because it can free memory which can sleep, and also because it can wait
> > for a lock.  However, we can call rcu_read_unlock() before and
> > rcu_read_lock() after (if we hold a reference).
> 
> Thanks for pointing this out.
> 
> Actually I think this might be even more tricky.  I'm not sure if we can
> always rely on the dentry still being there after rcu_read_unlock(),
> regardless of whether we do a path_get() before unlocking...  Even when
> we're inside a RCU read-side critical section, my understanding is that if
> a dentry reaches zero refcount and is selected to be freed (either
> immediately or by LRU) from another CPU, dentry_free will do
> call_rcu(&dentry->d_u.d_rcu, __d_free) which will cause the dentry to
> immediately be freed after our rcu_read_unlock(), regardless of whether we
> had a path_get() before that.
> 
> In fact because lockref_mark_dead sets the refcount to negative,
> path_get() would simply be wrong.  We could use lockref_get_not_dead()
> instead, and only continue if we actually acquired a reference, but then
> we have the problem of not being able to dput() the dentry acquired by
> follow_up(), without risking it getting killed before we can enter RCU
> again (although I do wonder if it's possible for it to be killed, given
> that there is an active mountpoint on it that we hold a reference for?).
> 
> While we could probably do something like "defer the dput() until we next
> reach a mountpoint and can rcu_read_unlock()", or use lockref_put_return()
> and assert that the dentry must still have refcount > 0 since it's an
> in-use mountpoint, after a lot of thinking it seems to me the only clean
> solution is to have a mechanism of walking up mounts completely
> reference-free.  Maybe what we actually need is choose_mountpoint_rcu().
> 
> That function is private, so I guess a question for Al and other VFS
> people here is, can we potentially expose an equivalent publicly?
> (Perhaps it would only do effectively what __prepend_path in d_path.c
> does, and we can track the mount_lock seqcount outside.  Also the fact
> that throughout all this we have a valid reference to the leaf dentry we
> started from, to me should mean that the mount can't disappear under us
> anyway)

I have a very hard time warming up to the idea that you're doing
reference less path walk out of sight of the core VFS. That smells like
a massive landmine. It's barely correct now and it won't get better if
this code bitrots when Al or I massage other codepaths. And that happens
a lot.

We're exposing new infrastructure for the BPF LSM layer that is targeted
also to help Landlock. Use the helper exposed for them. It makes things
easier for all of us.

> 
> >
> >>  				/* Ignores hidden mount points. */
> >>  				goto jump_up;
> >>  			} else {
> >> +				path_put(&walker_path);
> >>  				/*
> >>  				 * Stops at the real root.  Denies access
> >>  				 * because not all layers have granted access.
> >> @@ -920,11 +925,11 @@ static bool is_access_to_paths_allowed(
> >>  			}
> >>  			break;
> >>  		}
> >> -		parent_dentry = dget_parent(walker_path.dentry);
> >> -		dput(walker_path.dentry);
> >> +		parent_dentry = walker_path.dentry->d_parent;
> >>  		walker_path.dentry = parent_dentry;
> >>  	}
> >> -	path_put(&walker_path);
> >> +
> >> +	rcu_read_unlock();
> >>
> >>  	if (!allowed_parent1) {
> >>  		log_request_parent1->type = LANDLOCK_REQUEST_FS_ACCESS;
> >> @@ -1045,12 +1050,11 @@ static bool collect_domain_accesses(
> >>  					       layer_masks_dom,
> >>  					       LANDLOCK_KEY_INODE);
> >>
> >> -	dget(dir);
> >> -	while (true) {
> >> -		struct dentry *parent_dentry;
> >> +	rcu_read_lock();
> >>
> >> +	while (true) {
> >>  		/* Gets all layers allowing all domain accesses. */
> >> -		if (landlock_unmask_layers(find_rule(domain, dir), access_dom,
> >> +		if (landlock_unmask_layers(find_rule_rcu(domain, dir), access_dom,
> >>  					   layer_masks_dom,
> >>  					   ARRAY_SIZE(*layer_masks_dom))) {
> >>  			/*
> >> @@ -1065,11 +1069,11 @@ static bool collect_domain_accesses(
> >>  		if (dir == mnt_root || WARN_ON_ONCE(IS_ROOT(dir)))
> >>  			break;
> >>
> >> -		parent_dentry = dget_parent(dir);
> >> -		dput(dir);
> >> -		dir = parent_dentry;
> >> +		dir = dir->d_parent;
> >>  	}
> >> -	dput(dir);
> >> +
> >> +	rcu_read_unlock();
> >> +
> >>  	return ret;
> >>  }
> >>
> >> --
> >> 2.49.0
> >>
> >>

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2025-06-06 10:25 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-04  0:45 [RFC PATCH 0/3] landlock: walk parent dir with RCU, without taking references Tingmao Wang
2025-06-04  0:45 ` [RFC PATCH 1/3] landlock: walk parent dir " Tingmao Wang
2025-06-04 17:15   ` Mickaël Salaün
2025-06-04 21:05     ` Tingmao Wang
2025-06-06 10:25       ` Christian Brauner
2025-06-04  0:45 ` [RFC PATCH 2/3] selftests/landlock: Add fs_race_test Tingmao Wang
2025-06-04  0:45 ` [RFC PATCH 3/3] Restart pathwalk on rename seqcount change Tingmao Wang
2025-06-04  0:55   ` Al Viro
2025-06-04  1:12     ` Tingmao Wang
2025-06-04  2:21       ` Al Viro
2025-06-04 18:56         ` Tingmao Wang
2025-06-04 19:17           ` Tingmao Wang
2025-06-04  1:09   ` Tingmao Wang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).