ovl: Allow layers from anonymous mount namespaces?

linux-unionfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* ovl: Allow layers from anonymous mount namespaces?
@ 2025-01-23  4:18 Mike Baynton
  2025-01-23 19:19 ` [RFC PATCH 1/2] fs: allow detached mounts in clone_private_mount() Christian Brauner
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Mike Baynton @ 2025-01-23  4:18 UTC (permalink / raw)
  To: overlayfs, brauner

Hi,
I've been eagerly awaiting the arrival of lowerdir+ by file handle, as
it looks likely to be well-suited to simplifying the task a container
runtime must take on in order to provide a set of properly idmapped
lower layers for a user namespaced container. Currently in containerd,
this is done by creating bindmounts for each required lower layer in
order to apply idmapping to them. Each of these bindmounts must be
briefly attached to some path-resolvable mountpoint before the overlay
is created, which seems less than ideal and is contributing to some
cleanup headaches e.g. when other software that may be present jumps on
the new mount and starts security scanning it or whatnot.

In order to better isolate the idmap bindmounts I was hoping to do
something like:

ovl_ctx = fsopen("overlay", FSOPEN_CLOEXEC);

opfd = open_tree(-1, "/path/to/unmapped/layer",
OPEN_TREE_CLONE|OPEN_TREE_CLOEXEC);
mount_setattr(opfd, "", AT_EMPTY_PATH, /* attrs to set a userns_fd */);
dfd = openat(opfd, ".", O_DIRECTORY, mode);

fsconfig(ovl_ctx, FSCONFIG_SET_FD, "lowerdir+", dfd);
// ...other ovl_ctx fsconfigs...
fsconfig(ovl_ctx, FSCONFIG_CMD_CREATE, NULL, NULL, 0);

...and this *almost* works in 6.13. The result of something like this is
that the FSCONFIG_CMD_CREATE fails, with "overlayfs: failed to clone
lowerpath" in dmesg. Investigating a bit, the cause is that the mount
represented by opfd is placed in a newly allocated mount namespace
containing only itself. When overlayfs then tries to make its own
private copy of that mount, it uses clone_private_mount() which subjects
any source mount to a test that its mount namespace is the task's mount
namespace. If I just remove this one check, then userspace code like the
above seems to happily work.

I've tried various things in userspace to move opfd to the task's mount
namespace _without_ also attaching it to a directory tree somewhere as
we do today, but have come up short on a way to do that.

Assuming what I'm trying to do is in line with the intended use case for
these new(er) APIs, I'm wondering if some relatively small kernel change
might be the best way to enable this? Perhaps clone_private_mount(),
which seems to only be used in-tree by overlayfs, could also tolerate
mounts in "anonymous" (when created by alloc_mnt_ns) mount namespaces or
something?

Thanks
Mike

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [RFC PATCH 1/2] fs: allow detached mounts in clone_private_mount()
  2025-01-23  4:18 ovl: Allow layers from anonymous mount namespaces? Mike Baynton
@ 2025-01-23 19:19 ` Christian Brauner
  2025-01-24  5:42   ` Mike Baynton
  2025-01-23 19:19 ` [RFC PATCH 2/2] selftests: add tests for using detached mount with overlayfs Christian Brauner
  2025-01-23 19:21 ` ovl: Allow layers from anonymous mount namespaces? Christian Brauner
  2 siblings, 1 reply; 7+ messages in thread
From: Christian Brauner @ 2025-01-23 19:19 UTC (permalink / raw)
  To: Amir Goldstein, Miklos Szeredi; +Cc: Christian Brauner, overlayfs, Mike Baynton

In container workloads idmapped mounts are often used as layers for
overlayfs. Recently I added the ability to specify layers in overlayfs
as file descriptors instead of path names. It should be possible to
simply use the detached mounts directly when specifying layers instead
of having to attach them beforehand. They are discarded after overlayfs
is mounted anyway so it's pointless system calls for userspace and
pointless locking for the kernel.

This just recently come up again in [1]. So enable clone_private_mount()
to use detached mounts directly. Following conditions must be met:

- Provided path must be the root of a detached mount tree.
- Provided path may not create mount namespace loops.
- Provided path must be mounted.

It would be possible to be stricter and require that the caller must
have CAP_SYS_ADMIN in the owning user namespace of the anonymous mount
namespace but since this restriction isn't enforced for move_mount()
there's no point in enforcing it for clone_private_mount().

Link: https://lore.kernel.org/r/fd8f6574-f737-4743-b220-79c815ee1554@mbaynton.com [1]
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/namespace.c | 78 ++++++++++++++++++++++++++++----------------------
 1 file changed, 43 insertions(+), 35 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 4013fbac354a..3985a695d373 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2287,6 +2287,28 @@ bool has_locked_children(struct mount *mnt, struct dentry *dentry)
 	return false;
 }
 
+/*
+ * Check that there aren't references to earlier/same mount namespaces in the
+ * specified subtree.  Such references can act as pins for mount namespaces
+ * that aren't checked by the mount-cycle checking code, thereby allowing
+ * cycles to be made.
+ */
+static bool check_for_nsfs_mounts(struct mount *subtree)
+{
+	struct mount *p;
+	bool ret = false;
+
+	lock_mount_hash();
+	for (p = subtree; p; p = next_mnt(p, subtree))
+		if (mnt_ns_loop(p->mnt.mnt_root))
+			goto out;
+
+	ret = true;
+out:
+	unlock_mount_hash();
+	return ret;
+}
+
 /**
  * clone_private_mount - create a private clone of a path
  * @path: path to clone
@@ -2295,6 +2317,8 @@ bool has_locked_children(struct mount *mnt, struct dentry *dentry)
  * will not be attached anywhere in the namespace and will be private (i.e.
  * changes to the originating mount won't be propagated into this).
  *
+ * This assumes caller has called or done the equivalent of may_mount().
+ *
  * Release with mntput().
  */
 struct vfsmount *clone_private_mount(const struct path *path)
@@ -2302,30 +2326,36 @@ struct vfsmount *clone_private_mount(const struct path *path)
 	struct mount *old_mnt = real_mount(path->mnt);
 	struct mount *new_mnt;
 
-	down_read(&namespace_sem);
+	scoped_guard(rwsem_read, &namespace_sem)
 	if (IS_MNT_UNBINDABLE(old_mnt))
-		goto invalid;
+		return ERR_PTR(-EINVAL);
+
+	if (mnt_has_parent(old_mnt)) {
+		if (!check_mnt(old_mnt))
+			return ERR_PTR(-EINVAL);
+	} else {
+		/* Make sure this isn't something purely kernel internal. */
+		if (!is_anon_ns(old_mnt->mnt_ns))
+			return ERR_PTR(-EINVAL);
 
-	if (!check_mnt(old_mnt))
-		goto invalid;
+		/* Make sure we don't create mount namespace loops. */
+		if (!check_for_nsfs_mounts(old_mnt))
+			return ERR_PTR(-EINVAL);
+
+		if (!path_mounted(path))
+			return ERR_PTR(-EINVAL);
+	}
 
 	if (has_locked_children(old_mnt, path->dentry))
-		goto invalid;
+		return ERR_PTR(-EINVAL);
 
 	new_mnt = clone_mnt(old_mnt, path->dentry, CL_PRIVATE);
-	up_read(&namespace_sem);
-
 	if (IS_ERR(new_mnt))
-		return ERR_CAST(new_mnt);
+		return ERR_PTR(-EINVAL);
 
 	/* Longterm mount to be removed by kern_unmount*() */
 	new_mnt->mnt_ns = MNT_NS_INTERNAL;
-
 	return &new_mnt->mnt;
-
-invalid:
-	up_read(&namespace_sem);
-	return ERR_PTR(-EINVAL);
 }
 EXPORT_SYMBOL_GPL(clone_private_mount);
 
@@ -3123,28 +3153,6 @@ static inline int tree_contains_unbindable(struct mount *mnt)
 	return 0;
 }
 
-/*
- * Check that there aren't references to earlier/same mount namespaces in the
- * specified subtree.  Such references can act as pins for mount namespaces
- * that aren't checked by the mount-cycle checking code, thereby allowing
- * cycles to be made.
- */
-static bool check_for_nsfs_mounts(struct mount *subtree)
-{
-	struct mount *p;
-	bool ret = false;
-
-	lock_mount_hash();
-	for (p = subtree; p; p = next_mnt(p, subtree))
-		if (mnt_ns_loop(p->mnt.mnt_root))
-			goto out;
-
-	ret = true;
-out:
-	unlock_mount_hash();
-	return ret;
-}
-
 static int do_set_group(struct path *from_path, struct path *to_path)
 {
 	struct mount *from, *to;
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [RFC PATCH 2/2] selftests: add tests for using detached mount with overlayfs
  2025-01-23  4:18 ovl: Allow layers from anonymous mount namespaces? Mike Baynton
  2025-01-23 19:19 ` [RFC PATCH 1/2] fs: allow detached mounts in clone_private_mount() Christian Brauner
@ 2025-01-23 19:19 ` Christian Brauner
  2025-01-23 19:21 ` ovl: Allow layers from anonymous mount namespaces? Christian Brauner
  2 siblings, 0 replies; 7+ messages in thread
From: Christian Brauner @ 2025-01-23 19:19 UTC (permalink / raw)
  To: Amir Goldstein, Miklos Szeredi; +Cc: Christian Brauner, overlayfs, Mike Baynton

Test that it is possible to use detached mounts as overlayfs layers.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 .../overlayfs/set_layers_via_fds.c            | 120 ++++++++++++++++++
 .../filesystems/overlayfs/wrappers.h          |  17 +++
 2 files changed, 137 insertions(+)

diff --git a/tools/testing/selftests/filesystems/overlayfs/set_layers_via_fds.c b/tools/testing/selftests/filesystems/overlayfs/set_layers_via_fds.c
index 1d0ae785a667..897e94a17543 100644
--- a/tools/testing/selftests/filesystems/overlayfs/set_layers_via_fds.c
+++ b/tools/testing/selftests/filesystems/overlayfs/set_layers_via_fds.c
@@ -214,4 +214,124 @@ TEST_F(set_layers_via_fds, set_500_layers_via_fds)
 	ASSERT_EQ(close(fd_overlay), 0);
 }
 
+TEST_F(set_layers_via_fds, set_layers_via_detached_mount_fds)
+{
+	int fd_context, fd_tmpfs, fd_overlay;
+	int layer_fds[] = { [0 ... 8] = -EBADF };
+	bool layers_found[] = { [0 ... 8] =  false };
+	size_t len = 0;
+	char *line = NULL;
+	FILE *f_mountinfo;
+
+	ASSERT_EQ(unshare(CLONE_NEWNS), 0);
+	ASSERT_EQ(sys_mount(NULL, "/", NULL, MS_SLAVE | MS_REC, NULL), 0);
+
+	fd_context = sys_fsopen("tmpfs", 0);
+	ASSERT_GE(fd_context, 0);
+
+	ASSERT_EQ(sys_fsconfig(fd_context, FSCONFIG_CMD_CREATE, NULL, NULL, 0), 0);
+	fd_tmpfs = sys_fsmount(fd_context, 0, 0);
+	ASSERT_GE(fd_tmpfs, 0);
+	ASSERT_EQ(close(fd_context), 0);
+
+	ASSERT_EQ(mkdirat(fd_tmpfs, "w", 0755), 0);
+	ASSERT_EQ(mkdirat(fd_tmpfs, "u", 0755), 0);
+	ASSERT_EQ(mkdirat(fd_tmpfs, "l1", 0755), 0);
+	ASSERT_EQ(mkdirat(fd_tmpfs, "l2", 0755), 0);
+	ASSERT_EQ(mkdirat(fd_tmpfs, "l3", 0755), 0);
+	ASSERT_EQ(mkdirat(fd_tmpfs, "l4", 0755), 0);
+	ASSERT_EQ(mkdirat(fd_tmpfs, "d1", 0755), 0);
+	ASSERT_EQ(mkdirat(fd_tmpfs, "d2", 0755), 0);
+	ASSERT_EQ(mkdirat(fd_tmpfs, "d3", 0755), 0);
+
+	layer_fds[0] = open_tree(fd_tmpfs, "w", OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
+	ASSERT_GE(layer_fds[0], 0);
+
+	layer_fds[1] = open_tree(fd_tmpfs, "u", OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
+	ASSERT_GE(layer_fds[1], 0);
+
+	layer_fds[2] = open_tree(fd_tmpfs, "l1", OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
+	ASSERT_GE(layer_fds[2], 0);
+
+	layer_fds[3] = open_tree(fd_tmpfs, "l2", OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
+	ASSERT_GE(layer_fds[3], 0);
+
+	layer_fds[4] = open_tree(fd_tmpfs, "l3", OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
+	ASSERT_GE(layer_fds[4], 0);
+
+	layer_fds[5] = open_tree(fd_tmpfs, "l4", OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
+	ASSERT_GE(layer_fds[5], 0);
+
+	layer_fds[6] = open_tree(fd_tmpfs, "d1", OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
+	ASSERT_GE(layer_fds[6], 0);
+
+	layer_fds[7] = open_tree(fd_tmpfs, "d2", OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
+	ASSERT_GE(layer_fds[7], 0);
+
+	layer_fds[8] = open_tree(fd_tmpfs, "d3", OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
+	ASSERT_GE(layer_fds[8], 0);
+
+	ASSERT_EQ(close(fd_tmpfs), 0);
+
+	fd_context = sys_fsopen("overlay", 0);
+	ASSERT_GE(fd_context, 0);
+
+	ASSERT_NE(sys_fsconfig(fd_context, FSCONFIG_SET_FD, "lowerdir", NULL, layer_fds[2]), 0);
+
+	ASSERT_EQ(sys_fsconfig(fd_context, FSCONFIG_SET_FD, "workdir",   NULL, layer_fds[0]), 0);
+	ASSERT_EQ(sys_fsconfig(fd_context, FSCONFIG_SET_FD, "upperdir",  NULL, layer_fds[1]), 0);
+	ASSERT_EQ(sys_fsconfig(fd_context, FSCONFIG_SET_FD, "lowerdir+", NULL, layer_fds[2]), 0);
+	ASSERT_EQ(sys_fsconfig(fd_context, FSCONFIG_SET_FD, "lowerdir+", NULL, layer_fds[3]), 0);
+	ASSERT_EQ(sys_fsconfig(fd_context, FSCONFIG_SET_FD, "lowerdir+", NULL, layer_fds[4]), 0);
+	ASSERT_EQ(sys_fsconfig(fd_context, FSCONFIG_SET_FD, "lowerdir+", NULL, layer_fds[5]), 0);
+	ASSERT_EQ(sys_fsconfig(fd_context, FSCONFIG_SET_FD, "datadir+",  NULL, layer_fds[6]), 0);
+	ASSERT_EQ(sys_fsconfig(fd_context, FSCONFIG_SET_FD, "datadir+",  NULL, layer_fds[7]), 0);
+	ASSERT_EQ(sys_fsconfig(fd_context, FSCONFIG_SET_FD, "datadir+",  NULL, layer_fds[8]), 0);
+
+	ASSERT_EQ(sys_fsconfig(fd_context, FSCONFIG_SET_STRING, "metacopy", "on", 0), 0);
+
+	ASSERT_EQ(sys_fsconfig(fd_context, FSCONFIG_CMD_CREATE, NULL, NULL, 0), 0);
+
+	fd_overlay = sys_fsmount(fd_context, 0, 0);
+	ASSERT_GE(fd_overlay, 0);
+
+	ASSERT_EQ(sys_move_mount(fd_overlay, "", -EBADF, "/set_layers_via_fds", MOVE_MOUNT_F_EMPTY_PATH), 0);
+
+	f_mountinfo = fopen("/proc/self/mountinfo", "r");
+	ASSERT_NE(f_mountinfo, NULL);
+
+	while (getline(&line, &len, f_mountinfo) != -1) {
+		char *haystack = line;
+
+		if (strstr(haystack, "workdir=/tmp/w"))
+			layers_found[0] = true;
+		if (strstr(haystack, "upperdir=/tmp/u"))
+			layers_found[1] = true;
+		if (strstr(haystack, "lowerdir+=/tmp/l1"))
+			layers_found[2] = true;
+		if (strstr(haystack, "lowerdir+=/tmp/l2"))
+			layers_found[3] = true;
+		if (strstr(haystack, "lowerdir+=/tmp/l3"))
+			layers_found[4] = true;
+		if (strstr(haystack, "lowerdir+=/tmp/l4"))
+			layers_found[5] = true;
+		if (strstr(haystack, "datadir+=/tmp/d1"))
+			layers_found[6] = true;
+		if (strstr(haystack, "datadir+=/tmp/d2"))
+			layers_found[7] = true;
+		if (strstr(haystack, "datadir+=/tmp/d3"))
+			layers_found[8] = true;
+	}
+	free(line);
+
+	for (int i = 0; i < ARRAY_SIZE(layer_fds); i++) {
+		ASSERT_EQ(layers_found[i], true);
+		ASSERT_EQ(close(layer_fds[i]), 0);
+	}
+
+	ASSERT_EQ(close(fd_context), 0);
+	ASSERT_EQ(close(fd_overlay), 0);
+	ASSERT_EQ(fclose(f_mountinfo), 0);
+}
+
 TEST_HARNESS_MAIN
diff --git a/tools/testing/selftests/filesystems/overlayfs/wrappers.h b/tools/testing/selftests/filesystems/overlayfs/wrappers.h
index 071b95fd2ac0..c38bc48e0cfa 100644
--- a/tools/testing/selftests/filesystems/overlayfs/wrappers.h
+++ b/tools/testing/selftests/filesystems/overlayfs/wrappers.h
@@ -44,4 +44,21 @@ static inline int sys_move_mount(int from_dfd, const char *from_pathname,
 		       to_pathname, flags);
 }
 
+#ifndef OPEN_TREE_CLONE
+#define OPEN_TREE_CLONE 1
+#endif
+
+#ifndef OPEN_TREE_CLOEXEC
+#define OPEN_TREE_CLOEXEC O_CLOEXEC
+#endif
+
+#ifndef AT_RECURSIVE
+#define AT_RECURSIVE 0x8000
+#endif
+
+static inline int sys_open_tree(int dfd, const char *filename, unsigned int flags)
+{
+	return syscall(__NR_open_tree, dfd, filename, flags);
+}
+
 #endif
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: ovl: Allow layers from anonymous mount namespaces?
  2025-01-23  4:18 ovl: Allow layers from anonymous mount namespaces? Mike Baynton
  2025-01-23 19:19 ` [RFC PATCH 1/2] fs: allow detached mounts in clone_private_mount() Christian Brauner
  2025-01-23 19:19 ` [RFC PATCH 2/2] selftests: add tests for using detached mount with overlayfs Christian Brauner
@ 2025-01-23 19:21 ` Christian Brauner
  2025-01-24  5:40   ` Mike Baynton
  2 siblings, 1 reply; 7+ messages in thread
From: Christian Brauner @ 2025-01-23 19:21 UTC (permalink / raw)
  To: Mike Baynton; +Cc: overlayfs

On Wed, Jan 22, 2025 at 10:18:17PM -0600, Mike Baynton wrote:
> Hi,
> I've been eagerly awaiting the arrival of lowerdir+ by file handle, as
> it looks likely to be well-suited to simplifying the task a container
> runtime must take on in order to provide a set of properly idmapped
> lower layers for a user namespaced container. Currently in containerd,
> this is done by creating bindmounts for each required lower layer in
> order to apply idmapping to them. Each of these bindmounts must be
> briefly attached to some path-resolvable mountpoint before the overlay
> is created, which seems less than ideal and is contributing to some
> cleanup headaches e.g. when other software that may be present jumps on
> the new mount and starts security scanning it or whatnot.
> 
> In order to better isolate the idmap bindmounts I was hoping to do
> something like:
> 
> ovl_ctx = fsopen("overlay", FSOPEN_CLOEXEC);
> 
> opfd = open_tree(-1, "/path/to/unmapped/layer",
> OPEN_TREE_CLONE|OPEN_TREE_CLOEXEC);
> mount_setattr(opfd, "", AT_EMPTY_PATH, /* attrs to set a userns_fd */);
> dfd = openat(opfd, ".", O_DIRECTORY, mode);

Unless I forgot detaile, openat() shouldn't be needed as speciyfing
layers via O_PATH file descriptors should just work.

> 
> fsconfig(ovl_ctx, FSCONFIG_SET_FD, "lowerdir+", dfd);
> // ...other ovl_ctx fsconfigs...
> fsconfig(ovl_ctx, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
> 
> ...and this *almost* works in 6.13. The result of something like this is
> that the FSCONFIG_CMD_CREATE fails, with "overlayfs: failed to clone
> lowerpath" in dmesg. Investigating a bit, the cause is that the mount
> represented by opfd is placed in a newly allocated mount namespace
> containing only itself. When overlayfs then tries to make its own
> private copy of that mount, it uses clone_private_mount() which subjects
> any source mount to a test that its mount namespace is the task's mount
> namespace. If I just remove this one check, then userspace code like the
> above seems to happily work.
> 
> I've tried various things in userspace to move opfd to the task's mount
> namespace _without_ also attaching it to a directory tree somewhere as
> we do today, but have come up short on a way to do that.
> 
> Assuming what I'm trying to do is in line with the intended use case for
> these new(er) APIs, I'm wondering if some relatively small kernel change
> might be the best way to enable this? Perhaps clone_private_mount(),
> which seems to only be used in-tree by overlayfs, could also tolerate
> mounts in "anonymous" (when created by alloc_mnt_ns) mount namespaces or
> something?

This should be doable but requires some changes to
clone_private_mount(). I just sent an RFC patchset.
The patchset is entirely untested as of now.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: ovl: Allow layers from anonymous mount namespaces?
  2025-01-23 19:21 ` ovl: Allow layers from anonymous mount namespaces? Christian Brauner
@ 2025-01-24  5:40   ` Mike Baynton
  2025-01-24 11:06     ` Christian Brauner
  0 siblings, 1 reply; 7+ messages in thread
From: Mike Baynton @ 2025-01-24  5:40 UTC (permalink / raw)
  To: Christian Brauner; +Cc: overlayfs

On 1/23/25 13:21, Christian Brauner wrote:
> On Wed, Jan 22, 2025 at 10:18:17PM -0600, Mike Baynton wrote:
>> Hi,
>> I've been eagerly awaiting the arrival of lowerdir+ by file handle, as
>> it looks likely to be well-suited to simplifying the task a container
>> runtime must take on in order to provide a set of properly idmapped
>> lower layers for a user namespaced container. Currently in containerd,
>> this is done by creating bindmounts for each required lower layer in
>> order to apply idmapping to them. Each of these bindmounts must be
>> briefly attached to some path-resolvable mountpoint before the overlay
>> is created, which seems less than ideal and is contributing to some
>> cleanup headaches e.g. when other software that may be present jumps on
>> the new mount and starts security scanning it or whatnot.
>>
>> In order to better isolate the idmap bindmounts I was hoping to do
>> something like:
>>
>> ovl_ctx = fsopen("overlay", FSOPEN_CLOEXEC);
>>
>> opfd = open_tree(-1, "/path/to/unmapped/layer",
>> OPEN_TREE_CLONE|OPEN_TREE_CLOEXEC);
>> mount_setattr(opfd, "", AT_EMPTY_PATH, /* attrs to set a userns_fd */);
>> dfd = openat(opfd, ".", O_DIRECTORY, mode);
> 
> Unless I forgot detaile, openat() shouldn't be needed as speciyfing
> layers via O_PATH file descriptors should just work.

O_PATH ones currently result in EBADF, iirc just because fsconfig with
FSCONFIG_SET_FD looks up the file descriptor in a way that masks O_PATH.
This took some time to work out too, but doesn't strike me as a huge
deal. Although I suppose it's one of those things that if it were
improved far down the road would probably lead to next to nobody
removing the openat().

> 
>>
>> fsconfig(ovl_ctx, FSCONFIG_SET_FD, "lowerdir+", dfd);
>> // ...other ovl_ctx fsconfigs...
>> fsconfig(ovl_ctx, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
>>
>> ...and this *almost* works in 6.13. The result of something like this is
>> that the FSCONFIG_CMD_CREATE fails, with "overlayfs: failed to clone
>> lowerpath" in dmesg. Investigating a bit, the cause is that the mount
>> represented by opfd is placed in a newly allocated mount namespace
>> containing only itself. When overlayfs then tries to make its own
>> private copy of that mount, it uses clone_private_mount() which subjects
>> any source mount to a test that its mount namespace is the task's mount
>> namespace. If I just remove this one check, then userspace code like the
>> above seems to happily work.
>>
>> I've tried various things in userspace to move opfd to the task's mount
>> namespace _without_ also attaching it to a directory tree somewhere as
>> we do today, but have come up short on a way to do that.
>>
>> Assuming what I'm trying to do is in line with the intended use case for
>> these new(er) APIs, I'm wondering if some relatively small kernel change
>> might be the best way to enable this? Perhaps clone_private_mount(),
>> which seems to only be used in-tree by overlayfs, could also tolerate
>> mounts in "anonymous" (when created by alloc_mnt_ns) mount namespaces or
>> something?
> 
> This should be doable but requires some changes to
> clone_private_mount(). I just sent an RFC patchset.
> The patchset is entirely untested as of now.

That's awesome, I really appreciate your prompt attention to this!
Applied and confirmed your patch works for my use case.

Thanks
Mike

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC PATCH 1/2] fs: allow detached mounts in clone_private_mount()
  2025-01-23 19:19 ` [RFC PATCH 1/2] fs: allow detached mounts in clone_private_mount() Christian Brauner
@ 2025-01-24  5:42   ` Mike Baynton
  0 siblings, 0 replies; 7+ messages in thread
From: Mike Baynton @ 2025-01-24  5:42 UTC (permalink / raw)
  To: Christian Brauner, Amir Goldstein, Miklos Szeredi; +Cc: overlayfs

On 1/23/25 13:19, Christian Brauner wrote:
> In container workloads idmapped mounts are often used as layers for
> overlayfs. Recently I added the ability to specify layers in overlayfs
> as file descriptors instead of path names. It should be possible to
> simply use the detached mounts directly when specifying layers instead
> of having to attach them beforehand. They are discarded after overlayfs
> is mounted anyway so it's pointless system calls for userspace and
> pointless locking for the kernel.
> 
> This just recently come up again in [1]. So enable clone_private_mount()
> to use detached mounts directly. Following conditions must be met:
> 
> - Provided path must be the root of a detached mount tree.
> - Provided path may not create mount namespace loops.
> - Provided path must be mounted.
> 
> It would be possible to be stricter and require that the caller must
> have CAP_SYS_ADMIN in the owning user namespace of the anonymous mount
> namespace but since this restriction isn't enforced for move_mount()
> there's no point in enforcing it for clone_private_mount().
> 
> Link: https://lore.kernel.org/r/fd8f6574-f737-4743-b220-79c815ee1554@mbaynton.com [1]
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
>  fs/namespace.c | 78 ++++++++++++++++++++++++++++----------------------
>  1 file changed, 43 insertions(+), 35 deletions(-)
> 
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 4013fbac354a..3985a695d373 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -2287,6 +2287,28 @@ bool has_locked_children(struct mount *mnt, struct dentry *dentry)
>  	return false;
>  }
>  
> +/*
> + * Check that there aren't references to earlier/same mount namespaces in the
> + * specified subtree.  Such references can act as pins for mount namespaces
> + * that aren't checked by the mount-cycle checking code, thereby allowing
> + * cycles to be made.
> + */
> +static bool check_for_nsfs_mounts(struct mount *subtree)
> +{
> +	struct mount *p;
> +	bool ret = false;
> +
> +	lock_mount_hash();
> +	for (p = subtree; p; p = next_mnt(p, subtree))
> +		if (mnt_ns_loop(p->mnt.mnt_root))
> +			goto out;
> +
> +	ret = true;
> +out:
> +	unlock_mount_hash();
> +	return ret;
> +}
> +
>  /**
>   * clone_private_mount - create a private clone of a path
>   * @path: path to clone
> @@ -2295,6 +2317,8 @@ bool has_locked_children(struct mount *mnt, struct dentry *dentry)
>   * will not be attached anywhere in the namespace and will be private (i.e.
>   * changes to the originating mount won't be propagated into this).
>   *
> + * This assumes caller has called or done the equivalent of may_mount().
> + *
>   * Release with mntput().
>   */
>  struct vfsmount *clone_private_mount(const struct path *path)
> @@ -2302,30 +2326,36 @@ struct vfsmount *clone_private_mount(const struct path *path)
>  	struct mount *old_mnt = real_mount(path->mnt);
>  	struct mount *new_mnt;
>  
> -	down_read(&namespace_sem);
> +	scoped_guard(rwsem_read, &namespace_sem)
>  	if (IS_MNT_UNBINDABLE(old_mnt))
> -		goto invalid;
> +		return ERR_PTR(-EINVAL);
> +
> +	if (mnt_has_parent(old_mnt)) {
> +		if (!check_mnt(old_mnt))
> +			return ERR_PTR(-EINVAL);
> +	} else {
> +		/* Make sure this isn't something purely kernel internal. */
> +		if (!is_anon_ns(old_mnt->mnt_ns))
> +			return ERR_PTR(-EINVAL);
>  
> -	if (!check_mnt(old_mnt))
> -		goto invalid;
> +		/* Make sure we don't create mount namespace loops. */
> +		if (!check_for_nsfs_mounts(old_mnt))
> +			return ERR_PTR(-EINVAL);
> +
> +		if (!path_mounted(path))
> +			return ERR_PTR(-EINVAL);
> +	}
>  
>  	if (has_locked_children(old_mnt, path->dentry))
> -		goto invalid;
> +		return ERR_PTR(-EINVAL);
>  
>  	new_mnt = clone_mnt(old_mnt, path->dentry, CL_PRIVATE);
> -	up_read(&namespace_sem);
> -
>  	if (IS_ERR(new_mnt))
> -		return ERR_CAST(new_mnt);
> +		return ERR_PTR(-EINVAL);
>  
>  	/* Longterm mount to be removed by kern_unmount*() */
>  	new_mnt->mnt_ns = MNT_NS_INTERNAL;
> -
>  	return &new_mnt->mnt;
> -
> -invalid:
> -	up_read(&namespace_sem);
> -	return ERR_PTR(-EINVAL);
>  }
>  EXPORT_SYMBOL_GPL(clone_private_mount);
>  
> @@ -3123,28 +3153,6 @@ static inline int tree_contains_unbindable(struct mount *mnt)
>  	return 0;
>  }
>  
> -/*
> - * Check that there aren't references to earlier/same mount namespaces in the
> - * specified subtree.  Such references can act as pins for mount namespaces
> - * that aren't checked by the mount-cycle checking code, thereby allowing
> - * cycles to be made.
> - */
> -static bool check_for_nsfs_mounts(struct mount *subtree)
> -{
> -	struct mount *p;
> -	bool ret = false;
> -
> -	lock_mount_hash();
> -	for (p = subtree; p; p = next_mnt(p, subtree))
> -		if (mnt_ns_loop(p->mnt.mnt_root))
> -			goto out;
> -
> -	ret = true;
> -out:
> -	unlock_mount_hash();
> -	return ret;
> -}
> -
>  static int do_set_group(struct path *from_path, struct path *to_path)
>  {
>  	struct mount *from, *to;

Confirmed this works for the use case I'm interested in of directly
passing idmapped mounts to lower layers.

Tested-by: Mike Baynton <mike@mbaynton.com>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: ovl: Allow layers from anonymous mount namespaces?
  2025-01-24  5:40   ` Mike Baynton
@ 2025-01-24 11:06     ` Christian Brauner
  0 siblings, 0 replies; 7+ messages in thread
From: Christian Brauner @ 2025-01-24 11:06 UTC (permalink / raw)
  To: Mike Baynton; +Cc: overlayfs

On Thu, Jan 23, 2025 at 11:40:41PM -0600, Mike Baynton wrote:
> On 1/23/25 13:21, Christian Brauner wrote:
> > On Wed, Jan 22, 2025 at 10:18:17PM -0600, Mike Baynton wrote:
> >> Hi,
> >> I've been eagerly awaiting the arrival of lowerdir+ by file handle, as
> >> it looks likely to be well-suited to simplifying the task a container
> >> runtime must take on in order to provide a set of properly idmapped
> >> lower layers for a user namespaced container. Currently in containerd,
> >> this is done by creating bindmounts for each required lower layer in
> >> order to apply idmapping to them. Each of these bindmounts must be
> >> briefly attached to some path-resolvable mountpoint before the overlay
> >> is created, which seems less than ideal and is contributing to some
> >> cleanup headaches e.g. when other software that may be present jumps on
> >> the new mount and starts security scanning it or whatnot.
> >>
> >> In order to better isolate the idmap bindmounts I was hoping to do
> >> something like:
> >>
> >> ovl_ctx = fsopen("overlay", FSOPEN_CLOEXEC);
> >>
> >> opfd = open_tree(-1, "/path/to/unmapped/layer",
> >> OPEN_TREE_CLONE|OPEN_TREE_CLOEXEC);
> >> mount_setattr(opfd, "", AT_EMPTY_PATH, /* attrs to set a userns_fd */);
> >> dfd = openat(opfd, ".", O_DIRECTORY, mode);
> > 
> > Unless I forgot detaile, openat() shouldn't be needed as speciyfing
> > layers via O_PATH file descriptors should just work.
> 
> O_PATH ones currently result in EBADF, iirc just because fsconfig with
> FSCONFIG_SET_FD looks up the file descriptor in a way that masks O_PATH.
> This took some time to work out too, but doesn't strike me as a huge
> deal. Although I suppose it's one of those things that if it were
> improved far down the road would probably lead to next to nobody
> removing the openat().

Oh right. We should be able to enable FSONFIG_SET_FD to accept O_PATH
file descriptors. To not break existing users we need do introduce:

diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h
index 4b4bfef6f053..e160e7c61e4b 100644
--- a/include/linux/fs_context.h
+++ b/include/linux/fs_context.h
@@ -55,6 +55,7 @@ enum fs_value_type {
        fs_value_is_blob,               /* Value is a binary blob */
        fs_value_is_filename,           /* Value is a filename* + dirfd */
        fs_value_is_file,               /* Value is a file* */
+       fs_value_is_file_fmode_path,    /* Value is a file* */
 };

 /*
diff --git a/include/linux/fs_parser.h b/include/linux/fs_parser.h
index 3cef566088fc..17ba4951298b 100644
--- a/include/linux/fs_parser.h
+++ b/include/linux/fs_parser.h
@@ -134,6 +134,7 @@ static inline bool fs_validate_description(const char *name,
 #define fsparam_bdev(NAME, OPT)        __fsparam(fs_param_is_blockdev, NAME, OPT, 0, NULL)
 #define fsparam_path(NAME, OPT)        __fsparam(fs_param_is_path, NAME, OPT, 0, NULL)
 #define fsparam_fd(NAME, OPT)  __fsparam(fs_param_is_fd, NAME, OPT, 0, NULL)
+#define fsparam_path_fd(NAME, OPT)     __fsparam(fs_param_is_path_fd, NAME, OPT, 0, NULL)
 #define fsparam_file_or_string(NAME, OPT) \
                                __fsparam(fs_param_is_file_or_string, NAME, OPT, 0, NULL)
 #define fsparam_uid(NAME, OPT) __fsparam(fs_param_is_uid, NAME, OPT, 0, NULL)

and so that we don't break code and autofs FSCONFIG_SET_FD usage. Both
want non O_PATH fds. But otherwise I don't see an issue with this.

^ permalink raw reply related	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2025-01-24 11:06 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-01-23  4:18 ovl: Allow layers from anonymous mount namespaces? Mike Baynton
2025-01-23 19:19 ` [RFC PATCH 1/2] fs: allow detached mounts in clone_private_mount() Christian Brauner
2025-01-24  5:42   ` Mike Baynton
2025-01-23 19:19 ` [RFC PATCH 2/2] selftests: add tests for using detached mount with overlayfs Christian Brauner
2025-01-23 19:21 ` ovl: Allow layers from anonymous mount namespaces? Christian Brauner
2025-01-24  5:40   ` Mike Baynton
2025-01-24 11:06     ` Christian Brauner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).