[PATCH RFC 0/4] bpf: cgroup device guard for non-initial user namespace

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH RFC 0/4] bpf: cgroup device guard for non-initial user namespace
@ 2023-08-14 14:26 Michael Weiß
  2023-08-14 14:26 ` [PATCH RFC 1/4] bpf: add cgroup device guard to flag a cgroup device prog Michael Weiß
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Michael Weiß @ 2023-08-14 14:26 UTC (permalink / raw)
  To: Alexander Mikhalitsyn, Christian Brauner, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Quentin Monnet, Alexander Viro
  Cc: bpf, linux-kernel, linux-fsdevel, gyroidos, Michael Weiß

Introduce the BPF_F_CGROUP_DEVICE_GUARD flag for BPF_PROG_LOAD
which allows to set a cgroup device program to be a device guard.
This may be used to guard actions on device nodes in non-initial
userns, e.g., mknod.

If a container manager restricts its unprivileged (user namespaced)
children by a device cgroup, it is not necessary to deny mknod
anymore. Thus, user space applications may map devices on different
locations in the file system by using mknod() inside the container.

A use case for this, we also use in GyroidOS, is to run virsh for
VMs inside an unprivileged container. virsh creates device nodes,
e.g., "/var/run/libvirt/qemu/11-fgfg.dev/null" which currently fails
in a non-initial userns, even if a cgroup device white list with the
corresponding major, minor of /dev/null exists. Thus, in this case
the usual bind mounts or pre populated device nodes under /dev are
not sufficient.

To circumvent this limitation, we allow mknod() in the VFS if a
bpf cgroup device guard is enabled for the current task and check
CAP_MKNOD for the current user namespace instead of the init userns.

To avoid unusable device nodes on file systems mounted in
non-initial user namespace, may_open_dev() ignores the SB_I_NODEV
for cgroup device guarded tasks.

Tested for a GyroidOS container generated by the cmld using the
following user space patch: https://github.com/gyroidos/cml/pull/394

I discussed this internally with Christian in the UAPI group, earlier.
I put this to the public list now, since also LXC/LXD Folks have
announced interest on this.

This series applies to the latest mainline v6.5-rc6 tag.

Signed-off-by: Michael Weiß <michael.weiss@aisec.fraunhofer.de>
---
Michael Weiß (4):
      bpf: add cgroup device guard to flag a cgroup device prog
      bpf: provide cgroup_device_guard in bpf_prog_info to user space
      device_cgroup: wrapper for bpf cgroup device guard
      fs: allow mknod in non-initial userns using cgroup device guard

 fs/namei.c                     | 19 ++++++++++++++++---
 include/linux/bpf-cgroup.h     |  7 +++++++
 include/linux/bpf.h            |  1 +
 include/linux/device_cgroup.h  |  7 +++++++
 include/uapi/linux/bpf.h       |  8 +++++++-
 kernel/bpf/cgroup.c            | 30 ++++++++++++++++++++++++++++++
 kernel/bpf/syscall.c           |  6 +++++-
 security/device_cgroup.c       | 10 ++++++++++
 tools/bpf/bpftool/prog.c       |  2 ++
 tools/include/uapi/linux/bpf.h |  8 +++++++-
 10 files changed, 92 insertions(+), 6 deletions(-)
---
base-commit: 2ccdd1b13c591d306f0401d98dedc4bdcd02b421
change-id: 20230814-devcg_guard-5398ef84bf7b

Best regards,
-- 
Michael Weiß <michael.weiss@aisec.fraunhofer.de>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH RFC 1/4] bpf: add cgroup device guard to flag a cgroup device prog
  2023-08-14 14:26 [PATCH RFC 0/4] bpf: cgroup device guard for non-initial user namespace Michael Weiß
@ 2023-08-14 14:26 ` Michael Weiß
  2023-08-14 15:54   ` Alexander Mikhalitsyn
  2023-08-15  8:59   ` Christian Brauner
  2023-08-14 14:26 ` [PATCH RFC 2/4] bpf: provide cgroup_device_guard in bpf_prog_info to user space Michael Weiß
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 18+ messages in thread
From: Michael Weiß @ 2023-08-14 14:26 UTC (permalink / raw)
  To: Alexander Mikhalitsyn, Christian Brauner, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Quentin Monnet, Alexander Viro
  Cc: bpf, linux-kernel, linux-fsdevel, gyroidos, Michael Weiß

Introduce the BPF_F_CGROUP_DEVICE_GUARD flag for BPF_PROG_LOAD
which allows to set a cgroup device program to be a device guard.
Later this may be used to guard actions on device nodes in
non-initial userns. For this reason we provide the helper function
cgroup_bpf_device_guard_enabled() to check if a task has a cgroups
device program which is a device guard in its effective set of bpf
programs.

Signed-off-by: Michael Weiß <michael.weiss@aisec.fraunhofer.de>
---
 include/linux/bpf-cgroup.h     |  7 +++++++
 include/linux/bpf.h            |  1 +
 include/uapi/linux/bpf.h       |  5 +++++
 kernel/bpf/cgroup.c            | 30 ++++++++++++++++++++++++++++++
 kernel/bpf/syscall.c           |  5 ++++-
 tools/include/uapi/linux/bpf.h |  5 +++++
 6 files changed, 52 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index 57e9e109257e..112b6093f9fd 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -184,6 +184,8 @@ static inline bool cgroup_bpf_sock_enabled(struct sock *sk,
 	return array != &bpf_empty_prog_array.hdr;
 }
 
+bool cgroup_bpf_device_guard_enabled(struct task_struct *task);
+
 /* Wrappers for __cgroup_bpf_run_filter_skb() guarded by cgroup_bpf_enabled. */
 #define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk, skb)			      \
 ({									      \
@@ -476,6 +478,11 @@ static inline int bpf_percpu_cgroup_storage_update(struct bpf_map *map,
 	return 0;
 }
 
+static bool cgroup_bpf_device_guard_enabled(struct task_struct *task)
+{
+	return false;
+}
+
 #define cgroup_bpf_enabled(atype) (0)
 #define BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, atype, t_ctx) ({ 0; })
 #define BPF_CGROUP_RUN_SA_PROG(sk, uaddr, atype) ({ 0; })
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index f58895830ada..313cce8aee05 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1384,6 +1384,7 @@ struct bpf_prog_aux {
 	bool sleepable;
 	bool tail_call_reachable;
 	bool xdp_has_frags;
+	bool cgroup_device_guard;
 	/* BTF_KIND_FUNC_PROTO for valid attach_btf_id */
 	const struct btf_type *attach_func_proto;
 	/* function name for valid attach_btf_id */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 60a9d59beeab..3be57f7957b1 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1165,6 +1165,11 @@ enum bpf_link_type {
  */
 #define BPF_F_XDP_DEV_BOUND_ONLY	(1U << 6)
 
+/* If BPF_F_CGROUP_DEVICE_GUARD is used in BPF_PROG_LOAD command, the loaded
+ * program will be allowed to guard device access inside user namespaces.
+ */
+#define BPF_F_CGROUP_DEVICE_GUARD	(1U << 7)
+
 /* link_create.kprobe_multi.flags used in LINK_CREATE command for
  * BPF_TRACE_KPROBE_MULTI attach type to create return probe.
  */
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 5b2741aa0d9b..230693ca4cdb 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -2505,6 +2505,36 @@ const struct bpf_verifier_ops cg_sockopt_verifier_ops = {
 const struct bpf_prog_ops cg_sockopt_prog_ops = {
 };
 
+bool
+cgroup_bpf_device_guard_enabled(struct task_struct *task)
+{
+	bool ret;
+	const struct bpf_prog_array *array;
+	const struct bpf_prog_array_item *item;
+	const struct bpf_prog *prog;
+	struct cgroup *cgrp = task_dfl_cgroup(task);
+
+	ret = false;
+
+	array = rcu_access_pointer(cgrp->bpf.effective[CGROUP_DEVICE]);
+	if (array == &bpf_empty_prog_array.hdr)
+		return ret;
+
+	mutex_lock(&cgroup_mutex);
+	array = rcu_dereference_protected(cgrp->bpf.effective[CGROUP_DEVICE],
+					      lockdep_is_held(&cgroup_mutex));
+	item = &array->items[0];
+	while ((prog = READ_ONCE(item->prog))) {
+		if (prog->aux->cgroup_device_guard) {
+			ret = true;
+			break;
+		}
+		item++;
+	}
+	mutex_unlock(&cgroup_mutex);
+	return ret;
+}
+
 /* Common helpers for cgroup hooks. */
 const struct bpf_func_proto *
 cgroup_common_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index a2aef900519c..33ea67c702c1 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2564,7 +2564,8 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr, u32 uattr_size)
 				 BPF_F_SLEEPABLE |
 				 BPF_F_TEST_RND_HI32 |
 				 BPF_F_XDP_HAS_FRAGS |
-				 BPF_F_XDP_DEV_BOUND_ONLY))
+				 BPF_F_XDP_DEV_BOUND_ONLY |
+				 BPF_F_CGROUP_DEVICE_GUARD))
 		return -EINVAL;
 
 	if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) &&
@@ -2651,6 +2652,8 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr, u32 uattr_size)
 	prog->aux->dev_bound = !!attr->prog_ifindex;
 	prog->aux->sleepable = attr->prog_flags & BPF_F_SLEEPABLE;
 	prog->aux->xdp_has_frags = attr->prog_flags & BPF_F_XDP_HAS_FRAGS;
+	prog->aux->cgroup_device_guard =
+		attr->prog_flags & BPF_F_CGROUP_DEVICE_GUARD;
 
 	err = security_bpf_prog_alloc(prog->aux);
 	if (err)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 60a9d59beeab..3be57f7957b1 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1165,6 +1165,11 @@ enum bpf_link_type {
  */
 #define BPF_F_XDP_DEV_BOUND_ONLY	(1U << 6)
 
+/* If BPF_F_CGROUP_DEVICE_GUARD is used in BPF_PROG_LOAD command, the loaded
+ * program will be allowed to guard device access inside user namespaces.
+ */
+#define BPF_F_CGROUP_DEVICE_GUARD	(1U << 7)
+
 /* link_create.kprobe_multi.flags used in LINK_CREATE command for
  * BPF_TRACE_KPROBE_MULTI attach type to create return probe.
  */

-- 
2.30.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH RFC 2/4] bpf: provide cgroup_device_guard in bpf_prog_info to user space
  2023-08-14 14:26 [PATCH RFC 0/4] bpf: cgroup device guard for non-initial user namespace Michael Weiß
  2023-08-14 14:26 ` [PATCH RFC 1/4] bpf: add cgroup device guard to flag a cgroup device prog Michael Weiß
@ 2023-08-14 14:26 ` Michael Weiß
  2023-08-14 14:26 ` [PATCH RFC 3/4] device_cgroup: wrapper for bpf cgroup device guard Michael Weiß
  2023-08-14 14:26 ` [PATCH RFC 4/4] fs: allow mknod in non-initial userns using " Michael Weiß
  3 siblings, 0 replies; 18+ messages in thread
From: Michael Weiß @ 2023-08-14 14:26 UTC (permalink / raw)
  To: Alexander Mikhalitsyn, Christian Brauner, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Quentin Monnet, Alexander Viro
  Cc: bpf, linux-kernel, linux-fsdevel, gyroidos, Michael Weiß

To allow user space tools to check if a device guard is active,
we extend the struct bpf_prog_info by a cgroup_device_guard field.
This is then used by the bpftool in print_prog_header_*() functions.

Output of bpftool, here for the bpf prog of a GyroidOS container:

 # ./bpftool prog show id 37
 37: cgroup_device  tag 1824c08482acee1b  gpl  cgdev_guard
 	loaded_at 2023-08-14T13:47:10+0200  uid 0
 	xlated 456B  jited 311B  memlock 4096B

Signed-off-by: Michael Weiß <michael.weiss@aisec.fraunhofer.de>
---
 include/uapi/linux/bpf.h       | 3 ++-
 kernel/bpf/syscall.c           | 1 +
 tools/bpf/bpftool/prog.c       | 2 ++
 tools/include/uapi/linux/bpf.h | 3 ++-
 4 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 3be57f7957b1..7b383665d5f4 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -6331,7 +6331,8 @@ struct bpf_prog_info {
 	char name[BPF_OBJ_NAME_LEN];
 	__u32 ifindex;
 	__u32 gpl_compatible:1;
-	__u32 :31; /* alignment pad */
+	__u32 cgroup_device_guard:1;
+	__u32 :30; /* alignment pad */
 	__u64 netns_dev;
 	__u64 netns_ino;
 	__u32 nr_jited_ksyms;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 33ea67c702c1..9bc6d5dd2e90 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -4062,6 +4062,7 @@ static int bpf_prog_get_info_by_fd(struct file *file,
 	info.created_by_uid = from_kuid_munged(current_user_ns(),
 					       prog->aux->user->uid);
 	info.gpl_compatible = prog->gpl_compatible;
+	info.cgroup_device_guard = prog->aux->cgroup_device_guard;
 
 	memcpy(info.tag, prog->tag, sizeof(prog->tag));
 	memcpy(info.name, prog->aux->name, sizeof(prog->aux->name));
diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index 8443a149dd17..66d21794b641 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -434,6 +434,7 @@ static void print_prog_header_json(struct bpf_prog_info *info, int fd)
 		     info->tag[4], info->tag[5], info->tag[6], info->tag[7]);
 
 	jsonw_bool_field(json_wtr, "gpl_compatible", info->gpl_compatible);
+	jsonw_bool_field(json_wtr, "cgroup_device_guard", info->cgroup_device_guard);
 	if (info->run_time_ns) {
 		jsonw_uint_field(json_wtr, "run_time_ns", info->run_time_ns);
 		jsonw_uint_field(json_wtr, "run_cnt", info->run_cnt);
@@ -519,6 +520,7 @@ static void print_prog_header_plain(struct bpf_prog_info *info, int fd)
 	fprint_hex(stdout, info->tag, BPF_TAG_SIZE, "");
 	print_dev_plain(info->ifindex, info->netns_dev, info->netns_ino);
 	printf("%s", info->gpl_compatible ? "  gpl" : "");
+	printf("%s", info->cgroup_device_guard ? "  cgdev_guard" : "");
 	if (info->run_time_ns)
 		printf(" run_time_ns %lld run_cnt %lld",
 		       info->run_time_ns, info->run_cnt);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 3be57f7957b1..7b383665d5f4 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -6331,7 +6331,8 @@ struct bpf_prog_info {
 	char name[BPF_OBJ_NAME_LEN];
 	__u32 ifindex;
 	__u32 gpl_compatible:1;
-	__u32 :31; /* alignment pad */
+	__u32 cgroup_device_guard:1;
+	__u32 :30; /* alignment pad */
 	__u64 netns_dev;
 	__u64 netns_ino;
 	__u32 nr_jited_ksyms;

-- 
2.30.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH RFC 3/4] device_cgroup: wrapper for bpf cgroup device guard
  2023-08-14 14:26 [PATCH RFC 0/4] bpf: cgroup device guard for non-initial user namespace Michael Weiß
  2023-08-14 14:26 ` [PATCH RFC 1/4] bpf: add cgroup device guard to flag a cgroup device prog Michael Weiß
  2023-08-14 14:26 ` [PATCH RFC 2/4] bpf: provide cgroup_device_guard in bpf_prog_info to user space Michael Weiß
@ 2023-08-14 14:26 ` Michael Weiß
  2023-08-14 14:26 ` [PATCH RFC 4/4] fs: allow mknod in non-initial userns using " Michael Weiß
  3 siblings, 0 replies; 18+ messages in thread
From: Michael Weiß @ 2023-08-14 14:26 UTC (permalink / raw)
  To: Alexander Mikhalitsyn, Christian Brauner, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Quentin Monnet, Alexander Viro
  Cc: bpf, linux-kernel, linux-fsdevel, gyroidos, Michael Weiß

Export the bpf based cgroup guard through device_cgroup to others.
Thus, devcgroup_task_is_guarded() could be used by subsystems
which already make use of device_cgroup features, such as fs/namei.

Signed-off-by: Michael Weiß <michael.weiss@aisec.fraunhofer.de>
---
 include/linux/device_cgroup.h |  7 +++++++
 security/device_cgroup.c      | 10 ++++++++++
 2 files changed, 17 insertions(+)

diff --git a/include/linux/device_cgroup.h b/include/linux/device_cgroup.h
index d02f32b7514e..00c0748b6a8d 100644
--- a/include/linux/device_cgroup.h
+++ b/include/linux/device_cgroup.h
@@ -65,3 +65,10 @@ static inline int devcgroup_inode_permission(struct inode *inode, int mask)
 static inline int devcgroup_inode_mknod(int mode, dev_t dev)
 { return 0; }
 #endif
+
+#ifdef CONFIG_CGROUP_BPF
+bool devcgroup_task_is_guarded(struct task_struct *task);
+#else
+static inline bool devcgroup_task_is_guarded(struct task_struct *task)
+{ return false; }
+#endif /* CONFIG_CGROUP_BPF */
diff --git a/security/device_cgroup.c b/security/device_cgroup.c
index dc4df7475081..95200a3d0b63 100644
--- a/security/device_cgroup.c
+++ b/security/device_cgroup.c
@@ -874,3 +874,13 @@ int devcgroup_check_permission(short type, u32 major, u32 minor, short access)
 }
 EXPORT_SYMBOL(devcgroup_check_permission);
 #endif /* defined(CONFIG_CGROUP_DEVICE) || defined(CONFIG_CGROUP_BPF) */
+
+#ifdef CONFIG_CGROUP_BPF
+
+bool devcgroup_task_is_guarded(struct task_struct *task)
+{
+	return (cgroup_bpf_enabled(CGROUP_DEVICE) &&
+		cgroup_bpf_device_guard_enabled(task));
+}
+EXPORT_SYMBOL(devcgroup_task_is_guarded);
+#endif /* CONFIG_CGROUP_BPF */

-- 
2.30.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH RFC 4/4] fs: allow mknod in non-initial userns using cgroup device guard
  2023-08-14 14:26 [PATCH RFC 0/4] bpf: cgroup device guard for non-initial user namespace Michael Weiß
                   ` (2 preceding siblings ...)
  2023-08-14 14:26 ` [PATCH RFC 3/4] device_cgroup: wrapper for bpf cgroup device guard Michael Weiß
@ 2023-08-14 14:26 ` Michael Weiß
  2023-08-14 15:24   ` Alexander Mikhalitsyn
  2023-08-15  7:18   ` kernel test robot
  3 siblings, 2 replies; 18+ messages in thread
From: Michael Weiß @ 2023-08-14 14:26 UTC (permalink / raw)
  To: Alexander Mikhalitsyn, Christian Brauner, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Quentin Monnet, Alexander Viro
  Cc: bpf, linux-kernel, linux-fsdevel, gyroidos, Michael Weiß

If a container manager restricts its unprivileged (user namespaced)
children by a device cgroup, it is not necessary to deny mknod
anymore. Thus, user space applications may map devices on different
locations in the file system by using mknod() inside the container.

A use case for this, we also use in GyroidOS, is to run virsh for
VMs inside an unprivileged container. virsh creates device nodes,
e.g., "/var/run/libvirt/qemu/11-fgfg.dev/null" which currently fails
in a non-initial userns, even if a cgroup device white list with the
corresponding major, minor of /dev/null exists. Thus, in this case
the usual bind mounts or pre populated device nodes under /dev are
not sufficient.

To circumvent this limitation, we allow mknod() in fs/namei.c if a
bpf cgroup device guard is enabeld for the current task using
devcgroup_task_is_guarded() and check CAP_MKNOD for the current user
namespace by ns_capable() instead of the global CAP_MKNOD.

To avoid unusable device nodes on file systems mounted in
non-initial user namespace, may_open_dev() ignores the SB_I_NODEV
for cgroup device guarded tasks.

Signed-off-by: Michael Weiß <michael.weiss@aisec.fraunhofer.de>
---
 fs/namei.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index e56ff39a79bc..ef4f22b9575c 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3221,6 +3221,9 @@ EXPORT_SYMBOL(vfs_mkobj);

 bool may_open_dev(const struct path *path)
 {
+	if (devcgroup_task_is_guarded(current))
+		return !(path->mnt->mnt_flags & MNT_NODEV);
+
 	return !(path->mnt->mnt_flags & MNT_NODEV) &&
 		!(path->mnt->mnt_sb->s_iflags & SB_I_NODEV);
 }
@@ -3976,9 +3979,19 @@ int vfs_mknod(struct mnt_idmap *idmap, struct inode *dir,
 	if (error)
 		return error;

-	if ((S_ISCHR(mode) || S_ISBLK(mode)) && !is_whiteout &&
-	    !capable(CAP_MKNOD))
-		return -EPERM;
+	/*
+	 * In case of a device cgroup restirction allow mknod in user
+	 * namespace. Otherwise just check global capability; thus,
+	 * mknod is also disabled for user namespace other than the
+	 * initial one.
+	 */
+	if ((S_ISCHR(mode) || S_ISBLK(mode)) && !is_whiteout) {
+		if (devcgroup_task_is_guarded(current)) {
+			if (!ns_capable(current_user_ns(), CAP_MKNOD))
+				return -EPERM;
+		} else if (!capable(CAP_MKNOD))
+			return -EPERM;
+	}

 	if (!dir->i_op->mknod)
 		return -EPERM;

-- 
2.30.2

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 4/4] fs: allow mknod in non-initial userns using cgroup device guard
  2023-08-14 14:26 ` [PATCH RFC 4/4] fs: allow mknod in non-initial userns using " Michael Weiß
@ 2023-08-14 15:24   ` Alexander Mikhalitsyn
  2023-08-15  7:18   ` kernel test robot
  1 sibling, 0 replies; 18+ messages in thread
From: Alexander Mikhalitsyn @ 2023-08-14 15:24 UTC (permalink / raw)
  To: Michael Weiß
  Cc: Christian Brauner, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Quentin Monnet, Alexander Viro, bpf, linux-kernel, linux-fsdevel,
	gyroidos, stgraber

+CC Stéphane Graber <stgraber@ubuntu.com>


On Mon, Aug 14, 2023 at 4:26 PM Michael Weiß
<michael.weiss@aisec.fraunhofer.de> wrote:
>
> If a container manager restricts its unprivileged (user namespaced)
> children by a device cgroup, it is not necessary to deny mknod
> anymore. Thus, user space applications may map devices on different
> locations in the file system by using mknod() inside the container.
>
> A use case for this, we also use in GyroidOS, is to run virsh for
> VMs inside an unprivileged container. virsh creates device nodes,
> e.g., "/var/run/libvirt/qemu/11-fgfg.dev/null" which currently fails
> in a non-initial userns, even if a cgroup device white list with the
> corresponding major, minor of /dev/null exists. Thus, in this case
> the usual bind mounts or pre populated device nodes under /dev are
> not sufficient.
>
> To circumvent this limitation, we allow mknod() in fs/namei.c if a
> bpf cgroup device guard is enabeld for the current task using
> devcgroup_task_is_guarded() and check CAP_MKNOD for the current user
> namespace by ns_capable() instead of the global CAP_MKNOD.
>
> To avoid unusable device nodes on file systems mounted in
> non-initial user namespace, may_open_dev() ignores the SB_I_NODEV
> for cgroup device guarded tasks.
>
> Signed-off-by: Michael Weiß <michael.weiss@aisec.fraunhofer.de>
> ---
>  fs/namei.c | 19 ++++++++++++++++---
>  1 file changed, 16 insertions(+), 3 deletions(-)
>
> diff --git a/fs/namei.c b/fs/namei.c
> index e56ff39a79bc..ef4f22b9575c 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -3221,6 +3221,9 @@ EXPORT_SYMBOL(vfs_mkobj);
>
>  bool may_open_dev(const struct path *path)
>  {
> +       if (devcgroup_task_is_guarded(current))
> +               return !(path->mnt->mnt_flags & MNT_NODEV);
> +
>         return !(path->mnt->mnt_flags & MNT_NODEV) &&
>                 !(path->mnt->mnt_sb->s_iflags & SB_I_NODEV);
>  }
> @@ -3976,9 +3979,19 @@ int vfs_mknod(struct mnt_idmap *idmap, struct inode *dir,
>         if (error)
>                 return error;
>
> -       if ((S_ISCHR(mode) || S_ISBLK(mode)) && !is_whiteout &&
> -           !capable(CAP_MKNOD))
> -               return -EPERM;
> +       /*
> +        * In case of a device cgroup restirction allow mknod in user
> +        * namespace. Otherwise just check global capability; thus,
> +        * mknod is also disabled for user namespace other than the
> +        * initial one.
> +        */
> +       if ((S_ISCHR(mode) || S_ISBLK(mode)) && !is_whiteout) {
> +               if (devcgroup_task_is_guarded(current)) {
> +                       if (!ns_capable(current_user_ns(), CAP_MKNOD))
> +                               return -EPERM;
> +               } else if (!capable(CAP_MKNOD))
> +                       return -EPERM;
> +       }
>
>         if (!dir->i_op->mknod)
>                 return -EPERM;
>
> --
> 2.30.2
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 1/4] bpf: add cgroup device guard to flag a cgroup device prog
  2023-08-14 14:26 ` [PATCH RFC 1/4] bpf: add cgroup device guard to flag a cgroup device prog Michael Weiß
@ 2023-08-14 15:54   ` Alexander Mikhalitsyn
  2023-08-17 15:50     ` Michael Weiß
  2023-08-15  8:59   ` Christian Brauner
  1 sibling, 1 reply; 18+ messages in thread
From: Alexander Mikhalitsyn @ 2023-08-14 15:54 UTC (permalink / raw)
  To: Michael Weiß
  Cc: Christian Brauner, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Quentin Monnet, Alexander Viro, bpf, linux-kernel, linux-fsdevel,
	gyroidos, stgraber

On Mon, Aug 14, 2023 at 4:32 PM Michael Weiß
<michael.weiss@aisec.fraunhofer.de> wrote:
>
> Introduce the BPF_F_CGROUP_DEVICE_GUARD flag for BPF_PROG_LOAD
> which allows to set a cgroup device program to be a device guard.
> Later this may be used to guard actions on device nodes in
> non-initial userns. For this reason we provide the helper function
> cgroup_bpf_device_guard_enabled() to check if a task has a cgroups
> device program which is a device guard in its effective set of bpf
> programs.
>
> Signed-off-by: Michael Weiß <michael.weiss@aisec.fraunhofer.de>

Hi Michael!

Thanks for working on this. It's also very useful for the LXC system
containers project.

> ---
>  include/linux/bpf-cgroup.h     |  7 +++++++
>  include/linux/bpf.h            |  1 +
>  include/uapi/linux/bpf.h       |  5 +++++
>  kernel/bpf/cgroup.c            | 30 ++++++++++++++++++++++++++++++
>  kernel/bpf/syscall.c           |  5 ++++-
>  tools/include/uapi/linux/bpf.h |  5 +++++
>  6 files changed, 52 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
> index 57e9e109257e..112b6093f9fd 100644
> --- a/include/linux/bpf-cgroup.h
> +++ b/include/linux/bpf-cgroup.h
> @@ -184,6 +184,8 @@ static inline bool cgroup_bpf_sock_enabled(struct sock *sk,
>         return array != &bpf_empty_prog_array.hdr;
>  }
>
> +bool cgroup_bpf_device_guard_enabled(struct task_struct *task);
> +
>  /* Wrappers for __cgroup_bpf_run_filter_skb() guarded by cgroup_bpf_enabled. */
>  #define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk, skb)                            \
>  ({                                                                           \
> @@ -476,6 +478,11 @@ static inline int bpf_percpu_cgroup_storage_update(struct bpf_map *map,
>         return 0;
>  }
>
> +static bool cgroup_bpf_device_guard_enabled(struct task_struct *task)
> +{
> +       return false;
> +}
> +
>  #define cgroup_bpf_enabled(atype) (0)
>  #define BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, atype, t_ctx) ({ 0; })
>  #define BPF_CGROUP_RUN_SA_PROG(sk, uaddr, atype) ({ 0; })
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index f58895830ada..313cce8aee05 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1384,6 +1384,7 @@ struct bpf_prog_aux {
>         bool sleepable;
>         bool tail_call_reachable;
>         bool xdp_has_frags;
> +       bool cgroup_device_guard;
>         /* BTF_KIND_FUNC_PROTO for valid attach_btf_id */
>         const struct btf_type *attach_func_proto;
>         /* function name for valid attach_btf_id */
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 60a9d59beeab..3be57f7957b1 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1165,6 +1165,11 @@ enum bpf_link_type {
>   */
>  #define BPF_F_XDP_DEV_BOUND_ONLY       (1U << 6)
>
> +/* If BPF_F_CGROUP_DEVICE_GUARD is used in BPF_PROG_LOAD command, the loaded
> + * program will be allowed to guard device access inside user namespaces.
> + */
> +#define BPF_F_CGROUP_DEVICE_GUARD      (1U << 7)
> +
>  /* link_create.kprobe_multi.flags used in LINK_CREATE command for
>   * BPF_TRACE_KPROBE_MULTI attach type to create return probe.
>   */
> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> index 5b2741aa0d9b..230693ca4cdb 100644
> --- a/kernel/bpf/cgroup.c
> +++ b/kernel/bpf/cgroup.c
> @@ -2505,6 +2505,36 @@ const struct bpf_verifier_ops cg_sockopt_verifier_ops = {
>  const struct bpf_prog_ops cg_sockopt_prog_ops = {
>  };
>
> +bool
> +cgroup_bpf_device_guard_enabled(struct task_struct *task)
> +{
> +       bool ret;
> +       const struct bpf_prog_array *array;
> +       const struct bpf_prog_array_item *item;
> +       const struct bpf_prog *prog;
> +       struct cgroup *cgrp = task_dfl_cgroup(task);
> +
> +       ret = false;
> +
> +       array = rcu_access_pointer(cgrp->bpf.effective[CGROUP_DEVICE]);
> +       if (array == &bpf_empty_prog_array.hdr)
> +               return ret;
> +
> +       mutex_lock(&cgroup_mutex);

-> cgroup_lock();

> +       array = rcu_dereference_protected(cgrp->bpf.effective[CGROUP_DEVICE],
> +                                             lockdep_is_held(&cgroup_mutex));
> +       item = &array->items[0];
> +       while ((prog = READ_ONCE(item->prog))) {
> +               if (prog->aux->cgroup_device_guard) {
> +                       ret = true;
> +                       break;
> +               }
> +               item++;
> +       }
> +       mutex_unlock(&cgroup_mutex);

Don't we want to make this function specific to "current" task? This
allows to make locking lighter like in
__cgroup_bpf_check_dev_permission
https://github.com/torvalds/linux/blob/2ccdd1b13c591d306f0401d98dedc4bdcd02b421/kernel/bpf/cgroup.c#L1531
Here we have only RCU read lock.

AFAIK, cgroup_mutex is a heavily-contended lock.

Kind regards,
Alex

> +       return ret;
> +}
> +
>  /* Common helpers for cgroup hooks. */
>  const struct bpf_func_proto *
>  cgroup_common_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index a2aef900519c..33ea67c702c1 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -2564,7 +2564,8 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr, u32 uattr_size)
>                                  BPF_F_SLEEPABLE |
>                                  BPF_F_TEST_RND_HI32 |
>                                  BPF_F_XDP_HAS_FRAGS |
> -                                BPF_F_XDP_DEV_BOUND_ONLY))
> +                                BPF_F_XDP_DEV_BOUND_ONLY |
> +                                BPF_F_CGROUP_DEVICE_GUARD))
>                 return -EINVAL;
>
>         if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) &&
> @@ -2651,6 +2652,8 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr, u32 uattr_size)
>         prog->aux->dev_bound = !!attr->prog_ifindex;
>         prog->aux->sleepable = attr->prog_flags & BPF_F_SLEEPABLE;
>         prog->aux->xdp_has_frags = attr->prog_flags & BPF_F_XDP_HAS_FRAGS;
> +       prog->aux->cgroup_device_guard =
> +               attr->prog_flags & BPF_F_CGROUP_DEVICE_GUARD;
>
>         err = security_bpf_prog_alloc(prog->aux);
>         if (err)
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 60a9d59beeab..3be57f7957b1 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -1165,6 +1165,11 @@ enum bpf_link_type {
>   */
>  #define BPF_F_XDP_DEV_BOUND_ONLY       (1U << 6)
>
> +/* If BPF_F_CGROUP_DEVICE_GUARD is used in BPF_PROG_LOAD command, the loaded
> + * program will be allowed to guard device access inside user namespaces.
> + */
> +#define BPF_F_CGROUP_DEVICE_GUARD      (1U << 7)
> +
>  /* link_create.kprobe_multi.flags used in LINK_CREATE command for
>   * BPF_TRACE_KPROBE_MULTI attach type to create return probe.
>   */
>
> --
> 2.30.2
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 4/4] fs: allow mknod in non-initial userns using cgroup device guard
  2023-08-14 14:26 ` [PATCH RFC 4/4] fs: allow mknod in non-initial userns using " Michael Weiß
  2023-08-14 15:24   ` Alexander Mikhalitsyn
@ 2023-08-15  7:18   ` kernel test robot
  2023-08-15  7:49     ` Alexander Mikhalitsyn
  1 sibling, 1 reply; 18+ messages in thread
From: kernel test robot @ 2023-08-15  7:18 UTC (permalink / raw)
  To: Michael Weiß
  Cc: oe-lkp, lkp, linux-fsdevel, Alexander Mikhalitsyn,
	Christian Brauner, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Quentin Monnet, Alexander Viro, bpf, linux-kernel, gyroidos,
	Michael Weiß, oliver.sang



Hello,

kernel test robot noticed "WARNING:suspicious_RCU_usage" on:

commit: bffc333633f1e681c01ada11bd695aa220518bd8 ("[PATCH RFC 4/4] fs: allow mknod in non-initial userns using cgroup device guard")
url: https://github.com/intel-lab-lkp/linux/commits/Michael-Wei/bpf-add-cgroup-device-guard-to-flag-a-cgroup-device-prog/20230814-224110
patch link: https://lore.kernel.org/all/20230814-devcg_guard-v1-4-654971ab88b1@aisec.fraunhofer.de/
patch subject: [PATCH RFC 4/4] fs: allow mknod in non-initial userns using cgroup device guard

in testcase: boot

compiler: gcc-12
test machine: qemu-system-i386 -enable-kvm -cpu SandyBridge -smp 2 -m 4G

(please refer to attached dmesg/kmsg for entire log/backtrace)



If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202308151506.6be3b169-oliver.sang@intel.com



[   14.468719][  T139]
[   14.468999][  T139] =============================
[   14.469545][  T139] WARNING: suspicious RCU usage
[   14.469968][  T139] 6.5.0-rc6-00004-gbffc333633f1 #1 Not tainted
[   14.470520][  T139] -----------------------------
[   14.470940][  T139] include/linux/cgroup.h:423 suspicious rcu_dereference_check() usage!
[   14.471703][  T139]
[   14.471703][  T139] other info that might help us debug this:
[   14.471703][  T139]
[   14.472692][  T139]
[   14.472692][  T139] rcu_scheduler_active = 2, debug_locks = 1
[   14.473469][  T139] no locks held by (journald)/139.
[   14.473935][  T139]
[   14.473935][  T139] stack backtrace:
[   14.474454][  T139] CPU: 1 PID: 139 Comm: (journald) Not tainted 6.5.0-rc6-00004-gbffc333633f1 #1
[   14.475296][  T139] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[   14.476298][  T139] Call Trace:
[   14.476608][  T139]  dump_stack_lvl+0x78/0x8c
[   14.477055][  T139]  dump_stack+0x12/0x18
[   14.477420][  T139]  lockdep_rcu_suspicious+0x153/0x1a4
[   14.477928][  T139]  cgroup_bpf_device_guard_enabled+0x14f/0x168
[   14.478476][  T139]  devcgroup_task_is_guarded+0x10/0x20
[   14.478973][  T139]  may_open_dev+0x11/0x44
[   14.479367][  T139]  may_open+0x115/0x13c
[   14.479727][  T139]  do_open+0xa1/0x378
[   14.480113][  T139]  path_openat+0xdc/0x1bc
[   14.480512][  T139]  do_filp_open+0x91/0x124
[   14.480911][  T139]  ? lock_release+0x62/0x118
[   14.481329][  T139]  ? _raw_spin_unlock+0x18/0x34
[   14.481797][  T139]  ? alloc_fd+0x112/0x1c4
[   14.482183][  T139]  do_sys_openat2+0x7a/0xa0
[   14.482592][  T139]  __ia32_sys_openat+0x66/0x9c
[   14.483065][  T139]  do_int80_syscall_32+0x27/0x48
[   14.483502][  T139]  entry_INT80_32+0x10d/0x10d
[   14.483962][  T139] EIP: 0xa7f39092
[   14.484267][  T139] Code: 00 00 00 e9 90 ff ff ff ff a3 24 00 00 00 68 30 00 00 00 e9 80 ff ff ff ff a3 f8 ff ff ff 66 90 00 00 00 00 00 00 00 00 cd 80 <c3> 8d b4
 26 00 00 00 00 8d b6 00 00 00 00 8b 1c 24 c3 8d b4 26 00
[   14.485995][  T139] EAX: ffffffda EBX: ffffff9c ECX: 005df542 EDX: 00008100
[   14.486622][  T139] ESI: 00000000 EDI: 00000000 EBP: affeb888 ESP: affeb6ec
[   14.487225][  T139] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00200246



The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20230815/202308151506.6be3b169-oliver.sang@intel.com



-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 4/4] fs: allow mknod in non-initial userns using cgroup device guard
  2023-08-15  7:18   ` kernel test robot
@ 2023-08-15  7:49     ` Alexander Mikhalitsyn
  0 siblings, 0 replies; 18+ messages in thread
From: Alexander Mikhalitsyn @ 2023-08-15  7:49 UTC (permalink / raw)
  To: kernel test robot
  Cc: Michael Weiß, oe-lkp, lkp, linux-fsdevel, Christian Brauner,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Quentin Monnet,
	Alexander Viro, bpf, linux-kernel, gyroidos, stgraber

On Tue, Aug 15, 2023 at 9:18 AM kernel test robot <oliver.sang@intel.com> wrote:
>
>
>
> Hello,
>
> kernel test robot noticed "WARNING:suspicious_RCU_usage" on:
>
> commit: bffc333633f1e681c01ada11bd695aa220518bd8 ("[PATCH RFC 4/4] fs: allow mknod in non-initial userns using cgroup device guard")
> url: https://github.com/intel-lab-lkp/linux/commits/Michael-Wei/bpf-add-cgroup-device-guard-to-flag-a-cgroup-device-prog/20230814-224110
> patch link: https://lore.kernel.org/all/20230814-devcg_guard-v1-4-654971ab88b1@aisec.fraunhofer.de/
> patch subject: [PATCH RFC 4/4] fs: allow mknod in non-initial userns using cgroup device guard
>
> in testcase: boot
>
> compiler: gcc-12
> test machine: qemu-system-i386 -enable-kvm -cpu SandyBridge -smp 2 -m 4G
>
> (please refer to attached dmesg/kmsg for entire log/backtrace)
>
>
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <oliver.sang@intel.com>
> | Closes: https://lore.kernel.org/oe-lkp/202308151506.6be3b169-oliver.sang@intel.com
>
>
>
> [   14.468719][  T139]
> [   14.468999][  T139] =============================
> [   14.469545][  T139] WARNING: suspicious RCU usage
> [   14.469968][  T139] 6.5.0-rc6-00004-gbffc333633f1 #1 Not tainted
> [   14.470520][  T139] -----------------------------
> [   14.470940][  T139] include/linux/cgroup.h:423 suspicious rcu_dereference_check() usage!

Most likely it's because in "cgroup_bpf_device_guard_enabled" function:

struct cgroup *cgrp = task_dfl_cgroup(task);

should be under rcu_read_lock (or cgroup_mutex). If we get rid of
cgroup_mutex and make cgroup_bpf_device_guard_enabled
function specific to "current" task we will solve this issue too.

> [   14.471703][  T139]
> [   14.471703][  T139] other info that might help us debug this:
> [   14.471703][  T139]
> [   14.472692][  T139]
> [   14.472692][  T139] rcu_scheduler_active = 2, debug_locks = 1
> [   14.473469][  T139] no locks held by (journald)/139.
> [   14.473935][  T139]
> [   14.473935][  T139] stack backtrace:
> [   14.474454][  T139] CPU: 1 PID: 139 Comm: (journald) Not tainted 6.5.0-rc6-00004-gbffc333633f1 #1
> [   14.475296][  T139] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> [   14.476298][  T139] Call Trace:
> [   14.476608][  T139]  dump_stack_lvl+0x78/0x8c
> [   14.477055][  T139]  dump_stack+0x12/0x18
> [   14.477420][  T139]  lockdep_rcu_suspicious+0x153/0x1a4
> [   14.477928][  T139]  cgroup_bpf_device_guard_enabled+0x14f/0x168
> [   14.478476][  T139]  devcgroup_task_is_guarded+0x10/0x20
> [   14.478973][  T139]  may_open_dev+0x11/0x44
> [   14.479367][  T139]  may_open+0x115/0x13c
> [   14.479727][  T139]  do_open+0xa1/0x378
> [   14.480113][  T139]  path_openat+0xdc/0x1bc
> [   14.480512][  T139]  do_filp_open+0x91/0x124
> [   14.480911][  T139]  ? lock_release+0x62/0x118
> [   14.481329][  T139]  ? _raw_spin_unlock+0x18/0x34
> [   14.481797][  T139]  ? alloc_fd+0x112/0x1c4
> [   14.482183][  T139]  do_sys_openat2+0x7a/0xa0
> [   14.482592][  T139]  __ia32_sys_openat+0x66/0x9c
> [   14.483065][  T139]  do_int80_syscall_32+0x27/0x48
> [   14.483502][  T139]  entry_INT80_32+0x10d/0x10d
> [   14.483962][  T139] EIP: 0xa7f39092
> [   14.484267][  T139] Code: 00 00 00 e9 90 ff ff ff ff a3 24 00 00 00 68 30 00 00 00 e9 80 ff ff ff ff a3 f8 ff ff ff 66 90 00 00 00 00 00 00 00 00 cd 80 <c3> 8d b4
>  26 00 00 00 00 8d b6 00 00 00 00 8b 1c 24 c3 8d b4 26 00
> [   14.485995][  T139] EAX: ffffffda EBX: ffffff9c ECX: 005df542 EDX: 00008100
> [   14.486622][  T139] ESI: 00000000 EDI: 00000000 EBP: affeb888 ESP: affeb6ec
> [   14.487225][  T139] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00200246
>
>
>
> The kernel config and materials to reproduce are available at:
> https://download.01.org/0day-ci/archive/20230815/202308151506.6be3b169-oliver.sang@intel.com
>
>
>
> --
> 0-DAY CI Kernel Test Service
> https://github.com/intel/lkp-tests/wiki
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 1/4] bpf: add cgroup device guard to flag a cgroup device prog
  2023-08-14 14:26 ` [PATCH RFC 1/4] bpf: add cgroup device guard to flag a cgroup device prog Michael Weiß
  2023-08-14 15:54   ` Alexander Mikhalitsyn
@ 2023-08-15  8:59   ` Christian Brauner
  2023-08-17 15:47     ` Michael Weiß
  2023-08-17 22:11     ` Alexei Starovoitov
  1 sibling, 2 replies; 18+ messages in thread
From: Christian Brauner @ 2023-08-15  8:59 UTC (permalink / raw)
  To: Michael Weiß
  Cc: Alexander Mikhalitsyn, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Quentin Monnet, Alexander Viro, bpf, linux-kernel, linux-fsdevel,
	gyroidos

On Mon, Aug 14, 2023 at 04:26:09PM +0200, Michael Weiß wrote:
> Introduce the BPF_F_CGROUP_DEVICE_GUARD flag for BPF_PROG_LOAD
> which allows to set a cgroup device program to be a device guard.

Currently we block access to devices unconditionally in may_open_dev().
Anything that's mounted by an unprivileged containers will get
SB_I_NODEV set in s_i_flags.

Then we currently mediate device access in:

* inode_permission()
  -> devcgroup_inode_permission()
* vfs_mknod()
  -> devcgroup_inode_mknod()
* blkdev_get_by_dev() // sget()/sget_fc(), other ways to open block devices and friends
  -> devcgroup_check_permission()
* drivers/gpu/drm/amd/amdkfd // weird restrictions on showing gpu info afaict
  -> devcgroup_check_permission()

All your new flag does is to bypass that SB_I_NODEV check afaict and let
it proceed to the devcgroup_*() checks for the vfs layer.

But I don't get the semantics yet.
Is that a flag which is set on BPF_PROG_TYPE_CGROUP_DEVICE programs or
is that a flag on random bpf programs? It looks like it would be the
latter but design-wise I would expect this to be a property of the
device program itself.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 1/4] bpf: add cgroup device guard to flag a cgroup device prog
  2023-08-15  8:59   ` Christian Brauner
@ 2023-08-17 15:47     ` Michael Weiß
  2023-08-17 22:11     ` Alexei Starovoitov
  1 sibling, 0 replies; 18+ messages in thread
From: Michael Weiß @ 2023-08-17 15:47 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Alexander Mikhalitsyn, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Quentin Monnet, Alexander Viro, bpf, linux-kernel, linux-fsdevel,
	gyroidos

On 15.08.23 10:59, Christian Brauner wrote:
> On Mon, Aug 14, 2023 at 04:26:09PM +0200, Michael Weiß wrote:
>> Introduce the BPF_F_CGROUP_DEVICE_GUARD flag for BPF_PROG_LOAD
>> which allows to set a cgroup device program to be a device guard.
> 
> Currently we block access to devices unconditionally in may_open_dev().
> Anything that's mounted by an unprivileged containers will get
> SB_I_NODEV set in s_i_flags.
> 
> Then we currently mediate device access in:
> 
> * inode_permission()
>   -> devcgroup_inode_permission()
> * vfs_mknod()
>   -> devcgroup_inode_mknod()
> * blkdev_get_by_dev() // sget()/sget_fc(), other ways to open block devices and friends
>   -> devcgroup_check_permission()
> * drivers/gpu/drm/amd/amdkfd // weird restrictions on showing gpu info afaict
>   -> devcgroup_check_permission()
> 
> All your new flag does is to bypass that SB_I_NODEV check afaict and let
> it proceed to the devcgroup_*() checks for the vfs layer.

Yes. In an early version, I had the check in super.c to avoid setting the
SB_I_NODEV on mount. I thought it would be a less invasive change to do both
checks in one source file. But from an architecture point of view it would be
better that we do it there. Should we?

> 
> But I don't get the semantics yet.
> Is that a flag which is set on BPF_PROG_TYPE_CGROUP_DEVICE programs or
> is that a flag on random bpf programs? It looks like it would be the
> latter but design-wise I would expect this to be a property of the
> device program itself.

Yes it's a flag on the bpf program which could be set during BPF_PROG_LOAD.
This was straight forward to be implemented similarly to the BPF_F_XDP_*
flags.

Cheers,
Michael

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 1/4] bpf: add cgroup device guard to flag a cgroup device prog
  2023-08-14 15:54   ` Alexander Mikhalitsyn
@ 2023-08-17 15:50     ` Michael Weiß
  0 siblings, 0 replies; 18+ messages in thread
From: Michael Weiß @ 2023-08-17 15:50 UTC (permalink / raw)
  To: Alexander Mikhalitsyn
  Cc: Christian Brauner, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Quentin Monnet, Alexander Viro, bpf, linux-kernel, linux-fsdevel,
	gyroidos, stgraber

On 14.08.23 17:54, Alexander Mikhalitsyn wrote:
> On Mon, Aug 14, 2023 at 4:32 PM Michael Weiß
> <michael.weiss@aisec.fraunhofer.de> wrote:
>>
>> Introduce the BPF_F_CGROUP_DEVICE_GUARD flag for BPF_PROG_LOAD
>> which allows to set a cgroup device program to be a device guard.
>> Later this may be used to guard actions on device nodes in
>> non-initial userns. For this reason we provide the helper function
>> cgroup_bpf_device_guard_enabled() to check if a task has a cgroups
>> device program which is a device guard in its effective set of bpf
>> programs.
>>
>> Signed-off-by: Michael Weiß <michael.weiss@aisec.fraunhofer.de>
> 
> Hi Michael!
> 
> Thanks for working on this. It's also very useful for the LXC system
> containers project.
> 
>> ---
>>  include/linux/bpf-cgroup.h     |  7 +++++++
>>  include/linux/bpf.h            |  1 +
>>  include/uapi/linux/bpf.h       |  5 +++++
>>  kernel/bpf/cgroup.c            | 30 ++++++++++++++++++++++++++++++
>>  kernel/bpf/syscall.c           |  5 ++++-
>>  tools/include/uapi/linux/bpf.h |  5 +++++
>>  6 files changed, 52 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
>> index 57e9e109257e..112b6093f9fd 100644
>> --- a/include/linux/bpf-cgroup.h
>> +++ b/include/linux/bpf-cgroup.h
>> @@ -184,6 +184,8 @@ static inline bool cgroup_bpf_sock_enabled(struct sock *sk,
>>         return array != &bpf_empty_prog_array.hdr;
>>  }
>>
>> +bool cgroup_bpf_device_guard_enabled(struct task_struct *task);
>> +
>>  /* Wrappers for __cgroup_bpf_run_filter_skb() guarded by cgroup_bpf_enabled. */
>>  #define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk, skb)                            \
>>  ({                                                                           \
>> @@ -476,6 +478,11 @@ static inline int bpf_percpu_cgroup_storage_update(struct bpf_map *map,
>>         return 0;
>>  }
>>
>> +static bool cgroup_bpf_device_guard_enabled(struct task_struct *task)
>> +{
>> +       return false;
>> +}
>> +
>>  #define cgroup_bpf_enabled(atype) (0)
>>  #define BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, atype, t_ctx) ({ 0; })
>>  #define BPF_CGROUP_RUN_SA_PROG(sk, uaddr, atype) ({ 0; })
>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>> index f58895830ada..313cce8aee05 100644
>> --- a/include/linux/bpf.h
>> +++ b/include/linux/bpf.h
>> @@ -1384,6 +1384,7 @@ struct bpf_prog_aux {
>>         bool sleepable;
>>         bool tail_call_reachable;
>>         bool xdp_has_frags;
>> +       bool cgroup_device_guard;
>>         /* BTF_KIND_FUNC_PROTO for valid attach_btf_id */
>>         const struct btf_type *attach_func_proto;
>>         /* function name for valid attach_btf_id */
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 60a9d59beeab..3be57f7957b1 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -1165,6 +1165,11 @@ enum bpf_link_type {
>>   */
>>  #define BPF_F_XDP_DEV_BOUND_ONLY       (1U << 6)
>>
>> +/* If BPF_F_CGROUP_DEVICE_GUARD is used in BPF_PROG_LOAD command, the loaded
>> + * program will be allowed to guard device access inside user namespaces.
>> + */
>> +#define BPF_F_CGROUP_DEVICE_GUARD      (1U << 7)
>> +
>>  /* link_create.kprobe_multi.flags used in LINK_CREATE command for
>>   * BPF_TRACE_KPROBE_MULTI attach type to create return probe.
>>   */
>> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
>> index 5b2741aa0d9b..230693ca4cdb 100644
>> --- a/kernel/bpf/cgroup.c
>> +++ b/kernel/bpf/cgroup.c
>> @@ -2505,6 +2505,36 @@ const struct bpf_verifier_ops cg_sockopt_verifier_ops = {
>>  const struct bpf_prog_ops cg_sockopt_prog_ops = {
>>  };
>>
>> +bool
>> +cgroup_bpf_device_guard_enabled(struct task_struct *task)
>> +{
>> +       bool ret;
>> +       const struct bpf_prog_array *array;
>> +       const struct bpf_prog_array_item *item;
>> +       const struct bpf_prog *prog;
>> +       struct cgroup *cgrp = task_dfl_cgroup(task);
>> +
>> +       ret = false;
>> +
>> +       array = rcu_access_pointer(cgrp->bpf.effective[CGROUP_DEVICE]);
>> +       if (array == &bpf_empty_prog_array.hdr)
>> +               return ret;
>> +
>> +       mutex_lock(&cgroup_mutex);
> 
> -> cgroup_lock();
> 
>> +       array = rcu_dereference_protected(cgrp->bpf.effective[CGROUP_DEVICE],
>> +                                             lockdep_is_held(&cgroup_mutex));
>> +       item = &array->items[0];
>> +       while ((prog = READ_ONCE(item->prog))) {
>> +               if (prog->aux->cgroup_device_guard) {
>> +                       ret = true;
>> +                       break;
>> +               }
>> +               item++;
>> +       }
>> +       mutex_unlock(&cgroup_mutex);
> 
> Don't we want to make this function specific to "current" task? This
> allows to make locking lighter like in
> __cgroup_bpf_check_dev_permission
> https://github.com/torvalds/linux/blob/2ccdd1b13c591d306f0401d98dedc4bdcd02b421/kernel/bpf/cgroup.c#L1531
> Here we have only RCU read lock.
> 
> AFAIK, cgroup_mutex is a heavily-contended lock.

Seems legit. So definitely we should do that. Thanks.

Cheers,
Michael
> 
> Kind regards,
> Alex
> 
>> +       return ret;
>> +}
>> +
>>  /* Common helpers for cgroup hooks. */
>>  const struct bpf_func_proto *
>>  cgroup_common_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> index a2aef900519c..33ea67c702c1 100644
>> --- a/kernel/bpf/syscall.c
>> +++ b/kernel/bpf/syscall.c
>> @@ -2564,7 +2564,8 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr, u32 uattr_size)
>>                                  BPF_F_SLEEPABLE |
>>                                  BPF_F_TEST_RND_HI32 |
>>                                  BPF_F_XDP_HAS_FRAGS |
>> -                                BPF_F_XDP_DEV_BOUND_ONLY))
>> +                                BPF_F_XDP_DEV_BOUND_ONLY |
>> +                                BPF_F_CGROUP_DEVICE_GUARD))
>>                 return -EINVAL;
>>
>>         if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) &&
>> @@ -2651,6 +2652,8 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr, u32 uattr_size)
>>         prog->aux->dev_bound = !!attr->prog_ifindex;
>>         prog->aux->sleepable = attr->prog_flags & BPF_F_SLEEPABLE;
>>         prog->aux->xdp_has_frags = attr->prog_flags & BPF_F_XDP_HAS_FRAGS;
>> +       prog->aux->cgroup_device_guard =
>> +               attr->prog_flags & BPF_F_CGROUP_DEVICE_GUARD;
>>
>>         err = security_bpf_prog_alloc(prog->aux);
>>         if (err)
>> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
>> index 60a9d59beeab..3be57f7957b1 100644
>> --- a/tools/include/uapi/linux/bpf.h
>> +++ b/tools/include/uapi/linux/bpf.h
>> @@ -1165,6 +1165,11 @@ enum bpf_link_type {
>>   */
>>  #define BPF_F_XDP_DEV_BOUND_ONLY       (1U << 6)
>>
>> +/* If BPF_F_CGROUP_DEVICE_GUARD is used in BPF_PROG_LOAD command, the loaded
>> + * program will be allowed to guard device access inside user namespaces.
>> + */
>> +#define BPF_F_CGROUP_DEVICE_GUARD      (1U << 7)
>> +
>>  /* link_create.kprobe_multi.flags used in LINK_CREATE command for
>>   * BPF_TRACE_KPROBE_MULTI attach type to create return probe.
>>   */
>>
>> --
>> 2.30.2
>>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 1/4] bpf: add cgroup device guard to flag a cgroup device prog
  2023-08-15  8:59   ` Christian Brauner
  2023-08-17 15:47     ` Michael Weiß
@ 2023-08-17 22:11     ` Alexei Starovoitov
  2023-08-29 13:35       ` Alexander Mikhalitsyn
  1 sibling, 1 reply; 18+ messages in thread
From: Alexei Starovoitov @ 2023-08-17 22:11 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Michael Weiß, Alexander Mikhalitsyn, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Quentin Monnet, Alexander Viro, bpf,
	linux-kernel, linux-fsdevel, gyroidos

On Tue, Aug 15, 2023 at 10:59:22AM +0200, Christian Brauner wrote:
> On Mon, Aug 14, 2023 at 04:26:09PM +0200, Michael Weiß wrote:
> > Introduce the BPF_F_CGROUP_DEVICE_GUARD flag for BPF_PROG_LOAD
> > which allows to set a cgroup device program to be a device guard.
> 
> Currently we block access to devices unconditionally in may_open_dev().
> Anything that's mounted by an unprivileged containers will get
> SB_I_NODEV set in s_i_flags.
> 
> Then we currently mediate device access in:
> 
> * inode_permission()
>   -> devcgroup_inode_permission()
> * vfs_mknod()
>   -> devcgroup_inode_mknod()
> * blkdev_get_by_dev() // sget()/sget_fc(), other ways to open block devices and friends
>   -> devcgroup_check_permission()
> * drivers/gpu/drm/amd/amdkfd // weird restrictions on showing gpu info afaict
>   -> devcgroup_check_permission()
> 
> All your new flag does is to bypass that SB_I_NODEV check afaict and let
> it proceed to the devcgroup_*() checks for the vfs layer.
> 
> But I don't get the semantics yet.
> Is that a flag which is set on BPF_PROG_TYPE_CGROUP_DEVICE programs or
> is that a flag on random bpf programs? It looks like it would be the
> latter but design-wise I would expect this to be a property of the
> device program itself.

Looks like patch 4 attemps to bypass usual permission checks with:
@@ -3976,9 +3979,19 @@ int vfs_mknod(struct mnt_idmap *idmap, struct inode *dir,
        if (error)
                return error;

-       if ((S_ISCHR(mode) || S_ISBLK(mode)) && !is_whiteout &&
-           !capable(CAP_MKNOD))
-               return -EPERM;
+       /*
+        * In case of a device cgroup restirction allow mknod in user
+        * namespace. Otherwise just check global capability; thus,
+        * mknod is also disabled for user namespace other than the
+        * initial one.
+        */
+       if ((S_ISCHR(mode) || S_ISBLK(mode)) && !is_whiteout) {
+               if (devcgroup_task_is_guarded(current)) {
+                       if (!ns_capable(current_user_ns(), CAP_MKNOD))
+                               return -EPERM;
+               } else if (!capable(CAP_MKNOD))
+                       return -EPERM;
+       }

which pretty much sounds like authoritative LSM that was brought up in the past
and LSM folks didn't like it.

If vfs folks are ok with this special bypass of permissions in vfs_mknod()
we can talk about kernel->bpf api details.
The way it's done with BPF_F_CGROUP_DEVICE_GUARD flag is definitely no go,
but no point going into bpf details now until agreement on bypass is made.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 1/4] bpf: add cgroup device guard to flag a cgroup device prog
  2023-08-17 22:11     ` Alexei Starovoitov
@ 2023-08-29 13:35       ` Alexander Mikhalitsyn
  2023-09-04 11:44         ` Christian Brauner
  0 siblings, 1 reply; 18+ messages in thread
From: Alexander Mikhalitsyn @ 2023-08-29 13:35 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Christian Brauner, Michael Weiß, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Quentin Monnet, Alexander Viro, bpf,
	linux-kernel, linux-fsdevel, gyroidos, paul, Miklos Szeredi,
	Amir Goldstein

On Fri, Aug 18, 2023 at 12:11 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Aug 15, 2023 at 10:59:22AM +0200, Christian Brauner wrote:
> > On Mon, Aug 14, 2023 at 04:26:09PM +0200, Michael Weiß wrote:
> > > Introduce the BPF_F_CGROUP_DEVICE_GUARD flag for BPF_PROG_LOAD
> > > which allows to set a cgroup device program to be a device guard.
> >
> > Currently we block access to devices unconditionally in may_open_dev().
> > Anything that's mounted by an unprivileged containers will get
> > SB_I_NODEV set in s_i_flags.
> >
> > Then we currently mediate device access in:
> >
> > * inode_permission()
> >   -> devcgroup_inode_permission()
> > * vfs_mknod()
> >   -> devcgroup_inode_mknod()
> > * blkdev_get_by_dev() // sget()/sget_fc(), other ways to open block devices and friends
> >   -> devcgroup_check_permission()
> > * drivers/gpu/drm/amd/amdkfd // weird restrictions on showing gpu info afaict
> >   -> devcgroup_check_permission()
> >
> > All your new flag does is to bypass that SB_I_NODEV check afaict and let
> > it proceed to the devcgroup_*() checks for the vfs layer.
> >
> > But I don't get the semantics yet.
> > Is that a flag which is set on BPF_PROG_TYPE_CGROUP_DEVICE programs or
> > is that a flag on random bpf programs? It looks like it would be the
> > latter but design-wise I would expect this to be a property of the
> > device program itself.
>
> Looks like patch 4 attemps to bypass usual permission checks with:
> @@ -3976,9 +3979,19 @@ int vfs_mknod(struct mnt_idmap *idmap, struct inode *dir,
>         if (error)
>                 return error;
>
> -       if ((S_ISCHR(mode) || S_ISBLK(mode)) && !is_whiteout &&
> -           !capable(CAP_MKNOD))
> -               return -EPERM;
> +       /*
> +        * In case of a device cgroup restirction allow mknod in user
> +        * namespace. Otherwise just check global capability; thus,
> +        * mknod is also disabled for user namespace other than the
> +        * initial one.
> +        */
> +       if ((S_ISCHR(mode) || S_ISBLK(mode)) && !is_whiteout) {
> +               if (devcgroup_task_is_guarded(current)) {
> +                       if (!ns_capable(current_user_ns(), CAP_MKNOD))
> +                               return -EPERM;
> +               } else if (!capable(CAP_MKNOD))
> +                       return -EPERM;
> +       }
>

Dear colleagues,

> which pretty much sounds like authoritative LSM that was brought up in the past
> and LSM folks didn't like it.

Thanks for pointing this out, Alexei!
I've searched through the LKML archives and found a thread about this:
https://lore.kernel.org/all/CAEf4BzaBt0W3sWh_L4RRXEFYdBotzVEnQdqC7BO+PNWtD7eSUA@mail.gmail.com/

As far as I understand, disagreement here is about a practice of
skipping kernel-built capability checks based
on LSM hooks, right?

+CC Paul Moore <paul@paul-moore.com>

>
> If vfs folks are ok with this special bypass of permissions in vfs_mknod()
> we can talk about kernel->bpf api details.
> The way it's done with BPF_F_CGROUP_DEVICE_GUARD flag is definitely no go,
> but no point going into bpf details now until agreement on bypass is made.

+CC Miklos Szeredi <miklos@szeredi.hu>
+CC Amir Goldstein <amir73il@gmail.com>

Kind regards,
Alex

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 1/4] bpf: add cgroup device guard to flag a cgroup device prog
  2023-08-29 13:35       ` Alexander Mikhalitsyn
@ 2023-09-04 11:44         ` Christian Brauner
  2023-09-11 10:38           ` Michael Weiß
  2023-09-11 19:20           ` Paul Moore
  0 siblings, 2 replies; 18+ messages in thread
From: Christian Brauner @ 2023-09-04 11:44 UTC (permalink / raw)
  To: Alexander Mikhalitsyn, Alexei Starovoitov, Michael Weiß
  Cc: Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Quentin Monnet, Alexander Viro, bpf,
	linux-kernel, linux-fsdevel, gyroidos, paul, Miklos Szeredi,
	Amir Goldstein

On Tue, Aug 29, 2023 at 03:35:46PM +0200, Alexander Mikhalitsyn wrote:
> On Fri, Aug 18, 2023 at 12:11 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Tue, Aug 15, 2023 at 10:59:22AM +0200, Christian Brauner wrote:
> > > On Mon, Aug 14, 2023 at 04:26:09PM +0200, Michael Weiß wrote:
> > > > Introduce the BPF_F_CGROUP_DEVICE_GUARD flag for BPF_PROG_LOAD
> > > > which allows to set a cgroup device program to be a device guard.
> > >
> > > Currently we block access to devices unconditionally in may_open_dev().
> > > Anything that's mounted by an unprivileged containers will get
> > > SB_I_NODEV set in s_i_flags.
> > >
> > > Then we currently mediate device access in:
> > >
> > > * inode_permission()
> > >   -> devcgroup_inode_permission()
> > > * vfs_mknod()
> > >   -> devcgroup_inode_mknod()
> > > * blkdev_get_by_dev() // sget()/sget_fc(), other ways to open block devices and friends
> > >   -> devcgroup_check_permission()
> > > * drivers/gpu/drm/amd/amdkfd // weird restrictions on showing gpu info afaict
> > >   -> devcgroup_check_permission()
> > >
> > > All your new flag does is to bypass that SB_I_NODEV check afaict and let
> > > it proceed to the devcgroup_*() checks for the vfs layer.
> > >
> > > But I don't get the semantics yet.
> > > Is that a flag which is set on BPF_PROG_TYPE_CGROUP_DEVICE programs or
> > > is that a flag on random bpf programs? It looks like it would be the
> > > latter but design-wise I would expect this to be a property of the
> > > device program itself.
> >
> > Looks like patch 4 attemps to bypass usual permission checks with:
> > @@ -3976,9 +3979,19 @@ int vfs_mknod(struct mnt_idmap *idmap, struct inode *dir,
> >         if (error)
> >                 return error;
> >
> > -       if ((S_ISCHR(mode) || S_ISBLK(mode)) && !is_whiteout &&
> > -           !capable(CAP_MKNOD))
> > -               return -EPERM;
> > +       /*
> > +        * In case of a device cgroup restirction allow mknod in user
> > +        * namespace. Otherwise just check global capability; thus,
> > +        * mknod is also disabled for user namespace other than the
> > +        * initial one.
> > +        */
> > +       if ((S_ISCHR(mode) || S_ISBLK(mode)) && !is_whiteout) {
> > +               if (devcgroup_task_is_guarded(current)) {
> > +                       if (!ns_capable(current_user_ns(), CAP_MKNOD))
> > +                               return -EPERM;
> > +               } else if (!capable(CAP_MKNOD))
> > +                       return -EPERM;
> > +       }
> >
> 
> Dear colleagues,
> 
> > which pretty much sounds like authoritative LSM that was brought up in the past
> > and LSM folks didn't like it.
> 
> Thanks for pointing this out, Alexei!
> I've searched through the LKML archives and found a thread about this:
> https://lore.kernel.org/all/CAEf4BzaBt0W3sWh_L4RRXEFYdBotzVEnQdqC7BO+PNWtD7eSUA@mail.gmail.com/
> 
> As far as I understand, disagreement here is about a practice of
> skipping kernel-built capability checks based
> on LSM hooks, right?
> 
> +CC Paul Moore <paul@paul-moore.com>
> 
> >
> > If vfs folks are ok with this special bypass of permissions in vfs_mknod()
> > we can talk about kernel->bpf api details.
> > The way it's done with BPF_F_CGROUP_DEVICE_GUARD flag is definitely no go,
> > but no point going into bpf details now until agreement on bypass is made.

Afaiu the original concern was specifically about an LSM allowing to
bypass other LSMs or DAC permissions. But this wouldn't be the case
here. The general inode access LSM permission mediation is separate from
specific device access management: the security_inode_permission() LSM
hook would still be called and thus LSMs restrictions would continue to
apply exactly as they do now.

For cgroup v1 device access management was a cgroup controller with
management interface through files. It then was ported to an eBPF
program attachable to cgroups for cgroup v2. Arguably, it should
probably have been ported to an LSM hook or a separate LSM and untied
from cgroups completely. The confusion here seems to indicate that that
would have been the right way to go.

Because right now device access management seems its own form of
mandatory access control.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 1/4] bpf: add cgroup device guard to flag a cgroup device prog
  2023-09-04 11:44         ` Christian Brauner
@ 2023-09-11 10:38           ` Michael Weiß
  2023-09-11 12:35             ` Christian Brauner
  2023-09-11 19:20           ` Paul Moore
  1 sibling, 1 reply; 18+ messages in thread
From: Michael Weiß @ 2023-09-11 10:38 UTC (permalink / raw)
  To: Christian Brauner, Alexander Mikhalitsyn, Alexei Starovoitov
  Cc: Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Quentin Monnet, Alexander Viro, bpf,
	linux-kernel, linux-fsdevel, gyroidos, paul, Miklos Szeredi,
	Amir Goldstein

On 04.09.23 13:44, Christian Brauner wrote:
> On Tue, Aug 29, 2023 at 03:35:46PM +0200, Alexander Mikhalitsyn wrote:
>> On Fri, Aug 18, 2023 at 12:11 AM Alexei Starovoitov
>> <alexei.starovoitov@gmail.com> wrote:
>>>
>>> On Tue, Aug 15, 2023 at 10:59:22AM +0200, Christian Brauner wrote:
>>>> On Mon, Aug 14, 2023 at 04:26:09PM +0200, Michael Weiß wrote:
>>>>> Introduce the BPF_F_CGROUP_DEVICE_GUARD flag for BPF_PROG_LOAD
>>>>> which allows to set a cgroup device program to be a device guard.
>>>>
>>>> Currently we block access to devices unconditionally in may_open_dev().
>>>> Anything that's mounted by an unprivileged containers will get
>>>> SB_I_NODEV set in s_i_flags.
>>>>
>>>> Then we currently mediate device access in:
>>>>
>>>> * inode_permission()
>>>>   -> devcgroup_inode_permission()
>>>> * vfs_mknod()
>>>>   -> devcgroup_inode_mknod()
>>>> * blkdev_get_by_dev() // sget()/sget_fc(), other ways to open block devices and friends
>>>>   -> devcgroup_check_permission()
>>>> * drivers/gpu/drm/amd/amdkfd // weird restrictions on showing gpu info afaict
>>>>   -> devcgroup_check_permission()
>>>>
>>>> All your new flag does is to bypass that SB_I_NODEV check afaict and let
>>>> it proceed to the devcgroup_*() checks for the vfs layer.
>>>>
>>>> But I don't get the semantics yet.
>>>> Is that a flag which is set on BPF_PROG_TYPE_CGROUP_DEVICE programs or
>>>> is that a flag on random bpf programs? It looks like it would be the
>>>> latter but design-wise I would expect this to be a property of the
>>>> device program itself.
>>>
>>> Looks like patch 4 attemps to bypass usual permission checks with:
>>> @@ -3976,9 +3979,19 @@ int vfs_mknod(struct mnt_idmap *idmap, struct inode *dir,
>>>         if (error)
>>>                 return error;
>>>
>>> -       if ((S_ISCHR(mode) || S_ISBLK(mode)) && !is_whiteout &&
>>> -           !capable(CAP_MKNOD))
>>> -               return -EPERM;
>>> +       /*
>>> +        * In case of a device cgroup restirction allow mknod in user
>>> +        * namespace. Otherwise just check global capability; thus,
>>> +        * mknod is also disabled for user namespace other than the
>>> +        * initial one.
>>> +        */
>>> +       if ((S_ISCHR(mode) || S_ISBLK(mode)) && !is_whiteout) {
>>> +               if (devcgroup_task_is_guarded(current)) {
>>> +                       if (!ns_capable(current_user_ns(), CAP_MKNOD))
>>> +                               return -EPERM;
>>> +               } else if (!capable(CAP_MKNOD))
>>> +                       return -EPERM;
>>> +       }
>>>
>>
>> Dear colleagues,
>>
>>> which pretty much sounds like authoritative LSM that was brought up in the past
>>> and LSM folks didn't like it.
>>
>> Thanks for pointing this out, Alexei!
>> I've searched through the LKML archives and found a thread about this:
>> https://lore.kernel.org/all/CAEf4BzaBt0W3sWh_L4RRXEFYdBotzVEnQdqC7BO+PNWtD7eSUA@mail.gmail.com/
>>
>> As far as I understand, disagreement here is about a practice of
>> skipping kernel-built capability checks based
>> on LSM hooks, right?
>>
>> +CC Paul Moore <paul@paul-moore.com>
>>
>>>
>>> If vfs folks are ok with this special bypass of permissions in vfs_mknod()
>>> we can talk about kernel->bpf api details.
>>> The way it's done with BPF_F_CGROUP_DEVICE_GUARD flag is definitely no go,
>>> but no point going into bpf details now until agreement on bypass is made.
> 
> Afaiu the original concern was specifically about an LSM allowing to
> bypass other LSMs or DAC permissions. But this wouldn't be the case
> here. The general inode access LSM permission mediation is separate from
> specific device access management: the security_inode_permission() LSM
> hook would still be called and thus LSMs restrictions would continue to
> apply exactly as they do now.

So are OK with the checks here?
> 
> For cgroup v1 device access management was a cgroup controller with
> management interface through files. It then was ported to an eBPF
> program attachable to cgroups for cgroup v2. Arguably, it should
> probably have been ported to an LSM hook or a separate LSM and untied
> from cgroups completely. The confusion here seems to indicate that that
> would have been the right way to go.
> 
> Because right now device access management seems its own form of
> mandatory access control.

I'm currently testing an updated version which has incorporated the locking
changes already mention by Alex and the change which avoids setting SB_I_NODEV
in fs/super.c.

Does is make sense to send out a v2 to further discuss BPF related changes?

Thnx,
Michael

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 1/4] bpf: add cgroup device guard to flag a cgroup device prog
  2023-09-11 10:38           ` Michael Weiß
@ 2023-09-11 12:35             ` Christian Brauner
  0 siblings, 0 replies; 18+ messages in thread
From: Christian Brauner @ 2023-09-11 12:35 UTC (permalink / raw)
  To: Michael Weiß
  Cc: Alexander Mikhalitsyn, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Quentin Monnet, Alexander Viro, bpf, linux-kernel, linux-fsdevel,
	gyroidos, paul, Miklos Szeredi, Amir Goldstein

> So are OK with the checks here?

I'm ok with figuring out whether we can do this nicely, yes.

> > Because right now device access management seems its own form of
> > mandatory access control.
> 
> I'm currently testing an updated version which has incorporated the locking
> changes already mention by Alex and the change which avoids setting SB_I_NODEV
> in fs/super.c.

Not having to hack around SB_I_NODEV would be pretty crucial imho. It's
a core security assumption so we need to integrate with it nicely.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 1/4] bpf: add cgroup device guard to flag a cgroup device prog
  2023-09-04 11:44         ` Christian Brauner
  2023-09-11 10:38           ` Michael Weiß
@ 2023-09-11 19:20           ` Paul Moore
  1 sibling, 0 replies; 18+ messages in thread
From: Paul Moore @ 2023-09-11 19:20 UTC (permalink / raw)
  To: Alexander Mikhalitsyn, Christian Brauner
  Cc: Alexei Starovoitov, Michael Weiß, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Quentin Monnet, Alexander Viro, bpf, linux-kernel, linux-fsdevel,
	gyroidos, Miklos Szeredi, Amir Goldstein

On Mon, Sep 4, 2023 at 7:44 AM Christian Brauner <brauner@kernel.org> wrote:
> On Tue, Aug 29, 2023 at 03:35:46PM +0200, Alexander Mikhalitsyn wrote:
> > On Fri, Aug 18, 2023 at 12:11 AM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Tue, Aug 15, 2023 at 10:59:22AM +0200, Christian Brauner wrote:
> > > > On Mon, Aug 14, 2023 at 04:26:09PM +0200, Michael Weiß wrote:
> > > > > Introduce the BPF_F_CGROUP_DEVICE_GUARD flag for BPF_PROG_LOAD
> > > > > which allows to set a cgroup device program to be a device guard.
> > > >
> > > > Currently we block access to devices unconditionally in may_open_dev().
> > > > Anything that's mounted by an unprivileged containers will get
> > > > SB_I_NODEV set in s_i_flags.
> > > >
> > > > Then we currently mediate device access in:
> > > >
> > > > * inode_permission()
> > > >   -> devcgroup_inode_permission()
> > > > * vfs_mknod()
> > > >   -> devcgroup_inode_mknod()
> > > > * blkdev_get_by_dev() // sget()/sget_fc(), other ways to open block devices and friends
> > > >   -> devcgroup_check_permission()
> > > > * drivers/gpu/drm/amd/amdkfd // weird restrictions on showing gpu info afaict
> > > >   -> devcgroup_check_permission()
> > > >
> > > > All your new flag does is to bypass that SB_I_NODEV check afaict and let
> > > > it proceed to the devcgroup_*() checks for the vfs layer.
> > > >
> > > > But I don't get the semantics yet.
> > > > Is that a flag which is set on BPF_PROG_TYPE_CGROUP_DEVICE programs or
> > > > is that a flag on random bpf programs? It looks like it would be the
> > > > latter but design-wise I would expect this to be a property of the
> > > > device program itself.
> > >
> > > Looks like patch 4 attemps to bypass usual permission checks with:
> > > @@ -3976,9 +3979,19 @@ int vfs_mknod(struct mnt_idmap *idmap, struct inode *dir,
> > >         if (error)
> > >                 return error;
> > >
> > > -       if ((S_ISCHR(mode) || S_ISBLK(mode)) && !is_whiteout &&
> > > -           !capable(CAP_MKNOD))
> > > -               return -EPERM;
> > > +       /*
> > > +        * In case of a device cgroup restirction allow mknod in user
> > > +        * namespace. Otherwise just check global capability; thus,
> > > +        * mknod is also disabled for user namespace other than the
> > > +        * initial one.
> > > +        */
> > > +       if ((S_ISCHR(mode) || S_ISBLK(mode)) && !is_whiteout) {
> > > +               if (devcgroup_task_is_guarded(current)) {
> > > +                       if (!ns_capable(current_user_ns(), CAP_MKNOD))
> > > +                               return -EPERM;
> > > +               } else if (!capable(CAP_MKNOD))
> > > +                       return -EPERM;
> > > +       }
> > >
> >
> > Dear colleagues,
> >
> > > which pretty much sounds like authoritative LSM that was brought up in the past
> > > and LSM folks didn't like it.
> >
> > Thanks for pointing this out, Alexei!
> > I've searched through the LKML archives and found a thread about this:
> > https://lore.kernel.org/all/CAEf4BzaBt0W3sWh_L4RRXEFYdBotzVEnQdqC7BO+PNWtD7eSUA@mail.gmail.com/
> >
> > As far as I understand, disagreement here is about a practice of
> > skipping kernel-built capability checks based
> > on LSM hooks, right?
> >
> > +CC Paul Moore <paul@paul-moore.com>
> >
> > >
> > > If vfs folks are ok with this special bypass of permissions in vfs_mknod()
> > > we can talk about kernel->bpf api details.
> > > The way it's done with BPF_F_CGROUP_DEVICE_GUARD flag is definitely no go,
> > > but no point going into bpf details now until agreement on bypass is made.
>
> Afaiu the original concern was specifically about an LSM allowing to
> bypass other LSMs or DAC permissions. But this wouldn't be the case
> here. The general inode access LSM permission mediation is separate from
> specific device access management: the security_inode_permission() LSM
> hook would still be called and thus LSMs restrictions would continue to
> apply exactly as they do now.
>
> For cgroup v1 device access management was a cgroup controller with
> management interface through files. It then was ported to an eBPF
> program attachable to cgroups for cgroup v2. Arguably, it should
> probably have been ported to an LSM hook or a separate LSM and untied
> from cgroups completely. The confusion here seems to indicate that that
> would have been the right way to go.
>
> Because right now device access management seems its own form of
> mandatory access control.

My apologies, I lost this thread in my inbox and I'm just reading it now.

Historically we haven't any real LSM controls around
cgroups/resource-management, but that is mostly because everything
that I recall being proposed has been very piecemeal and didn't
provide a comprehensive access control solution for resource
management.  If someone wanted to propose a proper set of access
control hooks for cgroups I think that is something we would be happy
to review, merge, etc.

Somewhat relatedly, we've been working on some docs around guidelines
for new LSM hooks; eventually the guidelines will work their way into
the kernel docs, but here they are now:

* https://github.com/LinuxSecurityModule/kernel/blob/main/README.md#new-lsm-hook-guidelines

-- 
paul-moore.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2023-09-11 21:04 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-08-14 14:26 [PATCH RFC 0/4] bpf: cgroup device guard for non-initial user namespace Michael Weiß
2023-08-14 14:26 ` [PATCH RFC 1/4] bpf: add cgroup device guard to flag a cgroup device prog Michael Weiß
2023-08-14 15:54   ` Alexander Mikhalitsyn
2023-08-17 15:50     ` Michael Weiß
2023-08-15  8:59   ` Christian Brauner
2023-08-17 15:47     ` Michael Weiß
2023-08-17 22:11     ` Alexei Starovoitov
2023-08-29 13:35       ` Alexander Mikhalitsyn
2023-09-04 11:44         ` Christian Brauner
2023-09-11 10:38           ` Michael Weiß
2023-09-11 12:35             ` Christian Brauner
2023-09-11 19:20           ` Paul Moore
2023-08-14 14:26 ` [PATCH RFC 2/4] bpf: provide cgroup_device_guard in bpf_prog_info to user space Michael Weiß
2023-08-14 14:26 ` [PATCH RFC 3/4] device_cgroup: wrapper for bpf cgroup device guard Michael Weiß
2023-08-14 14:26 ` [PATCH RFC 4/4] fs: allow mknod in non-initial userns using " Michael Weiß
2023-08-14 15:24   ` Alexander Mikhalitsyn
2023-08-15  7:18   ` kernel test robot
2023-08-15  7:49     ` Alexander Mikhalitsyn

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).