All of lore.kernel.org
* [RFC bpf-next v3 0/5] Sleepable BPF programs on cgroup {get,set}sockopt
@ 2023-08-15 17:47 thinker.li
  2023-08-15 17:47 ` [RFC bpf-next v3 1/5] bpf: enable sleepable BPF programs attached to cgroup/{get,set}sockopt thinker.li
                   ` (4 more replies)
  0 siblings, 5 replies; 28+ messages in thread
From: thinker.li @ 2023-08-15 17:47 UTC (permalink / raw)
  To: bpf, ast, martin.lau, song, kernel-team, andrii, sdf,
	yonghong.song
  Cc: sinquersw, kuifeng, Kui-Feng Lee

From: Kui-Feng Lee <thinker.li@gmail.com>

Major Changes from v2:

 - Add test cases mixing sleepable and non-sleepable BPF programs.

 - Don't expose bpf_sockopt_kern.flags to BPF programs.

 - Rename kfuncs to *_sockopt_dynptr_*()

Major Changes from v1:

 - Add bpf_so_optval_copy_to() and bpf_so_optval_copy_to_r() to copy
   data from a dynptr or raw buffer to the optval of a context, either
   in the kernel or user space, to simplify BPF programs.

 - Restrict to having at most one instance of a dynptr initialized by
   bpf_so_optval_from() at any moment.  This simplifies the memory
   management of the optval buffer in the kernel.

 - Fix the issue of bpf_prog_array_free() by replacing it with
   bpf_prog_array_free_sleepable().


Make BPF programs attached to cgroup/{get,set}sockopt hooks sleepable
so that they can call bpf_copy_from_user() and the new
bpf_copy_to_user() kfunc.

The Issue with CGroup {get,set}sockopt Hooks
============================================

When user space calls {get,set}sockopt, optval is a pointer to a
buffer. The format of the buffer depends on the level and optname, and
its size is given by optlen. User space programs use this buffer to
pass values to setsockopt() and to retrieve values from getsockopt().
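As a minimal user-space illustration of this optval/optlen contract,
the sketch below queries TCP_CONGESTION (which returns a name such as
"cubic" on typical configurations; the helper name get_cc() is ours,
not part of any API):

```c
#include <assert.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <unistd.h>

/* Query TCP_CONGESTION on a fresh socket.  optval is the caller's
 * buffer; optlen is in/out: it tells the kernel the buffer size and
 * is updated to the number of bytes the kernel actually returned.
 * Returns the final optlen, or -1 on error.
 */
static int get_cc(char *optval, socklen_t optlen)
{
	int fd = socket(AF_INET, SOCK_STREAM, 0);
	int err;

	if (fd < 0)
		return -1;
	err = getsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, optval, &optlen);
	close(fd);
	return err ? -1 : (int)optlen;
}
```

Calling get_cc(buf, sizeof(buf)) with a 16-byte buffer returns an
optlen no larger than 16 and fills buf with the congestion-control
name.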

The problem is that BPF programs running under the RCU read lock
cannot access buffers located in user space. These programs are
non-sleepable, while copy_from_user() and copy_to_user() may need to
page in user memory and therefore sleep.

The kernel makes a copy of the buffer specified by optval and optlen
in kernel space before passing it to the cgroup {get,set}sockopt
hooks. After the hooks are executed, the content of the buffer in
kernel space is copied to user space if necessary.

User-space programs may send a significant amount of data, stored in
the buffer indicated by optval, to the kernel. One example is
iptables, which can send several megabytes. However, BPF programs on
these hooks can only see up to the first PAGE_SIZE bytes of the
buffer: the optlen value that BPF programs observe may appear to be
PAGE_SIZE even though the real value is larger. For getsockopt(),
optlen also represents the amount of data retrieved. Additionally, BPF
programs may modify both the buffer content and optlen.

The kernel may then wrongly return an optlen of PAGE_SIZE to user
space, because it cannot distinguish whether the value was set by a
BPF program or capped by the kernel itself.

Various workarounds have been applied to paper over this; for example,
commit d8fe449a9c51 ("bpf: Don't return EINVAL from {get,set}sockopt
when optlen > PAGE_SIZE") and commit 29ebbba7d461 ("bpf: Don't EFAULT
for {g,s}setsockopt with wrong optlen").

Make CGroup {get,set}sockopt Hooks Sleepable
============================================

The long-term solution is to make these hooks sleepable so that BPF
programs can call copy_from_user() and copy_to_user() through the
bpf_copy_from_user() and bpf_copy_to_user() kfuncs. This avoids any
kernel-side manipulation of the optval and optlen values, and allows
BPF programs to access the complete contents of the buffer referenced
by optval.
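As a rough sketch of what such a program could look like (field and
section names follow this series and may change in the final API):

```c
/* Sketch only: a sleepable cgroup/setsockopt program that reads the
 * user buffer with bpf_copy_from_user() instead of relying on a
 * PAGE_SIZE-limited kernel copy.
 */
SEC("cgroup/setsockopt.s")
int sleepable_setsockopt(struct bpf_sockopt *ctx)
{
	char chunk[64];
	__u32 n = ctx->optlen;

	if (n > sizeof(chunk))
		n = sizeof(chunk);	/* first chunk of a possibly huge buffer */
	/* user_optval points into user space and must never be
	 * dereferenced directly.
	 */
	if (bpf_copy_from_user(chunk, n, ctx->user_optval))
		return 0;		/* fault: reject the setsockopt() */
	/* ... inspect chunk; larger buffers can be walked chunk by chunk ... */
	return 1;			/* allow */
}
```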

Mix Sleepable and Non-Sleepable Programs
========================================

Sleepable and non-sleepable programs may be installed simultaneously
on the same hook, mixing the two kinds. For programs that cannot
sleep, the kernel first copies data from the user buffer to a kernel
buffer before invoking BPF programs. This introduces intricate
interactions between sleepable and non-sleepable programs.

For instance, because of the kernel copy made for non-sleepable
programs, a sleepable program may receive optval pointing into either
user space or kernel space. The two scenarios require different
handling when updating the buffer pointed to by optval, which can make
sleepable programs significantly more complex.

To simplify such programs, we introduce a pair of kfuncs,
bpf_so_optval_copy_to() and bpf_so_optval_copy_to_r(), that copy data
to optval without requiring knowledge of where the buffer lives.
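A rough sketch of the intended usage (the helper name and signature
below are assumptions based on this cover letter; v3 renames the
kfuncs to *_sockopt_dynptr_*(), so treat the exact identifiers as
provisional):

```c
/* Sketch: override the requested congestion control, letting the
 * kfunc hide whether optval is a user-space or kernel-space buffer.
 */
SEC("cgroup/setsockopt.s")
int override_cc(struct bpf_sockopt *ctx)
{
	char cubic[] = "cubic";

	/* Assumed signature: copy a raw buffer into optval, wherever
	 * it lives.
	 */
	if (bpf_so_optval_copy_to_r(ctx, cubic, sizeof(cubic)))
		return 0;
	ctx->optlen = sizeof(cubic);
	return 1;
}
```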

---
v1: https://lore.kernel.org/bpf/20230722052248.1062582-1-kuifeng@meta.com/
v2: https://lore.kernel.org/all/20230811043127.1318152-1-thinker.li@gmail.com/

Kui-Feng Lee (5):
  bpf: enable sleepable BPF programs attached to
    cgroup/{get,set}sockopt.
  libbpf: add sleepable sections for {get,set}sockopt()
  bpf: Prevent BPF programs from access the buffer pointed by
    user_optval.
  bpf: Add a new dynptr type for CGRUP_SOCKOPT.
  selftests/bpf: Add test cases for sleepable BPF programs of the
    CGROUP_SOCKOPT type

 include/linux/bpf.h                           |  13 +-
 include/linux/filter.h                        |  10 +
 kernel/bpf/btf.c                              |   3 +
 kernel/bpf/cgroup.c                           | 226 +++++++++++----
 kernel/bpf/helpers.c                          | 197 ++++++++++++++
 kernel/bpf/verifier.c                         | 118 +++++---
 tools/lib/bpf/libbpf.c                        |   2 +
 .../testing/selftests/bpf/bpf_experimental.h  |  36 +++
 tools/testing/selftests/bpf/bpf_kfuncs.h      |  41 +++
 .../selftests/bpf/prog_tests/sockopt_sk.c     | 112 +++++++-
 .../testing/selftests/bpf/progs/sockopt_sk.c  | 257 ++++++++++++++++++
 .../selftests/bpf/verifier/sleepable.c        |   2 +-
 12 files changed, 929 insertions(+), 88 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC bpf-next v3 1/5] bpf: enable sleepable BPF programs attached to cgroup/{get,set}sockopt.
  2023-08-15 17:47 [RFC bpf-next v3 0/5] Sleepable BPF programs on cgroup {get,set}sockopt thinker.li
@ 2023-08-15 17:47 ` thinker.li
  2023-08-15 20:58   ` Stanislav Fomichev
  2023-08-16  0:42   ` kernel test robot
  2023-08-15 17:47 ` [RFC bpf-next v3 2/5] libbpf: add sleepable sections for {get,set}sockopt() thinker.li
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 28+ messages in thread
From: thinker.li @ 2023-08-15 17:47 UTC (permalink / raw)
  To: bpf, ast, martin.lau, song, kernel-team, andrii, sdf,
	yonghong.song
  Cc: sinquersw, kuifeng, Kui-Feng Lee

From: Kui-Feng Lee <thinker.li@gmail.com>

Enable sleepable cgroup/{get,set}sockopt hooks.

Sleepable BPF programs attached to cgroup/{get,set}sockopt hooks may
receive a pointer to the optval in user space instead of a kernel
copy. When a user-space buffer is passed, ctx->user_optval and
ctx->user_optval_end point to the beginning and end of that buffer;
when a kernel-space buffer is passed, ctx->optval and ctx->optval_end
describe the kernel copy.

A program receives a user-space buffer if ctx->flags &
BPF_SOCKOPT_FLAG_OPTVAL_USER is set; otherwise it receives a
kernel-space buffer.  BPF programs should not read from or write to a
user-space buffer directly.  They should access it through
bpf_copy_from_user() and bpf_copy_to_user(), provided in the following
patches.
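A program would therefore branch on the flag before touching the
buffer; a sketch under the ctx fields added by this patch:

```c
/* Sketch: handle both buffer locations in one program. */
SEC("cgroup/getsockopt.s")
int getsockopt_s(struct bpf_sockopt *ctx)
{
	char val[16] = {};

	if (ctx->flags & BPF_SOCKOPT_FLAG_OPTVAL_USER) {
		/* optval is user memory: copy it, never dereference it */
		if (bpf_copy_from_user(val, sizeof(val), ctx->user_optval))
			return 0;
	} else {
		/* kernel copy: direct, bounds-checked access is allowed */
		if (ctx->optval + sizeof(val) > ctx->optval_end)
			return 0;
		__builtin_memcpy(val, ctx->optval, sizeof(val));
	}
	/* ... act on val ... */
	return 1;
}
```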

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
---
 include/linux/bpf.h    |   6 ++
 include/linux/filter.h |   6 ++
 kernel/bpf/cgroup.c    | 207 ++++++++++++++++++++++++++++++++---------
 kernel/bpf/verifier.c  |   5 +-
 4 files changed, 177 insertions(+), 47 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index cfabbcf47bdb..edb35bcfa548 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1769,9 +1769,15 @@ struct bpf_prog_array_item {
 
 struct bpf_prog_array {
 	struct rcu_head rcu;
+	u32 flags;
 	struct bpf_prog_array_item items[];
 };
 
+enum bpf_prog_array_flags {
+	BPF_PROG_ARRAY_F_SLEEPABLE = 1 << 0,
+	BPF_PROG_ARRAY_F_NON_SLEEPABLE = 1 << 1,
+};
+
 struct bpf_empty_prog_array {
 	struct bpf_prog_array hdr;
 	struct bpf_prog *null_prog;
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 761af6b3cf2b..2aa2a96526de 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1337,12 +1337,18 @@ struct bpf_sockopt_kern {
 	s32		level;
 	s32		optname;
 	s32		optlen;
+	u32		flags;
 	/* for retval in struct bpf_cg_run_ctx */
 	struct task_struct *current_task;
 	/* Temporary "register" for indirect stores to ppos. */
 	u64		tmp_reg;
 };
 
+enum bpf_sockopt_kern_flags {
+	/* optval is a pointer to user space memory */
+	BPF_SOCKOPT_FLAG_OPTVAL_USER    = (1U << 0),
+};
+
 int copy_bpf_fprog_from_user(struct sock_fprog *dst, sockptr_t src, int len);
 
 struct bpf_sk_lookup_kern {
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 5b2741aa0d9b..b977768a28e5 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -28,25 +28,46 @@ EXPORT_SYMBOL(cgroup_bpf_enabled_key);
  * function pointer.
  */
 static __always_inline int
-bpf_prog_run_array_cg(const struct cgroup_bpf *cgrp,
-		      enum cgroup_bpf_attach_type atype,
-		      const void *ctx, bpf_prog_run_fn run_prog,
-		      int retval, u32 *ret_flags)
+bpf_prog_run_array_cg_cb(const struct cgroup_bpf *cgrp,
+			 enum cgroup_bpf_attach_type atype,
+			 const void *ctx, bpf_prog_run_fn run_prog,
+			 int retval, u32 *ret_flags,
+			 int (*progs_cb)(void *, const struct bpf_prog_array *),
+			 void *progs_cb_arg)
 {
 	const struct bpf_prog_array_item *item;
 	const struct bpf_prog *prog;
 	const struct bpf_prog_array *array;
 	struct bpf_run_ctx *old_run_ctx;
 	struct bpf_cg_run_ctx run_ctx;
+	bool do_sleepable;
 	u32 func_ret;
+	int err;
+
+	do_sleepable =
+		atype == CGROUP_SETSOCKOPT || atype == CGROUP_GETSOCKOPT;
 
 	run_ctx.retval = retval;
 	migrate_disable();
-	rcu_read_lock();
+	if (do_sleepable) {
+		might_fault();
+		rcu_read_lock_trace();
+	} else
+		rcu_read_lock();
 	array = rcu_dereference(cgrp->effective[atype]);
 	item = &array->items[0];
+
+	if (progs_cb) {
+		err = progs_cb(progs_cb_arg, array);
+		if (err)
+			return err;
+	}
+
 	old_run_ctx = bpf_set_run_ctx(&run_ctx.run_ctx);
 	while ((prog = READ_ONCE(item->prog))) {
+		if (do_sleepable && !prog->aux->sleepable)
+			rcu_read_lock();
+
 		run_ctx.prog_item = item;
 		func_ret = run_prog(prog, ctx);
 		if (ret_flags) {
@@ -56,13 +77,29 @@ bpf_prog_run_array_cg(const struct cgroup_bpf *cgrp,
 		if (!func_ret && !IS_ERR_VALUE((long)run_ctx.retval))
 			run_ctx.retval = -EPERM;
 		item++;
+
+		if (do_sleepable && !prog->aux->sleepable)
+			rcu_read_unlock();
 	}
 	bpf_reset_run_ctx(old_run_ctx);
-	rcu_read_unlock();
+	if (do_sleepable)
+		rcu_read_unlock_trace();
+	else
+		rcu_read_unlock();
 	migrate_enable();
 	return run_ctx.retval;
 }
 
+static __always_inline int
+bpf_prog_run_array_cg(const struct cgroup_bpf *cgrp,
+		      enum cgroup_bpf_attach_type atype,
+		      const void *ctx, bpf_prog_run_fn run_prog,
+		      int retval, u32 *ret_flags)
+{
+	return bpf_prog_run_array_cg_cb(cgrp, atype, ctx, run_prog, retval,
+					ret_flags, NULL, NULL);
+}
+
 unsigned int __cgroup_bpf_run_lsm_sock(const void *ctx,
 				       const struct bpf_insn *insn)
 {
@@ -307,7 +344,7 @@ static void cgroup_bpf_release(struct work_struct *work)
 		old_array = rcu_dereference_protected(
 				cgrp->bpf.effective[atype],
 				lockdep_is_held(&cgroup_mutex));
-		bpf_prog_array_free(old_array);
+		bpf_prog_array_free_sleepable(old_array);
 	}
 
 	list_for_each_entry_safe(storage, stmp, storages, list_cg) {
@@ -402,6 +439,7 @@ static int compute_effective_progs(struct cgroup *cgrp,
 				   enum cgroup_bpf_attach_type atype,
 				   struct bpf_prog_array **array)
 {
+	bool has_non_sleepable = false, has_sleepable = false;
 	struct bpf_prog_array_item *item;
 	struct bpf_prog_array *progs;
 	struct bpf_prog_list *pl;
@@ -434,10 +472,19 @@ static int compute_effective_progs(struct cgroup *cgrp,
 			item->prog = prog_list_prog(pl);
 			bpf_cgroup_storages_assign(item->cgroup_storage,
 						   pl->storage);
+			if (item->prog->aux->sleepable)
+				has_sleepable = true;
+			else
+				has_non_sleepable = true;
 			cnt++;
 		}
 	} while ((p = cgroup_parent(p)));
 
+	if (has_non_sleepable)
+		progs->flags |= BPF_PROG_ARRAY_F_NON_SLEEPABLE;
+	if (has_sleepable)
+		progs->flags |= BPF_PROG_ARRAY_F_SLEEPABLE;
+
 	*array = progs;
 	return 0;
 }
@@ -451,7 +498,7 @@ static void activate_effective_progs(struct cgroup *cgrp,
 	/* free prog array after grace period, since __cgroup_bpf_run_*()
 	 * might be still walking the array
 	 */
-	bpf_prog_array_free(old_array);
+	bpf_prog_array_free_sleepable(old_array);
 }
 
 /**
@@ -491,7 +538,7 @@ int cgroup_bpf_inherit(struct cgroup *cgrp)
 	return 0;
 cleanup:
 	for (i = 0; i < NR; i++)
-		bpf_prog_array_free(arrays[i]);
+		bpf_prog_array_free_sleepable(arrays[i]);
 
 	for (p = cgroup_parent(cgrp); p; p = cgroup_parent(p))
 		cgroup_bpf_put(p);
@@ -525,7 +572,7 @@ static int update_effective_progs(struct cgroup *cgrp,
 
 		if (percpu_ref_is_zero(&desc->bpf.refcnt)) {
 			if (unlikely(desc->bpf.inactive)) {
-				bpf_prog_array_free(desc->bpf.inactive);
+				bpf_prog_array_free_sleepable(desc->bpf.inactive);
 				desc->bpf.inactive = NULL;
 			}
 			continue;
@@ -544,7 +591,7 @@ static int update_effective_progs(struct cgroup *cgrp,
 	css_for_each_descendant_pre(css, &cgrp->self) {
 		struct cgroup *desc = container_of(css, struct cgroup, self);
 
-		bpf_prog_array_free(desc->bpf.inactive);
+		bpf_prog_array_free_sleepable(desc->bpf.inactive);
 		desc->bpf.inactive = NULL;
 	}
 
@@ -1740,7 +1787,7 @@ int __cgroup_bpf_run_filter_sysctl(struct ctl_table_header *head,
 
 #ifdef CONFIG_NET
 static int sockopt_alloc_buf(struct bpf_sockopt_kern *ctx, int max_optlen,
-			     struct bpf_sockopt_buf *buf)
+			     struct bpf_sockopt_buf *buf, bool force_alloc)
 {
 	if (unlikely(max_optlen < 0))
 		return -EINVAL;
@@ -1752,7 +1799,7 @@ static int sockopt_alloc_buf(struct bpf_sockopt_kern *ctx, int max_optlen,
 		max_optlen = PAGE_SIZE;
 	}
 
-	if (max_optlen <= sizeof(buf->data)) {
+	if (max_optlen <= sizeof(buf->data) && !force_alloc) {
 		/* When the optval fits into BPF_SOCKOPT_KERN_BUF_SIZE
 		 * bytes avoid the cost of kzalloc.
 		 */
@@ -1773,7 +1820,8 @@ static int sockopt_alloc_buf(struct bpf_sockopt_kern *ctx, int max_optlen,
 static void sockopt_free_buf(struct bpf_sockopt_kern *ctx,
 			     struct bpf_sockopt_buf *buf)
 {
-	if (ctx->optval == buf->data)
+	if (ctx->optval == buf->data ||
+	    ctx->flags & BPF_SOCKOPT_FLAG_OPTVAL_USER)
 		return;
 	kfree(ctx->optval);
 }
@@ -1781,7 +1829,47 @@ static void sockopt_free_buf(struct bpf_sockopt_kern *ctx,
 static bool sockopt_buf_allocated(struct bpf_sockopt_kern *ctx,
 				  struct bpf_sockopt_buf *buf)
 {
-	return ctx->optval != buf->data;
+	return ctx->optval != buf->data &&
+		!(ctx->flags & BPF_SOCKOPT_FLAG_OPTVAL_USER);
+}
+
+struct filter_sockopt_cb_args {
+	struct bpf_sockopt_kern *ctx;
+	struct bpf_sockopt_buf *buf;
+	int max_optlen;
+};
+
+static int filter_setsockopt_progs_cb(void *arg,
+				      const struct bpf_prog_array *progs)
+{
+	struct filter_sockopt_cb_args *cb_args = arg;
+	struct bpf_sockopt_kern *ctx = cb_args->ctx;
+	char *optval = ctx->optval;
+	int max_optlen;
+
+	if (!(progs->flags & BPF_PROG_ARRAY_F_NON_SLEEPABLE))
+		return 0;
+
+	/* Allocate a bit more than the initial user buffer for
+	 * BPF program. The canonical use case is overriding
+	 * TCP_CONGESTION(nv) to TCP_CONGESTION(cubic).
+	 */
+	max_optlen = max_t(int, 16, ctx->optlen);
+	/* We need to force allocating from heap if there are sleepable
+	 * programs since they may created dynptrs from ctx->optval. In
+	 * this case, dynptrs will try to free the buffer that is actually
+	 * on the stack without this flag.
+	 */
+	max_optlen = sockopt_alloc_buf(ctx, max_optlen, cb_args->buf,
+				       progs->flags & BPF_PROG_ARRAY_F_SLEEPABLE);
+	if (max_optlen < 0)
+		return max_optlen;
+
+	if (copy_from_user(ctx->optval, optval,
+			   min(ctx->optlen, max_optlen)) != 0)
+		return -EFAULT;
+
+	return 0;
 }
 
 int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int *level,
@@ -1795,27 +1883,22 @@ int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int *level,
 		.level = *level,
 		.optname = *optname,
 	};
+	struct filter_sockopt_cb_args cb_args = {
+		.ctx = &ctx,
+		.buf = &buf,
+	};
 	int ret, max_optlen;
 
-	/* Allocate a bit more than the initial user buffer for
-	 * BPF program. The canonical use case is overriding
-	 * TCP_CONGESTION(nv) to TCP_CONGESTION(cubic).
-	 */
-	max_optlen = max_t(int, 16, *optlen);
-	max_optlen = sockopt_alloc_buf(&ctx, max_optlen, &buf);
-	if (max_optlen < 0)
-		return max_optlen;
-
+	max_optlen = *optlen;
 	ctx.optlen = *optlen;
-
-	if (copy_from_user(ctx.optval, optval, min(*optlen, max_optlen)) != 0) {
-		ret = -EFAULT;
-		goto out;
-	}
+	ctx.optval = optval;
+	ctx.optval_end = optval + *optlen;
+	ctx.flags = BPF_SOCKOPT_FLAG_OPTVAL_USER;
 
 	lock_sock(sk);
-	ret = bpf_prog_run_array_cg(&cgrp->bpf, CGROUP_SETSOCKOPT,
-				    &ctx, bpf_prog_run, 0, NULL);
+	ret = bpf_prog_run_array_cg_cb(&cgrp->bpf, CGROUP_SETSOCKOPT,
+				       &ctx, bpf_prog_run, 0, NULL,
+				       filter_setsockopt_progs_cb, &cb_args);
 	release_sock(sk);
 
 	if (ret)
@@ -1824,7 +1907,8 @@ int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int *level,
 	if (ctx.optlen == -1) {
 		/* optlen set to -1, bypass kernel */
 		ret = 1;
-	} else if (ctx.optlen > max_optlen || ctx.optlen < -1) {
+	} else if (ctx.optlen > (ctx.optval_end - ctx.optval) ||
+		   ctx.optlen < -1) {
 		/* optlen is out of bounds */
 		if (*optlen > PAGE_SIZE && ctx.optlen >= 0) {
 			pr_info_once("bpf setsockopt: ignoring program buffer with optlen=%d (max_optlen=%d)\n",
@@ -1846,6 +1930,8 @@ int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int *level,
 		 */
 		if (ctx.optlen != 0) {
 			*optlen = ctx.optlen;
+			if (ctx.flags & BPF_SOCKOPT_FLAG_OPTVAL_USER)
+				return 0;
 			/* We've used bpf_sockopt_kern->buf as an intermediary
 			 * storage, but the BPF program indicates that we need
 			 * to pass this data to the kernel setsockopt handler.
@@ -1874,6 +1960,33 @@ int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int *level,
 	return ret;
 }
 
+static int filter_getsockopt_progs_cb(void *arg,
+				      const struct bpf_prog_array *progs)
+{
+	struct filter_sockopt_cb_args *cb_args = arg;
+	struct bpf_sockopt_kern *ctx = cb_args->ctx;
+	int max_optlen;
+	char *optval;
+
+	if (!(progs->flags & BPF_PROG_ARRAY_F_NON_SLEEPABLE))
+		return 0;
+
+	optval = ctx->optval;
+	max_optlen = sockopt_alloc_buf(ctx, cb_args->max_optlen,
+				       cb_args->buf, false);
+	if (max_optlen < 0)
+		return max_optlen;
+
+	if (copy_from_user(ctx->optval, optval,
+			   min(ctx->optlen, max_optlen)) != 0)
+		return -EFAULT;
+
+	ctx->flags = 0;
+	cb_args->max_optlen = max_optlen;
+
+	return 0;
+}
+
 int __cgroup_bpf_run_filter_getsockopt(struct sock *sk, int level,
 				       int optname, char __user *optval,
 				       int __user *optlen, int max_optlen,
@@ -1887,15 +2000,16 @@ int __cgroup_bpf_run_filter_getsockopt(struct sock *sk, int level,
 		.optname = optname,
 		.current_task = current,
 	};
+	struct filter_sockopt_cb_args cb_args = {
+		.ctx = &ctx,
+		.buf = &buf,
+		.max_optlen = max_optlen,
+	};
 	int orig_optlen;
 	int ret;
 
 	orig_optlen = max_optlen;
 	ctx.optlen = max_optlen;
-	max_optlen = sockopt_alloc_buf(&ctx, max_optlen, &buf);
-	if (max_optlen < 0)
-		return max_optlen;
-
 	if (!retval) {
 		/* If kernel getsockopt finished successfully,
 		 * copy whatever was returned to the user back
@@ -1914,18 +2028,19 @@ int __cgroup_bpf_run_filter_getsockopt(struct sock *sk, int level,
 			goto out;
 		}
 		orig_optlen = ctx.optlen;
-
-		if (copy_from_user(ctx.optval, optval,
-				   min(ctx.optlen, max_optlen)) != 0) {
-			ret = -EFAULT;
-			goto out;
-		}
 	}
 
+	ctx.optval = optval;
+	ctx.optval_end = optval + max_optlen;
+	ctx.flags = BPF_SOCKOPT_FLAG_OPTVAL_USER;
+
 	lock_sock(sk);
-	ret = bpf_prog_run_array_cg(&cgrp->bpf, CGROUP_GETSOCKOPT,
-				    &ctx, bpf_prog_run, retval, NULL);
+	ret = bpf_prog_run_array_cg_cb(&cgrp->bpf, CGROUP_GETSOCKOPT,
+				       &ctx, bpf_prog_run, retval, NULL,
+				       filter_getsockopt_progs_cb,
+				       &cb_args);
 	release_sock(sk);
+	max_optlen = cb_args.max_optlen;
 
 	if (ret < 0)
 		goto out;
@@ -1942,7 +2057,9 @@ int __cgroup_bpf_run_filter_getsockopt(struct sock *sk, int level,
 	}
 
 	if (ctx.optlen != 0) {
-		if (optval && copy_to_user(optval, ctx.optval, ctx.optlen)) {
+		if (optval &&
+		    !(ctx.flags & BPF_SOCKOPT_FLAG_OPTVAL_USER) &&
+		    copy_to_user(optval, ctx.optval, ctx.optlen)) {
 			ret = -EFAULT;
 			goto out;
 		}
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 4ccca1f6c998..61be432b9420 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -19549,7 +19549,8 @@ static bool can_be_sleepable(struct bpf_prog *prog)
 	}
 	return prog->type == BPF_PROG_TYPE_LSM ||
 	       prog->type == BPF_PROG_TYPE_KPROBE /* only for uprobes */ ||
-	       prog->type == BPF_PROG_TYPE_STRUCT_OPS;
+	       prog->type == BPF_PROG_TYPE_STRUCT_OPS ||
+	       prog->type == BPF_PROG_TYPE_CGROUP_SOCKOPT;
 }
 
 static int check_attach_btf_id(struct bpf_verifier_env *env)
@@ -19571,7 +19572,7 @@ static int check_attach_btf_id(struct bpf_verifier_env *env)
 	}
 
 	if (prog->aux->sleepable && !can_be_sleepable(prog)) {
-		verbose(env, "Only fentry/fexit/fmod_ret, lsm, iter, uprobe, and struct_ops programs can be sleepable\n");
+		verbose(env, "Only fentry/fexit/fmod_ret, lsm, iter, uprobe, cgroup, and struct_ops programs can be sleepable\n");
 		return -EINVAL;
 	}
 
-- 
2.34.1



* [RFC bpf-next v3 2/5] libbpf: add sleepable sections for {get,set}sockopt()
  2023-08-15 17:47 [RFC bpf-next v3 0/5] Sleepable BPF programs on cgroup {get,set}sockopt thinker.li
  2023-08-15 17:47 ` [RFC bpf-next v3 1/5] bpf: enable sleepable BPF programs attached to cgroup/{get,set}sockopt thinker.li
@ 2023-08-15 17:47 ` thinker.li
  2023-08-15 17:47 ` [RFC bpf-next v3 3/5] bpf: Prevent BPF programs from access the buffer pointed by user_optval thinker.li
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 28+ messages in thread
From: thinker.li @ 2023-08-15 17:47 UTC (permalink / raw)
  To: bpf, ast, martin.lau, song, kernel-team, andrii, sdf,
	yonghong.song
  Cc: sinquersw, kuifeng, Kui-Feng Lee

From: Kui-Feng Lee <thinker.li@gmail.com>

Enable libbpf users to define sleepable programs attached to
{get,set}sockopt hooks.  Sleepable programs should be defined with
SEC("cgroup/getsockopt.s") and SEC("cgroup/setsockopt.s") respectively.
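For example (a sketch; assumes this series is applied):

```c
/* With this patch, the ".s" suffix makes libbpf load the program as
 * sleepable (BPF_F_SLEEPABLE); no other user-visible change is needed.
 */
SEC("cgroup/setsockopt.s")
int sleepable_setsockopt(struct bpf_sockopt *ctx)
{
	return 1;	/* allow the setsockopt() to proceed */
}

char _license[] SEC("license") = "GPL";
```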

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
---
 tools/lib/bpf/libbpf.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index b14a4376a86e..ddd6dc166e3e 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -8766,7 +8766,9 @@ static const struct bpf_sec_def section_defs[] = {
 	SEC_DEF("cgroup/getsockname6",	CGROUP_SOCK_ADDR, BPF_CGROUP_INET6_GETSOCKNAME, SEC_ATTACHABLE),
 	SEC_DEF("cgroup/sysctl",	CGROUP_SYSCTL, BPF_CGROUP_SYSCTL, SEC_ATTACHABLE),
 	SEC_DEF("cgroup/getsockopt",	CGROUP_SOCKOPT, BPF_CGROUP_GETSOCKOPT, SEC_ATTACHABLE),
+	SEC_DEF("cgroup/getsockopt.s",	CGROUP_SOCKOPT, BPF_CGROUP_GETSOCKOPT, SEC_ATTACHABLE | SEC_SLEEPABLE),
 	SEC_DEF("cgroup/setsockopt",	CGROUP_SOCKOPT, BPF_CGROUP_SETSOCKOPT, SEC_ATTACHABLE),
+	SEC_DEF("cgroup/setsockopt.s",	CGROUP_SOCKOPT, BPF_CGROUP_SETSOCKOPT, SEC_ATTACHABLE | SEC_SLEEPABLE),
 	SEC_DEF("cgroup/dev",		CGROUP_DEVICE, BPF_CGROUP_DEVICE, SEC_ATTACHABLE_OPT),
 	SEC_DEF("struct_ops+",		STRUCT_OPS, 0, SEC_NONE),
 	SEC_DEF("struct_ops.s+",	STRUCT_OPS, 0, SEC_SLEEPABLE),
-- 
2.34.1



* [RFC bpf-next v3 3/5] bpf: Prevent BPF programs from access the buffer pointed by user_optval.
  2023-08-15 17:47 [RFC bpf-next v3 0/5] Sleepable BPF programs on cgroup {get,set}sockopt thinker.li
  2023-08-15 17:47 ` [RFC bpf-next v3 1/5] bpf: enable sleepable BPF programs attached to cgroup/{get,set}sockopt thinker.li
  2023-08-15 17:47 ` [RFC bpf-next v3 2/5] libbpf: add sleepable sections for {get,set}sockopt() thinker.li
@ 2023-08-15 17:47 ` thinker.li
  2023-08-17  0:55   ` Martin KaFai Lau
  2023-08-17  1:17   ` Alexei Starovoitov
  2023-08-15 17:47 ` [RFC bpf-next v3 4/5] bpf: Add a new dynptr type for CGRUP_SOCKOPT thinker.li
  2023-08-15 17:47 ` [RFC bpf-next v3 5/5] selftests/bpf: Add test cases for sleepable BPF programs of the CGROUP_SOCKOPT type thinker.li
  4 siblings, 2 replies; 28+ messages in thread
From: thinker.li @ 2023-08-15 17:47 UTC (permalink / raw)
  To: bpf, ast, martin.lau, song, kernel-team, andrii, sdf,
	yonghong.song
  Cc: sinquersw, kuifeng, Kui-Feng Lee

From: Kui-Feng Lee <thinker.li@gmail.com>

Since the buffer pointed to by ctx->user_optval is in user space, BPF
programs running in kernel space should not access it directly.  They
should use bpf_copy_from_user() and bpf_copy_to_user() to move data
between user and kernel space.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
---
 kernel/bpf/cgroup.c   | 16 +++++++++--
 kernel/bpf/verifier.c | 66 +++++++++++++++++++++----------------------
 2 files changed, 47 insertions(+), 35 deletions(-)

diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index b977768a28e5..425094e071ba 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -2494,12 +2494,24 @@ static bool cg_sockopt_is_valid_access(int off, int size,
 	case offsetof(struct bpf_sockopt, optval):
 		if (size != sizeof(__u64))
 			return false;
-		info->reg_type = PTR_TO_PACKET;
+		if (prog->aux->sleepable)
+			/* Prohibit access to the memory pointed by optval
+			 * in sleepable programs.
+			 */
+			info->reg_type = PTR_TO_PACKET | MEM_USER;
+		else
+			info->reg_type = PTR_TO_PACKET;
 		break;
 	case offsetof(struct bpf_sockopt, optval_end):
 		if (size != sizeof(__u64))
 			return false;
-		info->reg_type = PTR_TO_PACKET_END;
+		if (prog->aux->sleepable)
+			/* Prohibit access to the memory pointed by
+			 * optval_end in sleepable programs.
+			 */
+			info->reg_type = PTR_TO_PACKET_END | MEM_USER;
+		else
+			info->reg_type = PTR_TO_PACKET_END;
 		break;
 	case offsetof(struct bpf_sockopt, retval):
 		if (size != size_default)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 61be432b9420..936a171ea976 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -13373,7 +13373,7 @@ static void find_good_pkt_pointers(struct bpf_verifier_state *vstate,
 	 * dst_reg->off is known < MAX_PACKET_OFF, therefore it fits in a u16.
 	 */
 	bpf_for_each_reg_in_vstate(vstate, state, reg, ({
-		if (reg->type == type && reg->id == dst_reg->id)
+		if (base_type(reg->type) == type && reg->id == dst_reg->id)
 			/* keep the maximum range already checked */
 			reg->range = max(reg->range, new_range);
 	}));
@@ -13926,84 +13926,84 @@ static bool try_match_pkt_pointers(const struct bpf_insn *insn,
 
 	switch (BPF_OP(insn->code)) {
 	case BPF_JGT:
-		if ((dst_reg->type == PTR_TO_PACKET &&
-		     src_reg->type == PTR_TO_PACKET_END) ||
-		    (dst_reg->type == PTR_TO_PACKET_META &&
+		if ((base_type(dst_reg->type) == PTR_TO_PACKET &&
+		     base_type(src_reg->type) == PTR_TO_PACKET_END) ||
+		    (base_type(dst_reg->type) == PTR_TO_PACKET_META &&
 		     reg_is_init_pkt_pointer(src_reg, PTR_TO_PACKET))) {
 			/* pkt_data' > pkt_end, pkt_meta' > pkt_data */
 			find_good_pkt_pointers(this_branch, dst_reg,
-					       dst_reg->type, false);
+					       base_type(dst_reg->type), false);
 			mark_pkt_end(other_branch, insn->dst_reg, true);
-		} else if ((dst_reg->type == PTR_TO_PACKET_END &&
-			    src_reg->type == PTR_TO_PACKET) ||
+		} else if ((base_type(dst_reg->type) == PTR_TO_PACKET_END &&
+			    base_type(src_reg->type) == PTR_TO_PACKET) ||
 			   (reg_is_init_pkt_pointer(dst_reg, PTR_TO_PACKET) &&
-			    src_reg->type == PTR_TO_PACKET_META)) {
+			    base_type(src_reg->type) == PTR_TO_PACKET_META)) {
 			/* pkt_end > pkt_data', pkt_data > pkt_meta' */
 			find_good_pkt_pointers(other_branch, src_reg,
-					       src_reg->type, true);
+					       base_type(src_reg->type), true);
 			mark_pkt_end(this_branch, insn->src_reg, false);
 		} else {
 			return false;
 		}
 		break;
 	case BPF_JLT:
-		if ((dst_reg->type == PTR_TO_PACKET &&
-		     src_reg->type == PTR_TO_PACKET_END) ||
-		    (dst_reg->type == PTR_TO_PACKET_META &&
+		if ((base_type(dst_reg->type) == PTR_TO_PACKET &&
+		     base_type(src_reg->type) == PTR_TO_PACKET_END) ||
+		    (base_type(dst_reg->type) == PTR_TO_PACKET_META &&
 		     reg_is_init_pkt_pointer(src_reg, PTR_TO_PACKET))) {
 			/* pkt_data' < pkt_end, pkt_meta' < pkt_data */
 			find_good_pkt_pointers(other_branch, dst_reg,
-					       dst_reg->type, true);
+					       base_type(dst_reg->type), true);
 			mark_pkt_end(this_branch, insn->dst_reg, false);
-		} else if ((dst_reg->type == PTR_TO_PACKET_END &&
-			    src_reg->type == PTR_TO_PACKET) ||
+		} else if ((base_type(dst_reg->type) == PTR_TO_PACKET_END &&
+			    base_type(src_reg->type) == PTR_TO_PACKET) ||
 			   (reg_is_init_pkt_pointer(dst_reg, PTR_TO_PACKET) &&
-			    src_reg->type == PTR_TO_PACKET_META)) {
+			    base_type(src_reg->type) == PTR_TO_PACKET_META)) {
 			/* pkt_end < pkt_data', pkt_data > pkt_meta' */
 			find_good_pkt_pointers(this_branch, src_reg,
-					       src_reg->type, false);
+					       base_type(src_reg->type), false);
 			mark_pkt_end(other_branch, insn->src_reg, true);
 		} else {
 			return false;
 		}
 		break;
 	case BPF_JGE:
-		if ((dst_reg->type == PTR_TO_PACKET &&
-		     src_reg->type == PTR_TO_PACKET_END) ||
-		    (dst_reg->type == PTR_TO_PACKET_META &&
+		if ((base_type(dst_reg->type) == PTR_TO_PACKET &&
+		     base_type(src_reg->type) == PTR_TO_PACKET_END) ||
+		    (base_type(dst_reg->type) == PTR_TO_PACKET_META &&
 		     reg_is_init_pkt_pointer(src_reg, PTR_TO_PACKET))) {
 			/* pkt_data' >= pkt_end, pkt_meta' >= pkt_data */
 			find_good_pkt_pointers(this_branch, dst_reg,
-					       dst_reg->type, true);
+					       base_type(dst_reg->type), true);
 			mark_pkt_end(other_branch, insn->dst_reg, false);
-		} else if ((dst_reg->type == PTR_TO_PACKET_END &&
-			    src_reg->type == PTR_TO_PACKET) ||
+		} else if ((base_type(dst_reg->type) == PTR_TO_PACKET_END &&
+			    base_type(src_reg->type) == PTR_TO_PACKET) ||
 			   (reg_is_init_pkt_pointer(dst_reg, PTR_TO_PACKET) &&
-			    src_reg->type == PTR_TO_PACKET_META)) {
+			    base_type(src_reg->type) == PTR_TO_PACKET_META)) {
 			/* pkt_end >= pkt_data', pkt_data >= pkt_meta' */
 			find_good_pkt_pointers(other_branch, src_reg,
-					       src_reg->type, false);
+					       base_type(src_reg->type), false);
 			mark_pkt_end(this_branch, insn->src_reg, true);
 		} else {
 			return false;
 		}
 		break;
 	case BPF_JLE:
-		if ((dst_reg->type == PTR_TO_PACKET &&
-		     src_reg->type == PTR_TO_PACKET_END) ||
-		    (dst_reg->type == PTR_TO_PACKET_META &&
+		if ((base_type(dst_reg->type) == PTR_TO_PACKET &&
+		     base_type(src_reg->type) == PTR_TO_PACKET_END) ||
+		    (base_type(dst_reg->type) == PTR_TO_PACKET_META &&
 		     reg_is_init_pkt_pointer(src_reg, PTR_TO_PACKET))) {
 			/* pkt_data' <= pkt_end, pkt_meta' <= pkt_data */
 			find_good_pkt_pointers(other_branch, dst_reg,
-					       dst_reg->type, false);
+					       base_type(dst_reg->type), false);
 			mark_pkt_end(this_branch, insn->dst_reg, true);
-		} else if ((dst_reg->type == PTR_TO_PACKET_END &&
-			    src_reg->type == PTR_TO_PACKET) ||
+		} else if ((base_type(dst_reg->type) == PTR_TO_PACKET_END &&
+			    base_type(src_reg->type) == PTR_TO_PACKET) ||
 			   (reg_is_init_pkt_pointer(dst_reg, PTR_TO_PACKET) &&
-			    src_reg->type == PTR_TO_PACKET_META)) {
+			    base_type(src_reg->type) == PTR_TO_PACKET_META)) {
 			/* pkt_end <= pkt_data', pkt_data <= pkt_meta' */
 			find_good_pkt_pointers(this_branch, src_reg,
-					       src_reg->type, true);
+					       base_type(src_reg->type), true);
 			mark_pkt_end(other_branch, insn->src_reg, false);
 		} else {
 			return false;
-- 
2.34.1



* [RFC bpf-next v3 4/5] bpf: Add a new dynptr type for CGROUP_SOCKOPT.
  2023-08-15 17:47 [RFC bpf-next v3 0/5] Sleepable BPF programs on cgroup {get,set}sockopt thinker.li
                   ` (2 preceding siblings ...)
  2023-08-15 17:47 ` [RFC bpf-next v3 3/5] bpf: Prevent BPF programs from access the buffer pointed by user_optval thinker.li
@ 2023-08-15 17:47 ` thinker.li
  2023-08-16 14:19   ` kernel test robot
  2023-08-17  1:25   ` Alexei Starovoitov
  2023-08-15 17:47 ` [RFC bpf-next v3 5/5] selftests/bpf: Add test cases for sleepable BPF programs of the CGROUP_SOCKOPT type thinker.li
  4 siblings, 2 replies; 28+ messages in thread
From: thinker.li @ 2023-08-15 17:47 UTC (permalink / raw)
  To: bpf, ast, martin.lau, song, kernel-team, andrii, sdf,
	yonghong.song
  Cc: sinquersw, kuifeng, Kui-Feng Lee

From: Kui-Feng Lee <thinker.li@gmail.com>

The new dynptr type (BPF_DYNPTR_TYPE_CGROUP_SOCKOPT) will be used by BPF
programs to create a buffer that can be installed on the ctx to replace
the existing optval or user_optval.

Installation is only allowed if the BPF_SOCKOPT_FLAG_OPTVAL_REPLACE bit
is set in ctx->flags, which is enabled only for sleepable programs on
the cgroup/setsockopt hook.  BPF programs can install a new buffer held
by a dynptr to increase the size of the optval passed to setsockopt().

Installation is not enabled for cgroup/getsockopt since a buffer
created by the user program cannot be enlarged to return data from
getsockopt().
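For illustration, the ownership transfer that installation performs can be
modeled in plain userspace C. The struct, flag, and function names below
are simplified stand-ins for bpf_sockopt_kern and bpf_sockopt_dynptr_install(),
not kernel code:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-ins for the bpf_sockopt_kern flags. */
#define OPTVAL_USER    (1u << 0)  /* optval points at user memory */
#define OPTVAL_REPLACE (1u << 1)  /* a new optval may be installed */
#define OPTVAL_DYNPTR  (1u << 2)  /* optval is referenced by a dynptr */

struct sockopt_model {
	char *optval;
	char *optval_end;
	int optlen;
	unsigned int flags;
};

/* Model of the install step: the context takes over the dynptr's
 * buffer, frees any kernel buffer it previously owned, and clears the
 * USER/DYNPTR marks so the new buffer is treated as kernel-owned.
 */
static int sockopt_install(struct sockopt_model *m, char *buf, int size)
{
	if (!(m->flags & OPTVAL_REPLACE) || !buf)
		return -1;		/* -EINVAL in the kernel */

	if (m->optval && !(m->flags & (OPTVAL_USER | OPTVAL_DYNPTR)))
		free(m->optval);	/* old kernel-owned buffer */

	m->optval = buf;
	m->optval_end = buf + size;
	m->optlen = size;
	m->flags &= ~(OPTVAL_USER | OPTVAL_DYNPTR);
	return 0;
}
```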

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
---
 include/linux/bpf.h    |   7 +-
 include/linux/filter.h |   4 +
 kernel/bpf/btf.c       |   3 +
 kernel/bpf/cgroup.c    |   5 +-
 kernel/bpf/helpers.c   | 197 +++++++++++++++++++++++++++++++++++++++++
 kernel/bpf/verifier.c  |  47 +++++++++-
 6 files changed, 259 insertions(+), 4 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index edb35bcfa548..b9e4d7752555 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -663,12 +663,15 @@ enum bpf_type_flag {
 	/* DYNPTR points to xdp_buff */
 	DYNPTR_TYPE_XDP		= BIT(16 + BPF_BASE_TYPE_BITS),
 
+	/* DYNPTR points to optval buffer of bpf_sockopt */
+	DYNPTR_TYPE_CGROUP_SOCKOPT = BIT(17 + BPF_BASE_TYPE_BITS),
+
 	__BPF_TYPE_FLAG_MAX,
 	__BPF_TYPE_LAST_FLAG	= __BPF_TYPE_FLAG_MAX - 1,
 };
 
 #define DYNPTR_TYPE_FLAG_MASK	(DYNPTR_TYPE_LOCAL | DYNPTR_TYPE_RINGBUF | DYNPTR_TYPE_SKB \
-				 | DYNPTR_TYPE_XDP)
+				 | DYNPTR_TYPE_XDP | DYNPTR_TYPE_CGROUP_SOCKOPT)
 
 /* Max number of base types. */
 #define BPF_BASE_TYPE_LIMIT	(1UL << BPF_BASE_TYPE_BITS)
@@ -1206,6 +1209,8 @@ enum bpf_dynptr_type {
 	BPF_DYNPTR_TYPE_SKB,
 	/* Underlying data is a xdp_buff */
 	BPF_DYNPTR_TYPE_XDP,
+	/* Underlying data is for the optval of a cgroup sock */
+	BPF_DYNPTR_TYPE_CGROUP_SOCKOPT,
 };
 
 int bpf_dynptr_check_size(u32 size);
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 2aa2a96526de..df12fddd2f21 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1347,6 +1347,10 @@ struct bpf_sockopt_kern {
 enum bpf_sockopt_kern_flags {
 	/* optval is a pointer to user space memory */
 	BPF_SOCKOPT_FLAG_OPTVAL_USER    = (1U << 0),
+	/* able to install new optval */
+	BPF_SOCKOPT_FLAG_OPTVAL_REPLACE	= (1U << 1),
+	/* optval is referenced by a dynptr */
+	BPF_SOCKOPT_FLAG_OPTVAL_DYNPTR	= (1U << 2),
 };
 
 int copy_bpf_fprog_from_user(struct sock_fprog *dst, sockptr_t src, int len);
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 249657c466dd..6d6a040688be 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -217,6 +217,7 @@ enum btf_kfunc_hook {
 	BTF_KFUNC_HOOK_SOCKET_FILTER,
 	BTF_KFUNC_HOOK_LWT,
 	BTF_KFUNC_HOOK_NETFILTER,
+	BTF_KFUNC_HOOK_CGROUP_SOCKOPT,
 	BTF_KFUNC_HOOK_MAX,
 };
 
@@ -7846,6 +7847,8 @@ static int bpf_prog_type_to_kfunc_hook(enum bpf_prog_type prog_type)
 		return BTF_KFUNC_HOOK_LWT;
 	case BPF_PROG_TYPE_NETFILTER:
 		return BTF_KFUNC_HOOK_NETFILTER;
+	case BPF_PROG_TYPE_CGROUP_SOCKOPT:
+		return BTF_KFUNC_HOOK_CGROUP_SOCKOPT;
 	default:
 		return BTF_KFUNC_HOOK_MAX;
 	}
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 425094e071ba..164dee8753cf 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -1865,6 +1865,8 @@ static int filter_setsockopt_progs_cb(void *arg,
 	if (max_optlen < 0)
 		return max_optlen;
 
+	ctx->flags = BPF_SOCKOPT_FLAG_OPTVAL_REPLACE;
+
 	if (copy_from_user(ctx->optval, optval,
 			   min(ctx->optlen, max_optlen)) != 0)
 		return -EFAULT;
@@ -1893,7 +1895,8 @@ int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int *level,
 	ctx.optlen = *optlen;
 	ctx.optval = optval;
 	ctx.optval_end = optval + *optlen;
-	ctx.flags = BPF_SOCKOPT_FLAG_OPTVAL_USER;
+	ctx.flags = BPF_SOCKOPT_FLAG_OPTVAL_USER |
+		BPF_SOCKOPT_FLAG_OPTVAL_REPLACE;
 
 	lock_sock(sk);
 	ret = bpf_prog_run_array_cg_cb(&cgrp->bpf, CGROUP_SETSOCKOPT,
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index eb91cae0612a..fc38aff02654 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1537,6 +1537,7 @@ BPF_CALL_5(bpf_dynptr_read, void *, dst, u32, len, const struct bpf_dynptr_kern
 	switch (type) {
 	case BPF_DYNPTR_TYPE_LOCAL:
 	case BPF_DYNPTR_TYPE_RINGBUF:
+	case BPF_DYNPTR_TYPE_CGROUP_SOCKOPT:
 		/* Source and destination may possibly overlap, hence use memmove to
 		 * copy the data. E.g. bpf_dynptr_from_mem may create two dynptr
 		 * pointing to overlapping PTR_TO_MAP_VALUE regions.
@@ -1582,6 +1583,7 @@ BPF_CALL_5(bpf_dynptr_write, const struct bpf_dynptr_kern *, dst, u32, offset, v
 	switch (type) {
 	case BPF_DYNPTR_TYPE_LOCAL:
 	case BPF_DYNPTR_TYPE_RINGBUF:
+	case BPF_DYNPTR_TYPE_CGROUP_SOCKOPT:
 		if (flags)
 			return -EINVAL;
 		/* Source and destination may possibly overlap, hence use memmove to
@@ -1634,6 +1636,7 @@ BPF_CALL_3(bpf_dynptr_data, const struct bpf_dynptr_kern *, ptr, u32, offset, u3
 	switch (type) {
 	case BPF_DYNPTR_TYPE_LOCAL:
 	case BPF_DYNPTR_TYPE_RINGBUF:
+	case BPF_DYNPTR_TYPE_CGROUP_SOCKOPT:
 		return (unsigned long)(ptr->data + ptr->offset + offset);
 	case BPF_DYNPTR_TYPE_SKB:
 	case BPF_DYNPTR_TYPE_XDP:
@@ -2261,6 +2264,7 @@ __bpf_kfunc void *bpf_dynptr_slice(const struct bpf_dynptr_kern *ptr, u32 offset
 	switch (type) {
 	case BPF_DYNPTR_TYPE_LOCAL:
 	case BPF_DYNPTR_TYPE_RINGBUF:
+	case BPF_DYNPTR_TYPE_CGROUP_SOCKOPT:
 		return ptr->data + ptr->offset + offset;
 	case BPF_DYNPTR_TYPE_SKB:
 		if (buffer__opt)
@@ -2429,6 +2433,185 @@ __bpf_kfunc void bpf_rcu_read_unlock(void)
 	rcu_read_unlock();
 }
 
+/* Create a buffer of the given size for a {set,get}sockopt BPF filter.
+ *
+ * This kfunc is only available in sleepable contexts. The dynptr should be
+ * released by bpf_sockopt_dynptr_install() or bpf_sockopt_dynptr_release().
+ */
+__bpf_kfunc int bpf_sockopt_dynptr_alloc(struct bpf_sockopt *sopt, int size,
+					 struct bpf_dynptr_kern *ptr__uninit)
+{
+	void *optval;
+	int err;
+
+	bpf_dynptr_set_null(ptr__uninit);
+
+	err = bpf_dynptr_check_size(size);
+	if (err)
+		return err;
+
+	optval = kzalloc(size, GFP_KERNEL);
+	if (!optval)
+		return -ENOMEM;
+
+	bpf_dynptr_init(ptr__uninit, optval,
+			BPF_DYNPTR_TYPE_CGROUP_SOCKOPT, 0, size);
+
+	return size;
+}
+
+/* Install the buffer of the dynptr into the sockopt context.
+ *
+ * This kfunc is only available in sleepable contexts. The dynptr should be
+ * allocated by bpf_sockopt_dynptr_alloc().  The dynptr is invalid after
+ * returning from this function successfully.
+ */
+__bpf_kfunc int bpf_sockopt_dynptr_install(struct bpf_sockopt *sopt,
+					   struct bpf_dynptr_kern *ptr)
+{
+	struct bpf_sockopt_kern *sopt_kern = (struct bpf_sockopt_kern *)sopt;
+
+	if (!(sopt_kern->flags & BPF_SOCKOPT_FLAG_OPTVAL_REPLACE) ||
+	    bpf_dynptr_get_type(ptr) != BPF_DYNPTR_TYPE_CGROUP_SOCKOPT ||
+	    !ptr->data)
+		return -EINVAL;
+
+	if (sopt_kern->optval == ptr->data &&
+	    !(sopt_kern->flags & BPF_SOCKOPT_FLAG_OPTVAL_USER)) {
+		/* This dynptr is initialized by bpf_sockopt_dynptr_from()
+		 * and the optval is not overwritten by
+		 * bpf_sockopt_dynptr_install() yet.
+		 */
+		bpf_dynptr_set_null(ptr);
+		sopt_kern->flags &= ~BPF_SOCKOPT_FLAG_OPTVAL_DYNPTR;
+		return 0;
+	}
+
+	if (sopt_kern->optval &&
+	    !(sopt_kern->flags & (BPF_SOCKOPT_FLAG_OPTVAL_USER |
+				  BPF_SOCKOPT_FLAG_OPTVAL_DYNPTR)))
+		kfree(sopt_kern->optval);
+
+	sopt_kern->optval = ptr->data;
+	sopt_kern->optval_end = ptr->data + __bpf_dynptr_size(ptr);
+	sopt_kern->optlen = __bpf_dynptr_size(ptr);
+	sopt_kern->flags &= ~(BPF_SOCKOPT_FLAG_OPTVAL_USER |
+			      BPF_SOCKOPT_FLAG_OPTVAL_DYNPTR);
+
+	bpf_dynptr_set_null(ptr);
+
+	return 0;
+}
+
+__bpf_kfunc int bpf_sockopt_dynptr_release(struct bpf_sockopt *sopt,
+					   struct bpf_dynptr_kern *ptr)
+{
+	struct bpf_sockopt_kern *sopt_kern = (struct bpf_sockopt_kern *)sopt;
+
+	if (bpf_dynptr_get_type(ptr) != BPF_DYNPTR_TYPE_CGROUP_SOCKOPT ||
+	    !ptr->data)
+		return -EINVAL;
+
+	if (sopt_kern->optval == ptr->data &&
+	    !(sopt_kern->flags & BPF_SOCKOPT_FLAG_OPTVAL_USER))
+		/* This dynptr is initialized by bpf_sockopt_dynptr_from()
+		 * and the optval is not overwritten by
+		 * bpf_sockopt_dynptr_install() yet.
+		 */
+		sopt_kern->flags &= ~BPF_SOCKOPT_FLAG_OPTVAL_DYNPTR;
+	else
+		kfree(ptr->data);
+	bpf_dynptr_set_null(ptr);
+
+	return 0;
+}
+
+/* Initialize a sockopt dynptr from a user or installed optval pointer.
+ *
+ * sopt->optval can be a user pointer or a kernel pointer. A kernel pointer
+ * can be a buffer allocated by the caller of the BPF program or a buffer
+ * installed by other BPF programs through bpf_sockopt_dynptr_install().
+ *
+ * At most one dynptr created by this function may exist at any moment,
+ * otherwise it returns -EINVAL. Another dynptr can be created with this
+ * function after releasing the previous one with bpf_sockopt_dynptr_release().
+ *
+ * A dynptr that is initialized when optval is a user pointer is an
+ * exception. In this case, the dynptr will point to a kernel buffer with
+ * the same content as the user buffer. To simplify the code, users should
+ * always ensure that at most one dynptr initialized by this function
+ * exists at any moment.
+ */
+__bpf_kfunc int bpf_sockopt_dynptr_from(struct bpf_sockopt *sopt,
+					struct bpf_dynptr_kern *ptr__uninit,
+					unsigned int size)
+{
+	struct bpf_sockopt_kern *sopt_kern = (struct bpf_sockopt_kern *)sopt;
+	int err;
+
+	bpf_dynptr_set_null(ptr__uninit);
+
+	if (size > (sopt_kern->optval_end - sopt_kern->optval))
+		return -EINVAL;
+
+	if (size == 0)
+		size = min(sopt_kern->optlen,
+			   (int)(sopt_kern->optval_end - sopt_kern->optval));
+
+	if (sopt_kern->flags & BPF_SOCKOPT_FLAG_OPTVAL_DYNPTR)
+		return -EINVAL;
+
+	if (sopt_kern->flags & BPF_SOCKOPT_FLAG_OPTVAL_USER) {
+		err = bpf_sockopt_dynptr_alloc(sopt, sopt_kern->optlen,
+					       ptr__uninit);
+		if (err >= 0)
+			err = copy_from_user(ptr__uninit->data,
+					     sopt_kern->optval,
+					     size);
+		return err;
+	}
+
+	bpf_dynptr_init(ptr__uninit, sopt_kern->optval,
+			BPF_DYNPTR_TYPE_CGROUP_SOCKOPT, 0,
+			size);
+	sopt_kern->flags |= BPF_SOCKOPT_FLAG_OPTVAL_DYNPTR;
+
+	return size;
+}
+
+/**
+ * int bpf_sockopt_dynptr_copy_to(struct bpf_sockopt *sopt,
+ *				  struct bpf_dynptr_kern *ptr)
+ *     Description
+ *             Copy data from *ptr* to *sopt->optval*.
+ *     Return
+ *             >= 0 on success, or a negative error in case of failure.
+ */
+__bpf_kfunc int bpf_sockopt_dynptr_copy_to(struct bpf_sockopt *sopt,
+					   struct bpf_dynptr_kern *ptr)
+{
+	__u32 size = bpf_dynptr_size(ptr);
+
+	struct bpf_sockopt_kern *sopt_kern = (struct bpf_sockopt_kern *)sopt;
+	int ret;
+
+	if (size > (sopt_kern->optval_end - sopt_kern->optval))
+		return -EINVAL;
+
+	if (sopt_kern->flags & BPF_SOCKOPT_FLAG_OPTVAL_USER) {
+		ret = copy_to_user(sopt_kern->optval, ptr->data,
+				   size);
+		if (unlikely(ret))
+			return -EFAULT;
+	} else {
+		/* Use memmove() in case of optval & ptr overlap. */
+		memmove(sopt_kern->optval, ptr->data, size);
+		ret = size;
+	}
+
+	return ret;
+}
+
 __diag_pop();
 
 BTF_SET8_START(generic_btf_ids)
@@ -2494,6 +2677,19 @@ static const struct btf_kfunc_id_set common_kfunc_set = {
 	.set   = &common_btf_ids,
 };
 
+BTF_SET8_START(cgroup_common_btf_ids)
+BTF_ID_FLAGS(func, bpf_sockopt_dynptr_copy_to, KF_SLEEPABLE)
+BTF_ID_FLAGS(func, bpf_sockopt_dynptr_alloc, KF_SLEEPABLE)
+BTF_ID_FLAGS(func, bpf_sockopt_dynptr_install, KF_SLEEPABLE)
+BTF_ID_FLAGS(func, bpf_sockopt_dynptr_release, KF_SLEEPABLE)
+BTF_ID_FLAGS(func, bpf_sockopt_dynptr_from, KF_SLEEPABLE)
+BTF_SET8_END(cgroup_common_btf_ids)
+
+static const struct btf_kfunc_id_set cgroup_kfunc_set = {
+	.owner	= THIS_MODULE,
+	.set	= &cgroup_common_btf_ids,
+};
+
 static int __init kfunc_init(void)
 {
 	int ret;
@@ -2513,6 +2709,7 @@ static int __init kfunc_init(void)
 	ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &generic_kfunc_set);
 	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &generic_kfunc_set);
 	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &generic_kfunc_set);
+	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_CGROUP_SOCKOPT, &cgroup_kfunc_set);
 	ret = ret ?: register_btf_id_dtor_kfuncs(generic_dtors,
 						  ARRAY_SIZE(generic_dtors),
 						  THIS_MODULE);
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 936a171ea976..83d65a6e1309 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -745,6 +745,8 @@ static const char *dynptr_type_str(enum bpf_dynptr_type type)
 		return "skb";
 	case BPF_DYNPTR_TYPE_XDP:
 		return "xdp";
+	case BPF_DYNPTR_TYPE_CGROUP_SOCKOPT:
+		return "cgroup_sockopt";
 	case BPF_DYNPTR_TYPE_INVALID:
 		return "<invalid>";
 	default:
@@ -826,6 +828,8 @@ static enum bpf_dynptr_type arg_to_dynptr_type(enum bpf_arg_type arg_type)
 		return BPF_DYNPTR_TYPE_SKB;
 	case DYNPTR_TYPE_XDP:
 		return BPF_DYNPTR_TYPE_XDP;
+	case DYNPTR_TYPE_CGROUP_SOCKOPT:
+		return BPF_DYNPTR_TYPE_CGROUP_SOCKOPT;
 	default:
 		return BPF_DYNPTR_TYPE_INVALID;
 	}
@@ -842,6 +846,8 @@ static enum bpf_type_flag get_dynptr_type_flag(enum bpf_dynptr_type type)
 		return DYNPTR_TYPE_SKB;
 	case BPF_DYNPTR_TYPE_XDP:
 		return DYNPTR_TYPE_XDP;
+	case BPF_DYNPTR_TYPE_CGROUP_SOCKOPT:
+		return DYNPTR_TYPE_CGROUP_SOCKOPT;
 	default:
 		return 0;
 	}
@@ -849,7 +855,8 @@ static enum bpf_type_flag get_dynptr_type_flag(enum bpf_dynptr_type type)
 
 static bool dynptr_type_refcounted(enum bpf_dynptr_type type)
 {
-	return type == BPF_DYNPTR_TYPE_RINGBUF;
+	return type == BPF_DYNPTR_TYPE_RINGBUF ||
+		type == BPF_DYNPTR_TYPE_CGROUP_SOCKOPT;
 }
 
 static void __mark_dynptr_reg(struct bpf_reg_state *reg,
@@ -10271,6 +10278,10 @@ enum special_kfunc_type {
 	KF_bpf_dynptr_slice,
 	KF_bpf_dynptr_slice_rdwr,
 	KF_bpf_dynptr_clone,
+	KF_bpf_sockopt_dynptr_alloc,
+	KF_bpf_sockopt_dynptr_install,
+	KF_bpf_sockopt_dynptr_release,
+	KF_bpf_sockopt_dynptr_from,
 };
 
 BTF_SET_START(special_kfunc_set)
@@ -10291,6 +10302,10 @@ BTF_ID(func, bpf_dynptr_from_xdp)
 BTF_ID(func, bpf_dynptr_slice)
 BTF_ID(func, bpf_dynptr_slice_rdwr)
 BTF_ID(func, bpf_dynptr_clone)
+BTF_ID(func, bpf_sockopt_dynptr_alloc)
+BTF_ID(func, bpf_sockopt_dynptr_install)
+BTF_ID(func, bpf_sockopt_dynptr_release)
+BTF_ID(func, bpf_sockopt_dynptr_from)
 BTF_SET_END(special_kfunc_set)
 
 BTF_ID_LIST(special_kfunc_list)
@@ -10313,6 +10328,10 @@ BTF_ID(func, bpf_dynptr_from_xdp)
 BTF_ID(func, bpf_dynptr_slice)
 BTF_ID(func, bpf_dynptr_slice_rdwr)
 BTF_ID(func, bpf_dynptr_clone)
+BTF_ID(func, bpf_sockopt_dynptr_alloc)
+BTF_ID(func, bpf_sockopt_dynptr_install)
+BTF_ID(func, bpf_sockopt_dynptr_release)
+BTF_ID(func, bpf_sockopt_dynptr_from)
 
 static bool is_kfunc_ret_null(struct bpf_kfunc_call_arg_meta *meta)
 {
@@ -10966,6 +10985,20 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
 			arg_type |= OBJ_RELEASE;
 			break;
 		case KF_ARG_PTR_TO_DYNPTR:
+			if (meta->func_id == special_kfunc_list[KF_bpf_sockopt_dynptr_install] ||
+			    meta->func_id == special_kfunc_list[KF_bpf_sockopt_dynptr_release]) {
+				int ref_obj_id = dynptr_ref_obj_id(env, reg);
+
+				if (ref_obj_id < 0) {
+					verbose(env, "R%d is not a valid dynptr\n", regno);
+					return -EINVAL;
+				}
+
+				/* Required by check_func_arg_reg_off() */
+				arg_type |= ARG_PTR_TO_DYNPTR | OBJ_RELEASE;
+				meta->release_regno = regno;
+			}
+			break;
 		case KF_ARG_PTR_TO_ITER:
 		case KF_ARG_PTR_TO_LIST_HEAD:
 		case KF_ARG_PTR_TO_LIST_NODE:
@@ -11053,6 +11086,10 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
 					verbose(env, "verifier internal error: missing ref obj id for parent of clone\n");
 					return -EFAULT;
 				}
+			} else if ((meta->func_id == special_kfunc_list[KF_bpf_sockopt_dynptr_alloc] ||
+				    meta->func_id == special_kfunc_list[KF_bpf_sockopt_dynptr_from]) &&
+				   (dynptr_arg_type & MEM_UNINIT)) {
+				dynptr_arg_type |= DYNPTR_TYPE_CGROUP_SOCKOPT;
 			}
 
 			ret = process_dynptr_func(env, regno, insn_idx, dynptr_arg_type, clone_ref_obj_id);
@@ -11361,7 +11398,13 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 	 * PTR_TO_BTF_ID in bpf_kfunc_arg_meta, do the release now.
 	 */
 	if (meta.release_regno) {
-		err = release_reference(env, regs[meta.release_regno].ref_obj_id);
+		verbose(env, "release refcounted PTR_TO_BTF_ID %s\n",
+			meta.func_name);
+		if (meta.func_id == special_kfunc_list[KF_bpf_sockopt_dynptr_install] ||
+		    meta.func_id == special_kfunc_list[KF_bpf_sockopt_dynptr_release])
+			err = unmark_stack_slots_dynptr(env, &regs[meta.release_regno]);
+		else
+			err = release_reference(env, regs[meta.release_regno].ref_obj_id);
 		if (err) {
 			verbose(env, "kfunc %s#%d reference has not been acquired before\n",
 				func_name, meta.func_id);
-- 
2.34.1



* [RFC bpf-next v3 5/5] selftests/bpf: Add test cases for sleepable BPF programs of the CGROUP_SOCKOPT type
  2023-08-15 17:47 [RFC bpf-next v3 0/5] Sleepable BPF programs on cgroup {get,set}sockopt thinker.li
                   ` (3 preceding siblings ...)
  2023-08-15 17:47 ` [RFC bpf-next v3 4/5] bpf: Add a new dynptr type for CGROUP_SOCKOPT thinker.li
@ 2023-08-15 17:47 ` thinker.li
  2023-08-15 20:57   ` Stanislav Fomichev
  4 siblings, 1 reply; 28+ messages in thread
From: thinker.li @ 2023-08-15 17:47 UTC (permalink / raw)
  To: bpf, ast, martin.lau, song, kernel-team, andrii, sdf,
	yonghong.song
  Cc: sinquersw, kuifeng, Kui-Feng Lee

From: Kui-Feng Lee <thinker.li@gmail.com>

Run the same tests against the sleepable programs as are run against
the non-sleepable ones.
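For context, the userspace side of these selftests exercises the plain
{get,set}sockopt() optval/optlen contract. A minimal sketch of that
contract, with no BPF program attached and SO_SNDBUF on an AF_UNIX socket
standing in for the options the tests actually use:

```c
#include <sys/socket.h>
#include <unistd.h>

/* Set SO_SNDBUF to 'bytes' and read back what the kernel reports.
 * Linux doubles the requested value for bookkeeping overhead, so the
 * returned value is at least the requested size. Returns -1 on failure.
 */
static int set_and_get_sndbuf(int bytes)
{
	int fd = socket(AF_UNIX, SOCK_STREAM, 0);
	int val = bytes;
	socklen_t optlen = sizeof(val);
	int ret = -1;

	if (fd < 0)
		return -1;
	if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &val, sizeof(val)) == 0 &&
	    getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &val, &optlen) == 0 &&
	    optlen == sizeof(val))
		ret = val;
	close(fd);
	return ret;
}
```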

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
---
 .../testing/selftests/bpf/bpf_experimental.h  |  36 +++
 tools/testing/selftests/bpf/bpf_kfuncs.h      |  41 +++
 .../selftests/bpf/prog_tests/sockopt_sk.c     | 112 +++++++-
 .../testing/selftests/bpf/progs/sockopt_sk.c  | 257 ++++++++++++++++++
 .../selftests/bpf/verifier/sleepable.c        |   2 +-
 5 files changed, 445 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/bpf/bpf_experimental.h b/tools/testing/selftests/bpf/bpf_experimental.h
index 209811b1993a..9b5dfefe65dc 100644
--- a/tools/testing/selftests/bpf/bpf_experimental.h
+++ b/tools/testing/selftests/bpf/bpf_experimental.h
@@ -131,4 +131,40 @@ extern int bpf_rbtree_add_impl(struct bpf_rb_root *root, struct bpf_rb_node *nod
  */
 extern struct bpf_rb_node *bpf_rbtree_first(struct bpf_rb_root *root) __ksym;
 
+/*
+ *     Description
+ *             Copy data from *ptr* to *sopt->optval*.
+ *     Return
+ *             >= 0 on success, or a negative error in case of failure.
+ */
+extern int bpf_sockopt_dynptr_copy_to(struct bpf_sockopt *sopt,
+				      struct bpf_dynptr *ptr) __ksym;
+
+/* Description
+ *	Allocate a buffer of 'size' bytes to be installed as the optval.
+ * Returns
+ *	> 0 on success, the size of the allocated buffer
+ *	-ENOMEM or -EINVAL on failure
+ */
+extern int bpf_sockopt_dynptr_alloc(struct bpf_sockopt *sopt, int size,
+				    struct bpf_dynptr *ptr__uninit) __ksym;
+
+/* Description
+ *	Install the buffer pointed to by 'ptr' as optval.
+ * Returns
+ *	0 on success
+ *	-EINVAL if replacement is not allowed or 'ptr' is invalid
+ */
+extern int bpf_sockopt_dynptr_install(struct bpf_sockopt *sopt,
+				      struct bpf_dynptr *ptr) __ksym;
+
+/* Description
+ *	Release the buffer allocated by bpf_sockopt_dynptr_alloc.
+ * Returns
+ *	0 on success
+ *	-EINVAL if the buffer was not allocated by bpf_sockopt_dynptr_alloc
+ */
+extern int bpf_sockopt_dynptr_release(struct bpf_sockopt *sopt,
+				      struct bpf_dynptr *ptr) __ksym;
+
 #endif
diff --git a/tools/testing/selftests/bpf/bpf_kfuncs.h b/tools/testing/selftests/bpf/bpf_kfuncs.h
index 642dda0e758a..772040225257 100644
--- a/tools/testing/selftests/bpf/bpf_kfuncs.h
+++ b/tools/testing/selftests/bpf/bpf_kfuncs.h
@@ -41,4 +41,45 @@ extern bool bpf_dynptr_is_rdonly(const struct bpf_dynptr *ptr) __ksym;
 extern __u32 bpf_dynptr_size(const struct bpf_dynptr *ptr) __ksym;
 extern int bpf_dynptr_clone(const struct bpf_dynptr *ptr, struct bpf_dynptr *clone__init) __ksym;
 
+extern int bpf_sockopt_dynptr_copy_to(struct bpf_sockopt *sopt,
+				      struct bpf_dynptr *ptr) __ksym;
+
+/* Description
+ *	Allocate a buffer of 'size' bytes to be installed as the optval.
+ * Returns
+ *	> 0 on success, the size of the allocated buffer
+ *	-ENOMEM or -EINVAL on failure
+ */
+extern int bpf_sockopt_dynptr_alloc(struct bpf_sockopt *sopt, int size,
+				    struct bpf_dynptr *ptr__uninit) __ksym;
+
+/* Description
+ *	Install the buffer pointed to by 'ptr' as optval.
+ * Returns
+ *	0 on success
+ *	-EINVAL if replacement is not allowed or 'ptr' is invalid
+ */
+extern int bpf_sockopt_dynptr_install(struct bpf_sockopt *sopt,
+				      struct bpf_dynptr *ptr) __ksym;
+
+/* Description
+ *	Release the buffer allocated by bpf_sockopt_dynptr_alloc.
+ * Returns
+ *	0 on success
+ *	-EINVAL if the buffer was not allocated by bpf_sockopt_dynptr_alloc
+ */
+extern int bpf_sockopt_dynptr_release(struct bpf_sockopt *sopt,
+				      struct bpf_dynptr *ptr) __ksym;
+
+/* Description
+ *	Initialize a dynptr to access the content of the optval passed
+ *	to {get,set}sockopt().
+ * Returns
+ *	> 0 on success, the size of the allocated buffer
+ *	-ENOMEM or -EINVAL on failure
+ */
+extern int bpf_sockopt_dynptr_from(struct bpf_sockopt *sopt,
+				   struct bpf_dynptr *ptr__uninit,
+				   unsigned int size) __ksym;
+
 #endif
diff --git a/tools/testing/selftests/bpf/prog_tests/sockopt_sk.c b/tools/testing/selftests/bpf/prog_tests/sockopt_sk.c
index 05d0e07da394..85255648747f 100644
--- a/tools/testing/selftests/bpf/prog_tests/sockopt_sk.c
+++ b/tools/testing/selftests/bpf/prog_tests/sockopt_sk.c
@@ -92,6 +92,7 @@ static int getsetsockopt(void)
 	}
 	if (buf.u8[0] != 0x01) {
 		log_err("Unexpected buf[0] 0x%02x != 0x01", buf.u8[0]);
+		log_err("optlen %d", optlen);
 		goto err;
 	}
 
@@ -220,7 +221,7 @@ static int getsetsockopt(void)
 	return -1;
 }
 
-static void run_test(int cgroup_fd)
+static void run_test_nonsleepable(int cgroup_fd)
 {
 	struct sockopt_sk *skel;
 
@@ -246,6 +247,106 @@ static void run_test(int cgroup_fd)
 	sockopt_sk__destroy(skel);
 }
 
+static void run_test_nonsleepable_mixed(int cgroup_fd)
+{
+	struct sockopt_sk *skel;
+
+	skel = sockopt_sk__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_load"))
+		goto cleanup;
+
+	skel->bss->page_size = getpagesize();
+	skel->bss->skip_sleepable = 1;
+
+	skel->links._setsockopt_s =
+		bpf_program__attach_cgroup(skel->progs._setsockopt_s, cgroup_fd);
+	if (!ASSERT_OK_PTR(skel->links._setsockopt_s, "setsockopt_link (sleepable)"))
+		goto cleanup;
+
+	skel->links._getsockopt_s =
+		bpf_program__attach_cgroup(skel->progs._getsockopt_s, cgroup_fd);
+	if (!ASSERT_OK_PTR(skel->links._getsockopt_s, "getsockopt_link (sleepable)"))
+		goto cleanup;
+
+	skel->links._setsockopt =
+		bpf_program__attach_cgroup(skel->progs._setsockopt, cgroup_fd);
+	if (!ASSERT_OK_PTR(skel->links._setsockopt, "setsockopt_link"))
+		goto cleanup;
+
+	skel->links._getsockopt =
+		bpf_program__attach_cgroup(skel->progs._getsockopt, cgroup_fd);
+	if (!ASSERT_OK_PTR(skel->links._getsockopt, "getsockopt_link"))
+		goto cleanup;
+
+	ASSERT_OK(getsetsockopt(), "getsetsockopt");
+
+cleanup:
+	sockopt_sk__destroy(skel);
+}
+
+static void run_test_sleepable(int cgroup_fd)
+{
+	struct sockopt_sk *skel;
+
+	skel = sockopt_sk__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_load"))
+		goto cleanup;
+
+	skel->bss->page_size = getpagesize();
+
+	skel->links._setsockopt_s =
+		bpf_program__attach_cgroup(skel->progs._setsockopt_s, cgroup_fd);
+	if (!ASSERT_OK_PTR(skel->links._setsockopt_s, "setsockopt_link"))
+		goto cleanup;
+
+	skel->links._getsockopt_s =
+		bpf_program__attach_cgroup(skel->progs._getsockopt_s, cgroup_fd);
+	if (!ASSERT_OK_PTR(skel->links._getsockopt_s, "getsockopt_link"))
+		goto cleanup;
+
+	ASSERT_OK(getsetsockopt(), "getsetsockopt");
+
+cleanup:
+	sockopt_sk__destroy(skel);
+}
+
+static void run_test_sleepable_mixed(int cgroup_fd)
+{
+	struct sockopt_sk *skel;
+
+	skel = sockopt_sk__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_load"))
+		goto cleanup;
+
+	skel->bss->page_size = getpagesize();
+	skel->bss->skip_nonsleepable = 1;
+
+	skel->links._setsockopt =
+		bpf_program__attach_cgroup(skel->progs._setsockopt, cgroup_fd);
+	if (!ASSERT_OK_PTR(skel->links._setsockopt, "setsockopt_link (nonsleepable)"))
+		goto cleanup;
+
+	skel->links._getsockopt =
+		bpf_program__attach_cgroup(skel->progs._getsockopt, cgroup_fd);
+	if (!ASSERT_OK_PTR(skel->links._getsockopt, "getsockopt_link (nonsleepable)"))
+		goto cleanup;
+
+	skel->links._setsockopt_s =
+		bpf_program__attach_cgroup(skel->progs._setsockopt_s, cgroup_fd);
+	if (!ASSERT_OK_PTR(skel->links._setsockopt_s, "setsockopt_link"))
+		goto cleanup;
+
+	skel->links._getsockopt_s =
+		bpf_program__attach_cgroup(skel->progs._getsockopt_s, cgroup_fd);
+	if (!ASSERT_OK_PTR(skel->links._getsockopt_s, "getsockopt_link"))
+		goto cleanup;
+
+	ASSERT_OK(getsetsockopt(), "getsetsockopt");
+
+cleanup:
+	sockopt_sk__destroy(skel);
+}
+
 void test_sockopt_sk(void)
 {
 	int cgroup_fd;
@@ -254,6 +355,13 @@ void test_sockopt_sk(void)
 	if (!ASSERT_GE(cgroup_fd, 0, "join_cgroup /sockopt_sk"))
 		return;
 
-	run_test(cgroup_fd);
+	if (test__start_subtest("nonsleepable"))
+		run_test_nonsleepable(cgroup_fd);
+	if (test__start_subtest("sleepable"))
+		run_test_sleepable(cgroup_fd);
+	if (test__start_subtest("nonsleepable_mixed"))
+		run_test_nonsleepable_mixed(cgroup_fd);
+	if (test__start_subtest("sleepable_mixed"))
+		run_test_sleepable_mixed(cgroup_fd);
 	close(cgroup_fd);
 }
diff --git a/tools/testing/selftests/bpf/progs/sockopt_sk.c b/tools/testing/selftests/bpf/progs/sockopt_sk.c
index cb990a7d3d45..efacd3b88c40 100644
--- a/tools/testing/selftests/bpf/progs/sockopt_sk.c
+++ b/tools/testing/selftests/bpf/progs/sockopt_sk.c
@@ -5,10 +5,16 @@
 #include <netinet/in.h>
 #include <bpf/bpf_helpers.h>
 
+typedef int bool;
+#include "bpf_kfuncs.h"
+
 char _license[] SEC("license") = "GPL";
 
 int page_size = 0; /* userspace should set it */
 
+int skip_sleepable = 0;
+int skip_nonsleepable = 0;
+
 #ifndef SOL_TCP
 #define SOL_TCP IPPROTO_TCP
 #endif
@@ -34,6 +40,9 @@ int _getsockopt(struct bpf_sockopt *ctx)
 	struct sockopt_sk *storage;
 	struct bpf_sock *sk;
 
+	if (skip_nonsleepable)
+		return 1;
+
 	/* Bypass AF_NETLINK. */
 	sk = ctx->sk;
 	if (sk && sk->family == AF_NETLINK)
@@ -136,6 +145,134 @@ int _getsockopt(struct bpf_sockopt *ctx)
 	return 1;
 }
 
+SEC("cgroup/getsockopt.s")
+int _getsockopt_s(struct bpf_sockopt *ctx)
+{
+	struct tcp_zerocopy_receive *zcvr;
+	struct bpf_dynptr optval_dynptr;
+	struct sockopt_sk *storage;
+	__u8 *optval, *optval_end;
+	struct bpf_sock *sk;
+	char buf[1];
+	__u64 addr;
+	int ret;
+
+	if (skip_sleepable)
+		return 1;
+
+	/* Bypass AF_NETLINK. */
+	sk = ctx->sk;
+	if (sk && sk->family == AF_NETLINK)
+		return 1;
+
+	optval = ctx->optval;
+	optval_end = ctx->optval_end;
+
+	/* Make sure bpf_get_netns_cookie is callable.
+	 */
+	if (bpf_get_netns_cookie(NULL) == 0)
+		return 0;
+
+	if (bpf_get_netns_cookie(ctx) == 0)
+		return 0;
+
+	if (ctx->level == SOL_IP && ctx->optname == IP_TOS) {
+		/* Not interested in SOL_IP:IP_TOS;
+		 * let next BPF program in the cgroup chain or kernel
+		 * handle it.
+		 */
+		return 1;
+	}
+
+	if (ctx->level == SOL_SOCKET && ctx->optname == SO_SNDBUF) {
+		/* Not interested in SOL_SOCKET:SO_SNDBUF;
+		 * let next BPF program in the cgroup chain or kernel
+		 * handle it.
+		 */
+		return 1;
+	}
+
+	if (ctx->level == SOL_TCP && ctx->optname == TCP_CONGESTION) {
+		/* Not interested in SOL_TCP:TCP_CONGESTION;
+		 * let next BPF program in the cgroup chain or kernel
+		 * handle it.
+		 */
+		return 1;
+	}
+
+	if (ctx->level == SOL_TCP && ctx->optname == TCP_ZEROCOPY_RECEIVE) {
+		/* Verify that TCP_ZEROCOPY_RECEIVE triggers.
+		 * It has a custom implementation for performance
+		 * reasons.
+		 */
+
+		bpf_sockopt_dynptr_from(ctx, &optval_dynptr, sizeof(*zcvr));
+		zcvr = bpf_dynptr_data(&optval_dynptr, 0, sizeof(*zcvr));
+		addr = zcvr ? zcvr->address : 0;
+		bpf_sockopt_dynptr_release(ctx, &optval_dynptr);
+
+		return addr != 0 ? 0 : 1;
+	}
+
+	if (ctx->level == SOL_IP && ctx->optname == IP_FREEBIND) {
+		if (optval + 1 > optval_end)
+			return 0; /* bounds check */
+
+		ctx->retval = 0; /* Reset system call return value to zero */
+
+		/* Always export 0x55 */
+		buf[0] = 0x55;
+		ret = bpf_sockopt_dynptr_alloc(ctx, 1, &optval_dynptr);
+		if (ret >= 0) {
+			bpf_dynptr_write(&optval_dynptr, 0, buf, 1, 0);
+			ret = bpf_sockopt_dynptr_copy_to(ctx, &optval_dynptr);
+		}
+		bpf_sockopt_dynptr_release(ctx, &optval_dynptr);
+		if (ret < 0)
+			return 0;
+		ctx->optlen = 1;
+
+		/* Userspace buffer is PAGE_SIZE * 2, but BPF
+		 * program can only see the first PAGE_SIZE
+		 * bytes of data.
+		 */
+		if (optval_end - optval != page_size && 0)
+			return 0; /* unexpected data size */
+
+		return 1;
+	}
+
+	if (ctx->level != SOL_CUSTOM)
+		return 0; /* deny everything except custom level */
+
+	if (optval + 1 > optval_end)
+		return 0; /* bounds check */
+
+	storage = bpf_sk_storage_get(&socket_storage_map, ctx->sk, 0,
+				     BPF_SK_STORAGE_GET_F_CREATE);
+	if (!storage)
+		return 0; /* couldn't get sk storage */
+
+	if (!ctx->retval)
+		return 0; /* kernel should not have handled
+			   * SOL_CUSTOM, something is wrong!
+			   */
+	ctx->retval = 0; /* Reset system call return value to zero */
+
+	buf[0] = storage->val;
+	ret = bpf_sockopt_dynptr_alloc(ctx, 1, &optval_dynptr);
+	if (ret >= 0) {
+		bpf_dynptr_write(&optval_dynptr, 0, buf, 1, 0);
+		ret = bpf_sockopt_dynptr_copy_to(ctx, &optval_dynptr);
+	}
+	bpf_sockopt_dynptr_release(ctx, &optval_dynptr);
+	if (ret < 0)
+		return 0;
+	ctx->optlen = 1;
+
+	return 1;
+}
+
 SEC("cgroup/setsockopt")
 int _setsockopt(struct bpf_sockopt *ctx)
 {
@@ -144,6 +281,9 @@ int _setsockopt(struct bpf_sockopt *ctx)
 	struct sockopt_sk *storage;
 	struct bpf_sock *sk;
 
+	if (skip_nonsleepable)
+		return 1;
+
 	/* Bypass AF_NETLINK. */
 	sk = ctx->sk;
 	if (sk && sk->family == AF_NETLINK)
@@ -236,3 +376,120 @@ int _setsockopt(struct bpf_sockopt *ctx)
 		ctx->optlen = 0;
 	return 1;
 }
+
+SEC("cgroup/setsockopt.s")
+int _setsockopt_s(struct bpf_sockopt *ctx)
+{
+	struct bpf_dynptr optval_buf;
+	struct sockopt_sk *storage;
+	__u8 *optval, *optval_end;
+	struct bpf_sock *sk;
+	__u8 tmp_u8;
+	__u32 tmp;
+	int ret;
+
+	if (skip_sleepable)
+		return 1;
+
+	optval = ctx->optval;
+	optval_end = ctx->optval_end;
+
+	/* Bypass AF_NETLINK. */
+	sk = ctx->sk;
+	if (sk && sk->family == AF_NETLINK)
+		return 1;
+
+	/* Make sure bpf_get_netns_cookie is callable.
+	 */
+	if (bpf_get_netns_cookie(NULL) == 0)
+		return 0;
+
+	if (bpf_get_netns_cookie(ctx) == 0)
+		return 0;
+
+	if (ctx->level == SOL_IP && ctx->optname == IP_TOS) {
+		/* Not interested in SOL_IP:IP_TOS;
+		 * let next BPF program in the cgroup chain or kernel
+		 * handle it.
+		 */
+		ctx->optlen = 0; /* bypass optval>PAGE_SIZE */
+		return 1;
+	}
+
+	if (ctx->level == SOL_SOCKET && ctx->optname == SO_SNDBUF) {
+		/* Overwrite SO_SNDBUF value */
+
+		ret = bpf_sockopt_dynptr_alloc(ctx, sizeof(__u32),
+					       &optval_buf);
+		if (ret < 0) {
+			bpf_sockopt_dynptr_release(ctx, &optval_buf);
+		} else {
+			tmp = 0x55AA;
+			bpf_dynptr_write(&optval_buf, 0, &tmp, sizeof(tmp), 0);
+			ret = bpf_sockopt_dynptr_install(ctx, &optval_buf);
+		}
+
+		return ret >= 0 ? 1 : 0;
+	}
+
+	if (ctx->level == SOL_TCP && ctx->optname == TCP_CONGESTION) {
+		/* Always use cubic */
+
+		ret = bpf_sockopt_dynptr_alloc(ctx, 5, &optval_buf);
+		if (ret < 0) {
+			bpf_sockopt_dynptr_release(ctx, &optval_buf);
+			return 0;
+		}
+		bpf_dynptr_write(&optval_buf, 0, "cubic", 5, 0);
+		ret = bpf_sockopt_dynptr_install(ctx, &optval_buf);
+		if (ret < 0)
+			return 0;
+		ctx->optlen = 5;
+
+		return 1;
+	}
+
+	if (ctx->level == SOL_IP && ctx->optname == IP_FREEBIND) {
+		/* Original optlen is larger than PAGE_SIZE. */
+		if (ctx->optlen != page_size * 2)
+			return 0; /* unexpected data size */
+
+		ret = bpf_sockopt_dynptr_alloc(ctx, 1, &optval_buf);
+		if (ret < 0) {
+			bpf_sockopt_dynptr_release(ctx, &optval_buf);
+			return 0;
+		}
+		tmp_u8 = 0;
+		bpf_dynptr_write(&optval_buf, 0, &tmp_u8, 1, 0);
+		ret = bpf_sockopt_dynptr_install(ctx, &optval_buf);
+		if (ret < 0)
+			return 0;
+		ctx->optlen = 1;
+
+		return 1;
+	}
+
+	if (ctx->level != SOL_CUSTOM)
+		return 0; /* deny everything except custom level */
+
+	if (optval + 1 > optval_end)
+		return 0; /* bounds check */
+
+	storage = bpf_sk_storage_get(&socket_storage_map, ctx->sk, 0,
+				     BPF_SK_STORAGE_GET_F_CREATE);
+	if (!storage)
+		return 0; /* couldn't get sk storage */
+
+	bpf_sockopt_dynptr_from(ctx, &optval_buf, sizeof(__u8));
+	optval = bpf_dynptr_data(&optval_buf, 0, sizeof(__u8));
+	if (optval) {
+		storage->val = *optval;
+		ctx->optlen = -1; /* BPF has consumed this option, don't call
+				   * kernel setsockopt handler.
+				   */
+	}
+	bpf_sockopt_dynptr_release(ctx, &optval_buf);
+
+	return optval ? 1 : 0;
+}
+
diff --git a/tools/testing/selftests/bpf/verifier/sleepable.c b/tools/testing/selftests/bpf/verifier/sleepable.c
index 1f0d2bdc673f..4b6c1117ec9f 100644
--- a/tools/testing/selftests/bpf/verifier/sleepable.c
+++ b/tools/testing/selftests/bpf/verifier/sleepable.c
@@ -85,7 +85,7 @@
 	.expected_attach_type = BPF_TRACE_RAW_TP,
 	.kfunc = "sched_switch",
 	.result = REJECT,
-	.errstr = "Only fentry/fexit/fmod_ret, lsm, iter, uprobe, and struct_ops programs can be sleepable",
+	.errstr = "Only fentry/fexit/fmod_ret, lsm, iter, uprobe, cgroup, and struct_ops programs can be sleepable",
 	.flags = BPF_F_SLEEPABLE,
 	.runs = -1,
 },
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [RFC bpf-next v3 5/5] selftests/bpf: Add test cases for sleepable BPF programs of the CGROUP_SOCKOPT type
  2023-08-15 17:47 ` [RFC bpf-next v3 5/5] selftests/bpf: Add test cases for sleepable BPF programs of the CGROUP_SOCKOPT type thinker.li
@ 2023-08-15 20:57   ` Stanislav Fomichev
  2023-08-15 23:37     ` Kui-Feng Lee
  0 siblings, 1 reply; 28+ messages in thread
From: Stanislav Fomichev @ 2023-08-15 20:57 UTC (permalink / raw)
  To: thinker.li
  Cc: bpf, ast, martin.lau, song, kernel-team, andrii, yonghong.song,
	sinquersw, kuifeng

On 08/15, thinker.li@gmail.com wrote:
> From: Kui-Feng Lee <thinker.li@gmail.com>
> 
> Do the same tests as the non-sleepable ones.
> 
> Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
> ---
>  .../testing/selftests/bpf/bpf_experimental.h  |  36 +++
>  tools/testing/selftests/bpf/bpf_kfuncs.h      |  41 +++
>  .../selftests/bpf/prog_tests/sockopt_sk.c     | 112 +++++++-
>  .../testing/selftests/bpf/progs/sockopt_sk.c  | 257 ++++++++++++++++++
>  .../selftests/bpf/verifier/sleepable.c        |   2 +-
>  5 files changed, 445 insertions(+), 3 deletions(-)
> 
> diff --git a/tools/testing/selftests/bpf/bpf_experimental.h b/tools/testing/selftests/bpf/bpf_experimental.h
> index 209811b1993a..9b5dfefe65dc 100644
> --- a/tools/testing/selftests/bpf/bpf_experimental.h
> +++ b/tools/testing/selftests/bpf/bpf_experimental.h
> @@ -131,4 +131,40 @@ extern int bpf_rbtree_add_impl(struct bpf_rb_root *root, struct bpf_rb_node *nod
>   */
>  extern struct bpf_rb_node *bpf_rbtree_first(struct bpf_rb_root *root) __ksym;
>  
> +/*
> + *     Description
> + *             Copy data from *ptr* to *sopt->optval*.
> + *     Return
> + *             >= 0 on success, or a negative error in case of failure.
> + */
> +extern int bpf_sockopt_dynptr_copy_to(struct bpf_sockopt *sopt,
> +				      struct bpf_dynptr *ptr) __ksym;
> +
> +/* Description
> + *	Allocate a buffer of 'size' bytes for being installed as optval.
> + * Returns
> + *	> 0 on success, the size of the allocated buffer
> + *	-ENOMEM or -EINVAL on failure
> + */
> +extern int bpf_sockopt_dynptr_alloc(struct bpf_sockopt *sopt, int size,
> +				    struct bpf_dynptr *ptr__uninit) __ksym;
> +
> +/* Description
> + *	Install the buffer pointed to by 'ptr' as optval.
> + * Returns
> + *	0 on success
> + *	-EINVAL if the buffer is too small
> + */
> +extern int bpf_sockopt_dynptr_install(struct bpf_sockopt *sopt,
> +				      struct bpf_dynptr *ptr) __ksym;
> +
> +/* Description
> + *	Release the buffer allocated by bpf_sockopt_dynptr_alloc.
> + * Returns
> + *	0 on success
> + *	-EINVAL if the buffer was not allocated by bpf_sockopt_dynptr_alloc
> + */
> +extern int bpf_sockopt_dynptr_release(struct bpf_sockopt *sopt,
> +				      struct bpf_dynptr *ptr) __ksym;
> +
>  #endif
> diff --git a/tools/testing/selftests/bpf/bpf_kfuncs.h b/tools/testing/selftests/bpf/bpf_kfuncs.h
> index 642dda0e758a..772040225257 100644
> --- a/tools/testing/selftests/bpf/bpf_kfuncs.h
> +++ b/tools/testing/selftests/bpf/bpf_kfuncs.h
> @@ -41,4 +41,45 @@ extern bool bpf_dynptr_is_rdonly(const struct bpf_dynptr *ptr) __ksym;
>  extern __u32 bpf_dynptr_size(const struct bpf_dynptr *ptr) __ksym;
>  extern int bpf_dynptr_clone(const struct bpf_dynptr *ptr, struct bpf_dynptr *clone__init) __ksym;
>  
> +extern int bpf_sockopt_dynptr_copy_to(struct bpf_sockopt *sopt,
> +				      struct bpf_dynptr *ptr) __ksym;
> +
> +/* Description
> + *	Allocate a buffer of 'size' bytes for being installed as optval.
> + * Returns
> + *	> 0 on success, the size of the allocated buffer
> + *	-ENOMEM or -EINVAL on failure
> + */
> +extern int bpf_sockopt_dynptr_alloc(struct bpf_sockopt *sopt, int size,
> +				    struct bpf_dynptr *ptr__uninit) __ksym;
> +
> +/* Description
> + *	Install the buffer pointed to by 'ptr' as optval.
> + * Returns
> + *	0 on success
> + *	-EINVAL if the buffer is too small
> + */
> +extern int bpf_sockopt_dynptr_install(struct bpf_sockopt *sopt,
> +				      struct bpf_dynptr *ptr) __ksym;
> +
> +/* Description
> + *	Release the buffer allocated by bpf_sockopt_dynptr_alloc.
> + * Returns
> + *	0 on success
> + *	-EINVAL if the buffer was not allocated by bpf_sockopt_dynptr_alloc
> + */
> +extern int bpf_sockopt_dynptr_release(struct bpf_sockopt *sopt,
> +				      struct bpf_dynptr *ptr) __ksym;
> +
> +/* Description
> + *	Initialize a dynptr to access the content of the optval passed
> + *      to {get,set}sockopt().
> + * Returns
> + *	> 0 on success, the size of the allocated buffer
> + *	-ENOMEM or -EINVAL on failure
> + */
> +extern int bpf_sockopt_dynptr_from(struct bpf_sockopt *sopt,
> +				   struct bpf_dynptr *ptr__uninit,
> +				   unsigned int size) __ksym;
> +
>  #endif
> diff --git a/tools/testing/selftests/bpf/prog_tests/sockopt_sk.c b/tools/testing/selftests/bpf/prog_tests/sockopt_sk.c
> index 05d0e07da394..85255648747f 100644
> --- a/tools/testing/selftests/bpf/prog_tests/sockopt_sk.c
> +++ b/tools/testing/selftests/bpf/prog_tests/sockopt_sk.c
> @@ -92,6 +92,7 @@ static int getsetsockopt(void)
>  	}
>  	if (buf.u8[0] != 0x01) {
>  		log_err("Unexpected buf[0] 0x%02x != 0x01", buf.u8[0]);
> +		log_err("optlen %d", optlen);
>  		goto err;
>  	}
>  
> @@ -220,7 +221,7 @@ static int getsetsockopt(void)
>  	return -1;
>  }
>  
> -static void run_test(int cgroup_fd)
> +static void run_test_nonsleepable(int cgroup_fd)
>  {
>  	struct sockopt_sk *skel;
>  
> @@ -246,6 +247,106 @@ static void run_test(int cgroup_fd)
>  	sockopt_sk__destroy(skel);
>  }
>  
> +static void run_test_nonsleepable_mixed(int cgroup_fd)
> +{
> +	struct sockopt_sk *skel;
> +
> +	skel = sockopt_sk__open_and_load();
> +	if (!ASSERT_OK_PTR(skel, "skel_load"))
> +		goto cleanup;
> +
> +	skel->bss->page_size = getpagesize();
> +	skel->bss->skip_sleepable = 1;
> +
> +	skel->links._setsockopt_s =
> +		bpf_program__attach_cgroup(skel->progs._setsockopt_s, cgroup_fd);
> +	if (!ASSERT_OK_PTR(skel->links._setsockopt_s, "setsockopt_link (sleepable)"))
> +		goto cleanup;
> +
> +	skel->links._getsockopt_s =
> +		bpf_program__attach_cgroup(skel->progs._getsockopt_s, cgroup_fd);
> +	if (!ASSERT_OK_PTR(skel->links._getsockopt_s, "getsockopt_link (sleepable)"))
> +		goto cleanup;
> +
> +	skel->links._setsockopt =
> +		bpf_program__attach_cgroup(skel->progs._setsockopt, cgroup_fd);
> +	if (!ASSERT_OK_PTR(skel->links._setsockopt, "setsockopt_link"))
> +		goto cleanup;
> +
> +	skel->links._getsockopt =
> +		bpf_program__attach_cgroup(skel->progs._getsockopt, cgroup_fd);
> +	if (!ASSERT_OK_PTR(skel->links._getsockopt, "getsockopt_link"))
> +		goto cleanup;
> +
> +	ASSERT_OK(getsetsockopt(), "getsetsockopt");
> +
> +cleanup:
> +	sockopt_sk__destroy(skel);
> +}
> +
> +static void run_test_sleepable(int cgroup_fd)
> +{
> +	struct sockopt_sk *skel;
> +
> +	skel = sockopt_sk__open_and_load();
> +	if (!ASSERT_OK_PTR(skel, "skel_load"))
> +		goto cleanup;
> +
> +	skel->bss->page_size = getpagesize();
> +
> +	skel->links._setsockopt_s =
> +		bpf_program__attach_cgroup(skel->progs._setsockopt_s, cgroup_fd);
> +	if (!ASSERT_OK_PTR(skel->links._setsockopt_s, "setsockopt_link"))
> +		goto cleanup;
> +
> +	skel->links._getsockopt_s =
> +		bpf_program__attach_cgroup(skel->progs._getsockopt_s, cgroup_fd);
> +	if (!ASSERT_OK_PTR(skel->links._getsockopt_s, "getsockopt_link"))
> +		goto cleanup;
> +
> +	ASSERT_OK(getsetsockopt(), "getsetsockopt");
> +
> +cleanup:
> +	sockopt_sk__destroy(skel);
> +}
> +
> +static void run_test_sleepable_mixed(int cgroup_fd)
> +{
> +	struct sockopt_sk *skel;
> +
> +	skel = sockopt_sk__open_and_load();
> +	if (!ASSERT_OK_PTR(skel, "skel_load"))
> +		goto cleanup;
> +
> +	skel->bss->page_size = getpagesize();
> +	skel->bss->skip_nonsleepable = 1;
> +
> +	skel->links._setsockopt =
> +		bpf_program__attach_cgroup(skel->progs._setsockopt, cgroup_fd);
> +	if (!ASSERT_OK_PTR(skel->links._setsockopt, "setsockopt_link (nonsleepable)"))
> +		goto cleanup;
> +
> +	skel->links._getsockopt =
> +		bpf_program__attach_cgroup(skel->progs._getsockopt, cgroup_fd);
> +	if (!ASSERT_OK_PTR(skel->links._getsockopt, "getsockopt_link (nonsleepable)"))
> +		goto cleanup;
> +
> +	skel->links._setsockopt_s =
> +		bpf_program__attach_cgroup(skel->progs._setsockopt_s, cgroup_fd);
> +	if (!ASSERT_OK_PTR(skel->links._setsockopt_s, "setsockopt_link"))
> +		goto cleanup;
> +
> +	skel->links._getsockopt_s =
> +		bpf_program__attach_cgroup(skel->progs._getsockopt_s, cgroup_fd);
> +	if (!ASSERT_OK_PTR(skel->links._getsockopt_s, "getsockopt_link"))
> +		goto cleanup;
> +
> +	ASSERT_OK(getsetsockopt(), "getsetsockopt");
> +
> +cleanup:
> +	sockopt_sk__destroy(skel);
> +}
> +
>  void test_sockopt_sk(void)
>  {
>  	int cgroup_fd;
> @@ -254,6 +355,13 @@ void test_sockopt_sk(void)
>  	if (!ASSERT_GE(cgroup_fd, 0, "join_cgroup /sockopt_sk"))
>  		return;
>  
> -	run_test(cgroup_fd);
> +	if (test__start_subtest("nonsleepable"))
> +		run_test_nonsleepable(cgroup_fd);
> +	if (test__start_subtest("sleepable"))
> +		run_test_sleepable(cgroup_fd);
> +	if (test__start_subtest("nonsleepable_mixed"))
> +		run_test_nonsleepable_mixed(cgroup_fd);
> +	if (test__start_subtest("sleepable_mixed"))
> +		run_test_sleepable_mixed(cgroup_fd);
>  	close(cgroup_fd);
>  }
> diff --git a/tools/testing/selftests/bpf/progs/sockopt_sk.c b/tools/testing/selftests/bpf/progs/sockopt_sk.c
> index cb990a7d3d45..efacd3b88c40 100644
> --- a/tools/testing/selftests/bpf/progs/sockopt_sk.c
> +++ b/tools/testing/selftests/bpf/progs/sockopt_sk.c
> @@ -5,10 +5,16 @@
>  #include <netinet/in.h>
>  #include <bpf/bpf_helpers.h>
>  
> +typedef int bool;
> +#include "bpf_kfuncs.h"
> +
>  char _license[] SEC("license") = "GPL";
>  
>  int page_size = 0; /* userspace should set it */
>  
> +int skip_sleepable = 0;
> +int skip_nonsleepable = 0;
> +
>  #ifndef SOL_TCP
>  #define SOL_TCP IPPROTO_TCP
>  #endif
> @@ -34,6 +40,9 @@ int _getsockopt(struct bpf_sockopt *ctx)
>  	struct sockopt_sk *storage;
>  	struct bpf_sock *sk;
>  
> +	if (skip_nonsleepable)
> +		return 1;
> +
>  	/* Bypass AF_NETLINK. */
>  	sk = ctx->sk;
>  	if (sk && sk->family == AF_NETLINK)
> @@ -136,6 +145,134 @@ int _getsockopt(struct bpf_sockopt *ctx)
>  	return 1;
>  }
>  
> +SEC("cgroup/getsockopt.s")
> +int _getsockopt_s(struct bpf_sockopt *ctx)
> +{
> +	struct tcp_zerocopy_receive *zcvr;
> +	struct bpf_dynptr optval_dynptr;
> +	struct sockopt_sk *storage;
> +	__u8 *optval, *optval_end;
> +	struct bpf_sock *sk;
> +	char buf[1];
> +	__u64 addr;
> +	int ret;
> +
> +	if (skip_sleepable)
> +		return 1;
> +
> +	/* Bypass AF_NETLINK. */
> +	sk = ctx->sk;
> +	if (sk && sk->family == AF_NETLINK)
> +		return 1;
> +
> +	optval = ctx->optval;
> +	optval_end = ctx->optval_end;
> +
> +	/* Make sure bpf_get_netns_cookie is callable.
> +	 */
> +	if (bpf_get_netns_cookie(NULL) == 0)
> +		return 0;
> +
> +	if (bpf_get_netns_cookie(ctx) == 0)
> +		return 0;
> +
> +	if (ctx->level == SOL_IP && ctx->optname == IP_TOS) {
> +		/* Not interested in SOL_IP:IP_TOS;
> +		 * let next BPF program in the cgroup chain or kernel
> +		 * handle it.
> +		 */
> +		return 1;
> +	}
> +
> +	if (ctx->level == SOL_SOCKET && ctx->optname == SO_SNDBUF) {
> +		/* Not interested in SOL_SOCKET:SO_SNDBUF;
> +		 * let next BPF program in the cgroup chain or kernel
> +		 * handle it.
> +		 */
> +		return 1;
> +	}
> +
> +	if (ctx->level == SOL_TCP && ctx->optname == TCP_CONGESTION) {
> +		/* Not interested in SOL_TCP:TCP_CONGESTION;
> +		 * let next BPF program in the cgroup chain or kernel
> +		 * handle it.
> +		 */
> +		return 1;
> +	}
> +
> +	if (ctx->level == SOL_TCP && ctx->optname == TCP_ZEROCOPY_RECEIVE) {
> +		/* Verify that TCP_ZEROCOPY_RECEIVE triggers.
> +		 * It has a custom implementation for performance
> +		 * reasons.
> +		 */
> +
> +		bpf_sockopt_dynptr_from(ctx, &optval_dynptr, sizeof(*zcvr));
> +		zcvr = bpf_dynptr_data(&optval_dynptr, 0, sizeof(*zcvr));
> +		addr = zcvr ? zcvr->address : 0;
> +		bpf_sockopt_dynptr_release(ctx, &optval_dynptr);

This starts to look more usable, thank you for the changes!
Let me poke the api a bit more, I'm not super familiar with the dynptrs.

here: bpf_sockopt_dynptr_from should probably be called
bpf_dynptr_from_sockopt to match bpf_dynptr_from_mem?

> +
> +		return addr != 0 ? 0 : 1;
> +	}
> +
> +	if (ctx->level == SOL_IP && ctx->optname == IP_FREEBIND) {
> +		if (optval + 1 > optval_end)
> +			return 0; /* bounds check */
> +
> +		ctx->retval = 0; /* Reset system call return value to zero */
> +
> +		/* Always export 0x55 */
> +		buf[0] = 0x55;
> +		ret = bpf_sockopt_dynptr_alloc(ctx, 1, &optval_dynptr);
> +		if (ret >= 0) {
> +			bpf_dynptr_write(&optval_dynptr, 0, buf, 1, 0);
> +			ret = bpf_sockopt_dynptr_copy_to(ctx, &optval_dynptr);
> +		}
> +		bpf_sockopt_dynptr_release(ctx, &optval_dynptr);

Does bpf_sockopt_dynptr_alloc and bpf_sockopt_dynptr_release need to be
sockopt specific? Seems like we should provide, instead, some generic
bpf_dynptr_alloc/bpf_dynptr_release and make
bpf_sockopt_dynptr_copy_to/install work with them? WDYT?

> +		if (ret < 0)
> +			return 0;
> +		ctx->optlen = 1;
> +
> +		/* Userspace buffer is PAGE_SIZE * 2, but BPF
> +		 * program can only see the first PAGE_SIZE
> +		 * bytes of data.
> +		 */
> +		if (optval_end - optval != page_size && 0)
> +			return 0; /* unexpected data size */
> +
> +		return 1;
> +	}
> +
> +	if (ctx->level != SOL_CUSTOM)
> +		return 0; /* deny everything except custom level */
> +
> +	if (optval + 1 > optval_end)
> +		return 0; /* bounds check */
> +
> +	storage = bpf_sk_storage_get(&socket_storage_map, ctx->sk, 0,
> +				     BPF_SK_STORAGE_GET_F_CREATE);
> +	if (!storage)
> +		return 0; /* couldn't get sk storage */
> +
> +	if (!ctx->retval)
> +		return 0; /* kernel should not have handled
> +			   * SOL_CUSTOM, something is wrong!
> +			   */
> +	ctx->retval = 0; /* Reset system call return value to zero */
> +
> +	buf[0] = storage->val;
> +	ret = bpf_sockopt_dynptr_alloc(ctx, 1, &optval_dynptr);
> +	if (ret >= 0) {
> +		bpf_dynptr_write(&optval_dynptr, 0, buf, 1, 0);
> +		ret = bpf_sockopt_dynptr_copy_to(ctx, &optval_dynptr);
> +	}
> +	bpf_sockopt_dynptr_release(ctx, &optval_dynptr);
> +	if (ret < 0)
> +		return 0;
> +	ctx->optlen = 1;
> +
> +	return 1;
> +}
> +
>  SEC("cgroup/setsockopt")
>  int _setsockopt(struct bpf_sockopt *ctx)
>  {
> @@ -144,6 +281,9 @@ int _setsockopt(struct bpf_sockopt *ctx)
>  	struct sockopt_sk *storage;
>  	struct bpf_sock *sk;
>  
> +	if (skip_nonsleepable)
> +		return 1;
> +
>  	/* Bypass AF_NETLINK. */
>  	sk = ctx->sk;
>  	if (sk && sk->family == AF_NETLINK)
> @@ -236,3 +376,120 @@ int _setsockopt(struct bpf_sockopt *ctx)
>  		ctx->optlen = 0;
>  	return 1;
>  }
> +
> +SEC("cgroup/setsockopt.s")
> +int _setsockopt_s(struct bpf_sockopt *ctx)
> +{
> +	struct bpf_dynptr optval_buf;
> +	struct sockopt_sk *storage;
> +	__u8 *optval, *optval_end;
> +	struct bpf_sock *sk;
> +	__u8 tmp_u8;
> +	__u32 tmp;
> +	int ret;
> +
> +	if (skip_sleepable)
> +		return 1;
> +
> +	optval = ctx->optval;
> +	optval_end = ctx->optval_end;
> +
> +	/* Bypass AF_NETLINK. */
> +	sk = ctx->sk;
> +	if (sk && sk->family == AF_NETLINK)
> +		return -1;
> +
> +	/* Make sure bpf_get_netns_cookie is callable.
> +	 */
> +	if (bpf_get_netns_cookie(NULL) == 0)
> +		return 0;
> +
> +	if (bpf_get_netns_cookie(ctx) == 0)
> +		return 0;
> +
> +	if (ctx->level == SOL_IP && ctx->optname == IP_TOS) {
> +		/* Not interested in SOL_IP:IP_TOS;
> +		 * let next BPF program in the cgroup chain or kernel
> +		 * handle it.
> +		 */
> +		ctx->optlen = 0; /* bypass optval>PAGE_SIZE */
> +		return 1;
> +	}
> +
> +	if (ctx->level == SOL_SOCKET && ctx->optname == SO_SNDBUF) {
> +		/* Overwrite SO_SNDBUF value */
> +
> +		ret = bpf_sockopt_dynptr_alloc(ctx, sizeof(__u32),
> +					       &optval_buf);
> +		if (ret < 0)
> +			bpf_sockopt_dynptr_release(ctx, &optval_buf);
> +		else {
> +			tmp = 0x55AA;
> +			bpf_dynptr_write(&optval_buf, 0, &tmp, sizeof(tmp), 0);
> +			ret = bpf_sockopt_dynptr_install(ctx, &optval_buf);

One thing I'm still slightly confused about is
bpf_sockopt_dynptr_install vs bpf_sockopt_dynptr_copy_to. I do
understand that it comes from getsockopt vs setsockopt (and the ability,
in setsockopt, to allocate larger buffers), but I wonder whether
we can hide everything under single bpf_sockopt_dynptr_copy_to call?

For getsockopt, it stays as is. For setsockopt, it would work like
_install currently does. Would that work? From user perspective,
if we provide a simple call that does the right thing, seems a bit
more usable? The only problem is probably the fact that _install
explicitly moves the ownership, but I don't see why copy_to can't
have the same "consume" semantics?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC bpf-next v3 1/5] bpf: enable sleepable BPF programs attached to cgroup/{get,set}sockopt.
  2023-08-15 17:47 ` [RFC bpf-next v3 1/5] bpf: enable sleepable BPF programs attached to cgroup/{get,set}sockopt thinker.li
@ 2023-08-15 20:58   ` Stanislav Fomichev
  2023-08-15 21:04     ` Kui-Feng Lee
  2023-08-16  0:42   ` kernel test robot
  1 sibling, 1 reply; 28+ messages in thread
From: Stanislav Fomichev @ 2023-08-15 20:58 UTC (permalink / raw)
  To: thinker.li
  Cc: bpf, ast, martin.lau, song, kernel-team, andrii, yonghong.song,
	sinquersw, kuifeng

On 08/15, thinker.li@gmail.com wrote:
> From: Kui-Feng Lee <thinker.li@gmail.com>
> 
> Enable sleepable cgroup/{get,set}sockopt hooks.
> 
> The sleepable BPF programs attached to cgroup/{get,set}sockopt hooks may
> receive a pointer to the optval in user space instead of a kernel
> copy. ctx->user_optval and ctx->user_optval_end are the pointers to the
> beginning and end of the user space buffer if receiving a user space
> buffer. ctx->optval and ctx->optval_end will be a kernel copy if receiving
> a kernel space buffer.
> 
> A program receives a user space buffer if ctx->flags &
> BPF_SOCKOPT_FLAG_OPTVAL_USER is true; otherwise, it receives a kernel space
> buffer.  BPF programs should not read/write from/to a user space buffer
> directly.  They should access the buffer through bpf_copy_from_user() and
> bpf_copy_to_user(), provided in the following patches.
> 
> Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
> ---
>  include/linux/bpf.h    |   6 ++
>  include/linux/filter.h |   6 ++
>  kernel/bpf/cgroup.c    | 207 ++++++++++++++++++++++++++++++++---------
>  kernel/bpf/verifier.c  |   5 +-
>  4 files changed, 177 insertions(+), 47 deletions(-)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index cfabbcf47bdb..edb35bcfa548 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1769,9 +1769,15 @@ struct bpf_prog_array_item {
>  
>  struct bpf_prog_array {
>  	struct rcu_head rcu;
> +	u32 flags;
>  	struct bpf_prog_array_item items[];
>  };
>  
> +enum bpf_prog_array_flags {
> +	BPF_PROG_ARRAY_F_SLEEPABLE = 1 << 0,
> +	BPF_PROG_ARRAY_F_NON_SLEEPABLE = 1 << 1,
> +};
> +
>  struct bpf_empty_prog_array {
>  	struct bpf_prog_array hdr;
>  	struct bpf_prog *null_prog;
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 761af6b3cf2b..2aa2a96526de 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -1337,12 +1337,18 @@ struct bpf_sockopt_kern {
>  	s32		level;
>  	s32		optname;
>  	s32		optlen;
> +	u32		flags;
>  	/* for retval in struct bpf_cg_run_ctx */
>  	struct task_struct *current_task;
>  	/* Temporary "register" for indirect stores to ppos. */
>  	u64		tmp_reg;
>  };
>  
> +enum bpf_sockopt_kern_flags {
> +	/* optval is a pointer to user space memory */
> +	BPF_SOCKOPT_FLAG_OPTVAL_USER    = (1U << 0),
> +};
> +
>  int copy_bpf_fprog_from_user(struct sock_fprog *dst, sockptr_t src, int len);
>  
>  struct bpf_sk_lookup_kern {
> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> index 5b2741aa0d9b..b977768a28e5 100644
> --- a/kernel/bpf/cgroup.c
> +++ b/kernel/bpf/cgroup.c
> @@ -28,25 +28,46 @@ EXPORT_SYMBOL(cgroup_bpf_enabled_key);
>   * function pointer.
>   */
>  static __always_inline int
> -bpf_prog_run_array_cg(const struct cgroup_bpf *cgrp,
> -		      enum cgroup_bpf_attach_type atype,
> -		      const void *ctx, bpf_prog_run_fn run_prog,
> -		      int retval, u32 *ret_flags)
> +bpf_prog_run_array_cg_cb(const struct cgroup_bpf *cgrp,
> +			 enum cgroup_bpf_attach_type atype,
> +			 const void *ctx, bpf_prog_run_fn run_prog,
> +			 int retval, u32 *ret_flags,
> +			 int (*progs_cb)(void *, const struct bpf_prog_array *),
> +			 void *progs_cb_arg)
>  {
>  	const struct bpf_prog_array_item *item;
>  	const struct bpf_prog *prog;
>  	const struct bpf_prog_array *array;
>  	struct bpf_run_ctx *old_run_ctx;
>  	struct bpf_cg_run_ctx run_ctx;
> +	bool do_sleepable;
>  	u32 func_ret;
> +	int err;
> +
> +	do_sleepable =
> +		atype == CGROUP_SETSOCKOPT || atype == CGROUP_GETSOCKOPT;
>  
>  	run_ctx.retval = retval;
>  	migrate_disable();
> -	rcu_read_lock();
> +	if (do_sleepable) {
> +		might_fault();
> +		rcu_read_lock_trace();
> +	} else
> +		rcu_read_lock();

nit: wrap 'else' branch with {} braces as well

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC bpf-next v3 1/5] bpf: enable sleepable BPF programs attached to cgroup/{get,set}sockopt.
  2023-08-15 20:58   ` Stanislav Fomichev
@ 2023-08-15 21:04     ` Kui-Feng Lee
  0 siblings, 0 replies; 28+ messages in thread
From: Kui-Feng Lee @ 2023-08-15 21:04 UTC (permalink / raw)
  To: Stanislav Fomichev, thinker.li
  Cc: bpf, ast, martin.lau, song, kernel-team, andrii, yonghong.song,
	kuifeng



On 8/15/23 13:58, Stanislav Fomichev wrote:
> On 08/15, thinker.li@gmail.com wrote:
>> From: Kui-Feng Lee <thinker.li@gmail.com>

>> +bpf_prog_run_array_cg_cb(const struct cgroup_bpf *cgrp,
>> +			 enum cgroup_bpf_attach_type atype,
>> +			 const void *ctx, bpf_prog_run_fn run_prog,
>> +			 int retval, u32 *ret_flags,
>> +			 int (*progs_cb)(void *, const struct bpf_prog_array *),
>> +			 void *progs_cb_arg)
>>   {
>>   	const struct bpf_prog_array_item *item;
>>   	const struct bpf_prog *prog;
>>   	const struct bpf_prog_array *array;
>>   	struct bpf_run_ctx *old_run_ctx;
>>   	struct bpf_cg_run_ctx run_ctx;
>> +	bool do_sleepable;
>>   	u32 func_ret;
>> +	int err;
>> +
>> +	do_sleepable =
>> +		atype == CGROUP_SETSOCKOPT || atype == CGROUP_GETSOCKOPT;
>>   
>>   	run_ctx.retval = retval;
>>   	migrate_disable();
>> -	rcu_read_lock();
>> +	if (do_sleepable) {
>> +		might_fault();
>> +		rcu_read_lock_trace();
>> +	} else
>> +		rcu_read_lock();
> 
> nit: wrap 'else' branch with {} braces as well

Got it!


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC bpf-next v3 5/5] selftests/bpf: Add test cases for sleepable BPF programs of the CGROUP_SOCKOPT type
  2023-08-15 20:57   ` Stanislav Fomichev
@ 2023-08-15 23:37     ` Kui-Feng Lee
  2023-08-16  0:03       ` Kui-Feng Lee
  0 siblings, 1 reply; 28+ messages in thread
From: Kui-Feng Lee @ 2023-08-15 23:37 UTC (permalink / raw)
  To: Stanislav Fomichev, thinker.li
  Cc: bpf, ast, martin.lau, song, kernel-team, andrii, yonghong.song,
	kuifeng



On 8/15/23 13:57, Stanislav Fomichev wrote:
> On 08/15, thinker.li@gmail.com wrote:
>> From: Kui-Feng Lee <thinker.li@gmail.com>
>>
>> Do the same tests as the non-sleepable ones.
>>
>> Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
>> ---
>>   .../testing/selftests/bpf/bpf_experimental.h  |  36 +++
>>   tools/testing/selftests/bpf/bpf_kfuncs.h      |  41 +++
>>   .../selftests/bpf/prog_tests/sockopt_sk.c     | 112 +++++++-
>>   .../testing/selftests/bpf/progs/sockopt_sk.c  | 257 ++++++++++++++++++
>>   .../selftests/bpf/verifier/sleepable.c        |   2 +-
>>   5 files changed, 445 insertions(+), 3 deletions(-)
>>
>> diff --git a/tools/testing/selftests/bpf/bpf_experimental.h b/tools/testing/selftests/bpf/bpf_experimental.h
>> index 209811b1993a..9b5dfefe65dc 100644
>> --- a/tools/testing/selftests/bpf/bpf_experimental.h
>> +++ b/tools/testing/selftests/bpf/bpf_experimental.h
>> @@ -131,4 +131,40 @@ extern int bpf_rbtree_add_impl(struct bpf_rb_root *root, struct bpf_rb_node *nod
>>    */
>>   extern struct bpf_rb_node *bpf_rbtree_first(struct bpf_rb_root *root) __ksym;
>>   
>> +/*
>> + *     Description
>> + *             Copy data from *ptr* to *sopt->optval*.
>> + *     Return
>> + *             >= 0 on success, or a negative error in case of failure.
>> + */
>> +extern int bpf_sockopt_dynptr_copy_to(struct bpf_sockopt *sopt,
>> +				      struct bpf_dynptr *ptr) __ksym;
>> +
>> +/* Description
>> + *	Allocate a buffer of 'size' bytes for being installed as optval.
>> + * Returns
>> + *	> 0 on success, the size of the allocated buffer
>> + *	-ENOMEM or -EINVAL on failure
>> + */
>> +extern int bpf_sockopt_dynptr_alloc(struct bpf_sockopt *sopt, int size,
>> +				    struct bpf_dynptr *ptr__uninit) __ksym;
>> +
>> +/* Description
>> + *	Install the buffer pointed to by 'ptr' as optval.
>> + * Returns
>> + *	0 on success
>> + *	-EINVAL if the buffer is too small
>> + */
>> +extern int bpf_sockopt_dynptr_install(struct bpf_sockopt *sopt,
>> +				      struct bpf_dynptr *ptr) __ksym;
>> +
>> +/* Description
>> + *	Release the buffer allocated by bpf_sockopt_dynptr_alloc.
>> + * Returns
>> + *	0 on success
>> + *	-EINVAL if the buffer was not allocated by bpf_sockopt_dynptr_alloc
>> + */
>> +extern int bpf_sockopt_dynptr_release(struct bpf_sockopt *sopt,
>> +				      struct bpf_dynptr *ptr) __ksym;
>> +
>>   #endif
>> diff --git a/tools/testing/selftests/bpf/bpf_kfuncs.h b/tools/testing/selftests/bpf/bpf_kfuncs.h
>> index 642dda0e758a..772040225257 100644
>> --- a/tools/testing/selftests/bpf/bpf_kfuncs.h
>> +++ b/tools/testing/selftests/bpf/bpf_kfuncs.h
>> @@ -41,4 +41,45 @@ extern bool bpf_dynptr_is_rdonly(const struct bpf_dynptr *ptr) __ksym;
>>   extern __u32 bpf_dynptr_size(const struct bpf_dynptr *ptr) __ksym;
>>   extern int bpf_dynptr_clone(const struct bpf_dynptr *ptr, struct bpf_dynptr *clone__init) __ksym;
>>   
>> +extern int bpf_sockopt_dynptr_copy_to(struct bpf_sockopt *sopt,
>> +				      struct bpf_dynptr *ptr) __ksym;
>> +
>> +/* Description
>> + *	Allocate a buffer of 'size' bytes for being installed as optval.
>> + * Returns
>> + *	> 0 on success, the size of the allocated buffer
>> + *	-ENOMEM or -EINVAL on failure
>> + */
>> +extern int bpf_sockopt_dynptr_alloc(struct bpf_sockopt *sopt, int size,
>> +				    struct bpf_dynptr *ptr__uninit) __ksym;
>> +
>> +/* Description
>> + *	Install the buffer pointed to by 'ptr' as optval.
>> + * Returns
>> + *	0 on success
>> + *	-EINVAL if the buffer is too small
>> + */
>> +extern int bpf_sockopt_dynptr_install(struct bpf_sockopt *sopt,
>> +				      struct bpf_dynptr *ptr) __ksym;
>> +
>> +/* Description
>> + *	Release the buffer allocated by bpf_sockopt_dynptr_alloc.
>> + * Returns
>> + *	0 on success
>> + *	-EINVAL if the buffer was not allocated by bpf_sockopt_dynptr_alloc
>> + */
>> +extern int bpf_sockopt_dynptr_release(struct bpf_sockopt *sopt,
>> +				      struct bpf_dynptr *ptr) __ksym;
>> +
>> +/* Description
>> + *	Initialize a dynptr to access the content of the optval passed
>> + *      to {get,set}sockopt().
>> + * Returns
>> + *	> 0 on success, the size of the allocated buffer
>> + *	-ENOMEM or -EINVAL on failure
>> + */
>> +extern int bpf_sockopt_dynptr_from(struct bpf_sockopt *sopt,
>> +				   struct bpf_dynptr *ptr__uninit,
>> +				   unsigned int size) __ksym;
>> +
>>   #endif
>> diff --git a/tools/testing/selftests/bpf/prog_tests/sockopt_sk.c b/tools/testing/selftests/bpf/prog_tests/sockopt_sk.c
>> index 05d0e07da394..85255648747f 100644
>> --- a/tools/testing/selftests/bpf/prog_tests/sockopt_sk.c
>> +++ b/tools/testing/selftests/bpf/prog_tests/sockopt_sk.c
>> @@ -92,6 +92,7 @@ static int getsetsockopt(void)
>>   	}
>>   	if (buf.u8[0] != 0x01) {
>>   		log_err("Unexpected buf[0] 0x%02x != 0x01", buf.u8[0]);
>> +		log_err("optlen %d", optlen);
>>   		goto err;
>>   	}
>>   
>> @@ -220,7 +221,7 @@ static int getsetsockopt(void)
>>   	return -1;
>>   }
>>   
>> -static void run_test(int cgroup_fd)
>> +static void run_test_nonsleepable(int cgroup_fd)
>>   {
>>   	struct sockopt_sk *skel;
>>   
>> @@ -246,6 +247,106 @@ static void run_test(int cgroup_fd)
>>   	sockopt_sk__destroy(skel);
>>   }
>>   
>> +static void run_test_nonsleepable_mixed(int cgroup_fd)
>> +{
>> +	struct sockopt_sk *skel;
>> +
>> +	skel = sockopt_sk__open_and_load();
>> +	if (!ASSERT_OK_PTR(skel, "skel_load"))
>> +		goto cleanup;
>> +
>> +	skel->bss->page_size = getpagesize();
>> +	skel->bss->skip_sleepable = 1;
>> +
>> +	skel->links._setsockopt_s =
>> +		bpf_program__attach_cgroup(skel->progs._setsockopt_s, cgroup_fd);
>> +	if (!ASSERT_OK_PTR(skel->links._setsockopt_s, "setsockopt_link (sleepable)"))
>> +		goto cleanup;
>> +
>> +	skel->links._getsockopt_s =
>> +		bpf_program__attach_cgroup(skel->progs._getsockopt_s, cgroup_fd);
>> +	if (!ASSERT_OK_PTR(skel->links._getsockopt_s, "getsockopt_link (sleepable)"))
>> +		goto cleanup;
>> +
>> +	skel->links._setsockopt =
>> +		bpf_program__attach_cgroup(skel->progs._setsockopt, cgroup_fd);
>> +	if (!ASSERT_OK_PTR(skel->links._setsockopt, "setsockopt_link"))
>> +		goto cleanup;
>> +
>> +	skel->links._getsockopt =
>> +		bpf_program__attach_cgroup(skel->progs._getsockopt, cgroup_fd);
>> +	if (!ASSERT_OK_PTR(skel->links._getsockopt, "getsockopt_link"))
>> +		goto cleanup;
>> +
>> +	ASSERT_OK(getsetsockopt(), "getsetsockopt");
>> +
>> +cleanup:
>> +	sockopt_sk__destroy(skel);
>> +}
>> +
>> +static void run_test_sleepable(int cgroup_fd)
>> +{
>> +	struct sockopt_sk *skel;
>> +
>> +	skel = sockopt_sk__open_and_load();
>> +	if (!ASSERT_OK_PTR(skel, "skel_load"))
>> +		goto cleanup;
>> +
>> +	skel->bss->page_size = getpagesize();
>> +
>> +	skel->links._setsockopt_s =
>> +		bpf_program__attach_cgroup(skel->progs._setsockopt_s, cgroup_fd);
>> +	if (!ASSERT_OK_PTR(skel->links._setsockopt_s, "setsockopt_link"))
>> +		goto cleanup;
>> +
>> +	skel->links._getsockopt_s =
>> +		bpf_program__attach_cgroup(skel->progs._getsockopt_s, cgroup_fd);
>> +	if (!ASSERT_OK_PTR(skel->links._getsockopt_s, "getsockopt_link"))
>> +		goto cleanup;
>> +
>> +	ASSERT_OK(getsetsockopt(), "getsetsockopt");
>> +
>> +cleanup:
>> +	sockopt_sk__destroy(skel);
>> +}
>> +
>> +static void run_test_sleepable_mixed(int cgroup_fd)
>> +{
>> +	struct sockopt_sk *skel;
>> +
>> +	skel = sockopt_sk__open_and_load();
>> +	if (!ASSERT_OK_PTR(skel, "skel_load"))
>> +		goto cleanup;
>> +
>> +	skel->bss->page_size = getpagesize();
>> +	skel->bss->skip_nonsleepable = 1;
>> +
>> +	skel->links._setsockopt =
>> +		bpf_program__attach_cgroup(skel->progs._setsockopt, cgroup_fd);
>> +	if (!ASSERT_OK_PTR(skel->links._setsockopt, "setsockopt_link (nonsleepable)"))
>> +		goto cleanup;
>> +
>> +	skel->links._getsockopt =
>> +		bpf_program__attach_cgroup(skel->progs._getsockopt, cgroup_fd);
>> +	if (!ASSERT_OK_PTR(skel->links._getsockopt, "getsockopt_link (nonsleepable)"))
>> +		goto cleanup;
>> +
>> +	skel->links._setsockopt_s =
>> +		bpf_program__attach_cgroup(skel->progs._setsockopt_s, cgroup_fd);
>> +	if (!ASSERT_OK_PTR(skel->links._setsockopt_s, "setsockopt_link"))
>> +		goto cleanup;
>> +
>> +	skel->links._getsockopt_s =
>> +		bpf_program__attach_cgroup(skel->progs._getsockopt_s, cgroup_fd);
>> +	if (!ASSERT_OK_PTR(skel->links._getsockopt_s, "getsockopt_link"))
>> +		goto cleanup;
>> +
>> +	ASSERT_OK(getsetsockopt(), "getsetsockopt");
>> +
>> +cleanup:
>> +	sockopt_sk__destroy(skel);
>> +}
>> +
>>   void test_sockopt_sk(void)
>>   {
>>   	int cgroup_fd;
>> @@ -254,6 +355,13 @@ void test_sockopt_sk(void)
>>   	if (!ASSERT_GE(cgroup_fd, 0, "join_cgroup /sockopt_sk"))
>>   		return;
>>   
>> -	run_test(cgroup_fd);
>> +	if (test__start_subtest("nonsleepable"))
>> +		run_test_nonsleepable(cgroup_fd);
>> +	if (test__start_subtest("sleepable"))
>> +		run_test_sleepable(cgroup_fd);
>> +	if (test__start_subtest("nonsleepable_mixed"))
>> +		run_test_nonsleepable_mixed(cgroup_fd);
>> +	if (test__start_subtest("sleepable_mixed"))
>> +		run_test_sleepable_mixed(cgroup_fd);
>>   	close(cgroup_fd);
>>   }
>> diff --git a/tools/testing/selftests/bpf/progs/sockopt_sk.c b/tools/testing/selftests/bpf/progs/sockopt_sk.c
>> index cb990a7d3d45..efacd3b88c40 100644
>> --- a/tools/testing/selftests/bpf/progs/sockopt_sk.c
>> +++ b/tools/testing/selftests/bpf/progs/sockopt_sk.c
>> @@ -5,10 +5,16 @@
>>   #include <netinet/in.h>
>>   #include <bpf/bpf_helpers.h>
>>   
>> +typedef int bool;
>> +#include "bpf_kfuncs.h"
>> +
>>   char _license[] SEC("license") = "GPL";
>>   
>>   int page_size = 0; /* userspace should set it */
>>   
>> +int skip_sleepable = 0;
>> +int skip_nonsleepable = 0;
>> +
>>   #ifndef SOL_TCP
>>   #define SOL_TCP IPPROTO_TCP
>>   #endif
>> @@ -34,6 +40,9 @@ int _getsockopt(struct bpf_sockopt *ctx)
>>   	struct sockopt_sk *storage;
>>   	struct bpf_sock *sk;
>>   
>> +	if (skip_nonsleepable)
>> +		return 1;
>> +
>>   	/* Bypass AF_NETLINK. */
>>   	sk = ctx->sk;
>>   	if (sk && sk->family == AF_NETLINK)
>> @@ -136,6 +145,134 @@ int _getsockopt(struct bpf_sockopt *ctx)
>>   	return 1;
>>   }
>>   
>> +SEC("cgroup/getsockopt.s")
>> +int _getsockopt_s(struct bpf_sockopt *ctx)
>> +{
>> +	struct tcp_zerocopy_receive *zcvr;
>> +	struct bpf_dynptr optval_dynptr;
>> +	struct sockopt_sk *storage;
>> +	__u8 *optval, *optval_end;
>> +	struct bpf_sock *sk;
>> +	char buf[1];
>> +	__u64 addr;
>> +	int ret;
>> +
>> +	if (skip_sleepable)
>> +		return 1;
>> +
>> +	/* Bypass AF_NETLINK. */
>> +	sk = ctx->sk;
>> +	if (sk && sk->family == AF_NETLINK)
>> +		return 1;
>> +
>> +	optval = ctx->optval;
>> +	optval_end = ctx->optval_end;
>> +
>> +	/* Make sure bpf_get_netns_cookie is callable.
>> +	 */
>> +	if (bpf_get_netns_cookie(NULL) == 0)
>> +		return 0;
>> +
>> +	if (bpf_get_netns_cookie(ctx) == 0)
>> +		return 0;
>> +
>> +	if (ctx->level == SOL_IP && ctx->optname == IP_TOS) {
>> +		/* Not interested in SOL_IP:IP_TOS;
>> +		 * let next BPF program in the cgroup chain or kernel
>> +		 * handle it.
>> +		 */
>> +		return 1;
>> +	}
>> +
>> +	if (ctx->level == SOL_SOCKET && ctx->optname == SO_SNDBUF) {
>> +		/* Not interested in SOL_SOCKET:SO_SNDBUF;
>> +		 * let next BPF program in the cgroup chain or kernel
>> +		 * handle it.
>> +		 */
>> +		return 1;
>> +	}
>> +
>> +	if (ctx->level == SOL_TCP && ctx->optname == TCP_CONGESTION) {
>> +		/* Not interested in SOL_TCP:TCP_CONGESTION;
>> +		 * let next BPF program in the cgroup chain or kernel
>> +		 * handle it.
>> +		 */
>> +		return 1;
>> +	}
>> +
>> +	if (ctx->level == SOL_TCP && ctx->optname == TCP_ZEROCOPY_RECEIVE) {
>> +		/* Verify that TCP_ZEROCOPY_RECEIVE triggers.
>> +		 * It has a custom implementation for performance
>> +		 * reasons.
>> +		 */
>> +
>> +		bpf_sockopt_dynptr_from(ctx, &optval_dynptr, sizeof(*zcvr));
>> +		zcvr = bpf_dynptr_data(&optval_dynptr, 0, sizeof(*zcvr));
>> +		addr = zcvr ? zcvr->address : 0;
>> +		bpf_sockopt_dynptr_release(ctx, &optval_dynptr);
> 
> This starts to look more usable, thank you for the changes!
> Let me poke the api a bit more, I'm not super familiar with the dynptrs.
> 
> here: bpf_sockopt_dynptr_from should probably be called
> bpf_dynptr_from_sockopt to match bpf_dynptr_from_mem?

agree!

> 
>> +
>> +		return addr != 0 ? 0 : 1;
>> +	}
>> +
>> +	if (ctx->level == SOL_IP && ctx->optname == IP_FREEBIND) {
>> +		if (optval + 1 > optval_end)
>> +			return 0; /* bounds check */
>> +
>> +		ctx->retval = 0; /* Reset system call return value to zero */
>> +
>> +		/* Always export 0x55 */
>> +		buf[0] = 0x55;
>> +		ret = bpf_sockopt_dynptr_alloc(ctx, 1, &optval_dynptr);
>> +		if (ret >= 0) {
>> +			bpf_dynptr_write(&optval_dynptr, 0, buf, 1, 0);
>> +			ret = bpf_sockopt_dynptr_copy_to(ctx, &optval_dynptr);
>> +		}
>> +		bpf_sockopt_dynptr_release(ctx, &optval_dynptr);
> 
> Does bpf_sockopt_dynptr_alloc and bpf_sockopt_dynptr_release need to be
> sockopt specific? Seems like we should provide, instead, some generic
> bpf_dynptr_alloc/bpf_dynptr_release and make
> bpf_sockopt_dynptr_copy_to/install work with them? WDYT?

I found that kmalloc cannot be called from NMI context or from the
slab or page allocation paths. That is why we don't have generic
functions like bpf_dynptr_alloc()/bpf_dynptr_release() yet. It means
someone needs to implement an allocator usable from BPF programs
first; then we can build bpf_dynptr_alloc() upon it. (There was an
implementation of bpf_dynptr_alloc() in early versions of the dynptr
patchset, but it was removed before landing.)


> 
>> +		if (ret < 0)
>> +			return 0;
>> +		ctx->optlen = 1;
>> +
>> +		/* Userspace buffer is PAGE_SIZE * 2, but BPF
>> +		 * program can only see the first PAGE_SIZE
>> +		 * bytes of data.
>> +		 */
>> +		if (optval_end - optval != page_size && 0)
>> +			return 0; /* unexpected data size */
>> +
>> +		return 1;
>> +	}
>> +
>> +	if (ctx->level != SOL_CUSTOM)
>> +		return 0; /* deny everything except custom level */
>> +
>> +	if (optval + 1 > optval_end)
>> +		return 0; /* bounds check */
>> +
>> +	storage = bpf_sk_storage_get(&socket_storage_map, ctx->sk, 0,
>> +				     BPF_SK_STORAGE_GET_F_CREATE);
>> +	if (!storage)
>> +		return 0; /* couldn't get sk storage */
>> +
>> +	if (!ctx->retval)
>> +		return 0; /* kernel should not have handled
>> +			   * SOL_CUSTOM, something is wrong!
>> +			   */
>> +	ctx->retval = 0; /* Reset system call return value to zero */
>> +
>> +	buf[0] = storage->val;
>> +	ret = bpf_sockopt_dynptr_alloc(ctx, 1, &optval_dynptr);
>> +	if (ret >= 0) {
>> +		bpf_dynptr_write(&optval_dynptr, 0, buf, 1, 0);
>> +		ret = bpf_sockopt_dynptr_copy_to(ctx, &optval_dynptr);
>> +	}
>> +	bpf_sockopt_dynptr_release(ctx, &optval_dynptr);
>> +	if (ret < 0)
>> +		return 0;
>> +	ctx->optlen = 1;
>> +
>> +	return 1;
>> +}
>> +
>>   SEC("cgroup/setsockopt")
>>   int _setsockopt(struct bpf_sockopt *ctx)
>>   {
>> @@ -144,6 +281,9 @@ int _setsockopt(struct bpf_sockopt *ctx)
>>   	struct sockopt_sk *storage;
>>   	struct bpf_sock *sk;
>>   
>> +	if (skip_nonsleepable)
>> +		return 1;
>> +
>>   	/* Bypass AF_NETLINK. */
>>   	sk = ctx->sk;
>>   	if (sk && sk->family == AF_NETLINK)
>> @@ -236,3 +376,120 @@ int _setsockopt(struct bpf_sockopt *ctx)
>>   		ctx->optlen = 0;
>>   	return 1;
>>   }
>> +
>> +SEC("cgroup/setsockopt.s")
>> +int _setsockopt_s(struct bpf_sockopt *ctx)
>> +{
>> +	struct bpf_dynptr optval_buf;
>> +	struct sockopt_sk *storage;
>> +	__u8 *optval, *optval_end;
>> +	struct bpf_sock *sk;
>> +	__u8 tmp_u8;
>> +	__u32 tmp;
>> +	int ret;
>> +
>> +	if (skip_sleepable)
>> +		return 1;
>> +
>> +	optval = ctx->optval;
>> +	optval_end = ctx->optval_end;
>> +
>> +	/* Bypass AF_NETLINK. */
>> +	sk = ctx->sk;
>> +	if (sk && sk->family == AF_NETLINK)
>> +		return -1;
>> +
>> +	/* Make sure bpf_get_netns_cookie is callable.
>> +	 */
>> +	if (bpf_get_netns_cookie(NULL) == 0)
>> +		return 0;
>> +
>> +	if (bpf_get_netns_cookie(ctx) == 0)
>> +		return 0;
>> +
>> +	if (ctx->level == SOL_IP && ctx->optname == IP_TOS) {
>> +		/* Not interested in SOL_IP:IP_TOS;
>> +		 * let next BPF program in the cgroup chain or kernel
>> +		 * handle it.
>> +		 */
>> +		ctx->optlen = 0; /* bypass optval>PAGE_SIZE */
>> +		return 1;
>> +	}
>> +
>> +	if (ctx->level == SOL_SOCKET && ctx->optname == SO_SNDBUF) {
>> +		/* Overwrite SO_SNDBUF value */
>> +
>> +		ret = bpf_sockopt_dynptr_alloc(ctx, sizeof(__u32),
>> +					       &optval_buf);
>> +		if (ret < 0)
>> +			bpf_sockopt_dynptr_release(ctx, &optval_buf);
>> +		else {
>> +			tmp = 0x55AA;
>> +			bpf_dynptr_write(&optval_buf, 0, &tmp, sizeof(tmp), 0);
>> +			ret = bpf_sockopt_dynptr_install(ctx, &optval_buf);
> 
> One thing I'm still slightly confused about is
> bpf_sockopt_dynptr_install vs bpf_sockopt_dynptr_copy_to. I do
> understand that it comes from getsockopt vs setsockopt (and the ability,
> in setsockopt, to allocate larger buffers), but I wonder whether
> we can hide everything under single bpf_sockopt_dynptr_copy_to call?
> 
> For getsockopt, it stays as is. For setsockopt, it would work like
> _install currently does. Would that work? From user perspective,
> if we provide a simple call that does the right thing, seems a bit
> more usable? The only problem is probably the fact that _install
> explicitly moves the ownership, but I don't see why copy_to can't
> have the same "consume" semantics?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC bpf-next v3 5/5] selftests/bpf: Add test cases for sleepable BPF programs of the CGROUP_SOCKOPT type
  2023-08-15 23:37     ` Kui-Feng Lee
@ 2023-08-16  0:03       ` Kui-Feng Lee
  2023-08-17  1:13         ` Martin KaFai Lau
  0 siblings, 1 reply; 28+ messages in thread
From: Kui-Feng Lee @ 2023-08-16  0:03 UTC (permalink / raw)
  To: Stanislav Fomichev, thinker.li
  Cc: bpf, ast, martin.lau, song, kernel-team, andrii, yonghong.song,
	kuifeng



On 8/15/23 16:37, Kui-Feng Lee wrote:
> 
> 
> On 8/15/23 13:57, Stanislav Fomichev wrote:
>> On 08/15, thinker.li@gmail.com wrote:
>>> From: Kui-Feng Lee <thinker.li@gmail.com>
>>>
>>> Do the same test as non-sleepable ones.
>>>
>>> Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
>>> ---
>>>   .../testing/selftests/bpf/bpf_experimental.h  |  36 +++
>>>   tools/testing/selftests/bpf/bpf_kfuncs.h      |  41 +++
>>>   .../selftests/bpf/prog_tests/sockopt_sk.c     | 112 +++++++-
>>>   .../testing/selftests/bpf/progs/sockopt_sk.c  | 257 ++++++++++++++++++
>>>   .../selftests/bpf/verifier/sleepable.c        |   2 +-
>>>   5 files changed, 445 insertions(+), 3 deletions(-)
>>>
>>> [...]
>>
>> One thing I'm still slightly confused about is
>> bpf_sockopt_dynptr_install vs bpf_sockopt_dynptr_copy_to. I do
>> understand that it comes from getsockopt vs setsockopt (and the ability,
>> in setsockopt, to allocate larger buffers), but I wonder whether
>> we can hide everything under single bpf_sockopt_dynptr_copy_to call?
>>
>> For getsockopt, it stays as is. For setsockopt, it would work like
>> _install currently does. Would that work? From user perspective,
>> if we provide a simple call that does the right thing, seems a bit
>> more usable? The only problem is probably the fact that _install
>> explicitly moves the ownership, but I don't see why copy_to can't
>> have the same "consume" semantics?

Sorry for missing this part!
This overloading is counterintuitive to me.
*_copy_to() would then no longer copy/overwrite the buffer but replace
it instead, and its side effects would change depending on the context.
I would prefer a different name instead of reusing *_copy_to().

If we want to merge these two functions, we probably need a better
name than copy_to.




^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC bpf-next v3 1/5] bpf: enable sleepable BPF programs attached to cgroup/{get,set}sockopt.
  2023-08-15 17:47 ` [RFC bpf-next v3 1/5] bpf: enable sleepable BPF programs attached to cgroup/{get,set}sockopt thinker.li
  2023-08-15 20:58   ` Stanislav Fomichev
@ 2023-08-16  0:42   ` kernel test robot
  1 sibling, 0 replies; 28+ messages in thread
From: kernel test robot @ 2023-08-16  0:42 UTC (permalink / raw)
  To: thinker.li; +Cc: oe-kbuild-all

Hi,

[This is a private test report for your RFC patch.]
kernel test robot noticed the following build warnings:

[auto build test WARNING on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/thinker-li-gmail-com/bpf-enable-sleepable-BPF-programs-attached-to-cgroup-get-set-sockopt/20230816-014936
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link:    https://lore.kernel.org/r/20230815174712.660956-2-thinker.li%40gmail.com
patch subject: [RFC bpf-next v3 1/5] bpf: enable sleepable BPF programs attached to cgroup/{get,set}sockopt.
config: i386-randconfig-i061-20230816 (https://download.01.org/0day-ci/archive/20230816/202308160854.eioAO9dg-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce: (https://download.01.org/0day-ci/archive/20230816/202308160854.eioAO9dg-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202308160854.eioAO9dg-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
>> kernel/bpf/cgroup.c:1868:41: sparse: sparse: incorrect type in argument 2 (different address spaces) @@     expected void const [noderef] __user *from @@     got char *optval @@
   kernel/bpf/cgroup.c:1868:41: sparse:     expected void const [noderef] __user *from
   kernel/bpf/cgroup.c:1868:41: sparse:     got char *optval
>> kernel/bpf/cgroup.c:1894:20: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected unsigned char [usertype] *[addressable] [assigned] optval @@     got char [noderef] __user *optval @@
   kernel/bpf/cgroup.c:1894:20: sparse:     expected unsigned char [usertype] *[addressable] [assigned] optval
   kernel/bpf/cgroup.c:1894:20: sparse:     got char [noderef] __user *optval
>> kernel/bpf/cgroup.c:1895:24: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected unsigned char [usertype] *[addressable] [assigned] optval_end @@     got char [noderef] __user * @@
   kernel/bpf/cgroup.c:1895:24: sparse:     expected unsigned char [usertype] *[addressable] [assigned] optval_end
   kernel/bpf/cgroup.c:1895:24: sparse:     got char [noderef] __user *
>> kernel/bpf/cgroup.c:1980:41: sparse: sparse: incorrect type in argument 2 (different address spaces) @@     expected void const [noderef] __user *from @@     got char *[assigned] optval @@
   kernel/bpf/cgroup.c:1980:41: sparse:     expected void const [noderef] __user *from
   kernel/bpf/cgroup.c:1980:41: sparse:     got char *[assigned] optval
   kernel/bpf/cgroup.c:2033:20: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected unsigned char [usertype] *[addressable] [assigned] optval @@     got char [noderef] __user *optval @@
   kernel/bpf/cgroup.c:2033:20: sparse:     expected unsigned char [usertype] *[addressable] [assigned] optval
   kernel/bpf/cgroup.c:2033:20: sparse:     got char [noderef] __user *optval
   kernel/bpf/cgroup.c:2034:24: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected unsigned char [usertype] *[addressable] [assigned] optval_end @@     got char [noderef] __user * @@
   kernel/bpf/cgroup.c:2034:24: sparse:     expected unsigned char [usertype] *[addressable] [assigned] optval_end
   kernel/bpf/cgroup.c:2034:24: sparse:     got char [noderef] __user *
   kernel/bpf/cgroup.c: note: in included file (through include/linux/rculist.h, include/linux/pid.h, include/linux/sched.h, ...):
   include/linux/rcupdate.h:780:9: sparse: sparse: context imbalance in '__cgroup_bpf_run_lsm_sock' - unexpected unlock
   include/linux/rcupdate.h:780:9: sparse: sparse: context imbalance in '__cgroup_bpf_run_lsm_socket' - unexpected unlock
   include/linux/rcupdate.h:780:9: sparse: sparse: context imbalance in '__cgroup_bpf_run_lsm_current' - unexpected unlock
   include/linux/rcupdate.h:780:9: sparse: sparse: context imbalance in '__cgroup_bpf_run_filter_skb' - unexpected unlock
   include/linux/rcupdate.h:780:9: sparse: sparse: context imbalance in '__cgroup_bpf_run_filter_sk' - unexpected unlock
   include/linux/rcupdate.h:780:9: sparse: sparse: context imbalance in '__cgroup_bpf_run_filter_sock_addr' - unexpected unlock
   include/linux/rcupdate.h:780:9: sparse: sparse: context imbalance in '__cgroup_bpf_run_filter_sock_ops' - unexpected unlock
   kernel/bpf/cgroup.c:67:24: sparse: sparse: context imbalance in '__cgroup_bpf_check_dev_permission' - different lock contexts for basic block
   kernel/bpf/cgroup.c:67:24: sparse: sparse: context imbalance in '__cgroup_bpf_run_filter_sysctl' - different lock contexts for basic block
   include/linux/rcupdate.h:780:9: sparse: sparse: context imbalance in '__cgroup_bpf_run_filter_setsockopt' - unexpected unlock
   include/linux/rcupdate.h:780:9: sparse: sparse: context imbalance in '__cgroup_bpf_run_filter_getsockopt' - unexpected unlock
   include/linux/rcupdate.h:780:9: sparse: sparse: context imbalance in '__cgroup_bpf_run_filter_getsockopt_kern' - unexpected unlock

vim +1868 kernel/bpf/cgroup.c

  1841	
  1842	static int filter_setsockopt_progs_cb(void *arg,
  1843					      const struct bpf_prog_array *progs)
  1844	{
  1845		struct filter_sockopt_cb_args *cb_args = arg;
  1846		struct bpf_sockopt_kern *ctx = cb_args->ctx;
  1847		char *optval = ctx->optval;
  1848		int max_optlen;
  1849	
  1850		if (!(progs->flags & BPF_PROG_ARRAY_F_NON_SLEEPABLE))
  1851			return 0;
  1852	
  1853		/* Allocate a bit more than the initial user buffer for
  1854		 * BPF program. The canonical use case is overriding
  1855		 * TCP_CONGESTION(nv) to TCP_CONGESTION(cubic).
  1856		 */
  1857		max_optlen = max_t(int, 16, ctx->optlen);
  1858		/* We need to force allocating from the heap if there are sleepable
  1859		 * programs, since they may create dynptrs from ctx->optval. Without
  1860		 * this flag, such dynptrs would try to free a buffer that actually
  1861		 * lives on the stack.
  1862		 */
  1863		max_optlen = sockopt_alloc_buf(ctx, max_optlen, cb_args->buf,
  1864					       progs->flags & BPF_PROG_ARRAY_F_SLEEPABLE);
  1865		if (max_optlen < 0)
  1866			return max_optlen;
  1867	
> 1868		if (copy_from_user(ctx->optval, optval,
  1869				   min(ctx->optlen, max_optlen)) != 0)
  1870			return -EFAULT;
  1871	
  1872		return 0;
  1873	}
  1874	
  1875	int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int *level,
  1876					       int *optname, char __user *optval,
  1877					       int *optlen, char **kernel_optval)
  1878	{
  1879		struct cgroup *cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
  1880		struct bpf_sockopt_buf buf = {};
  1881		struct bpf_sockopt_kern ctx = {
  1882			.sk = sk,
  1883			.level = *level,
  1884			.optname = *optname,
  1885		};
  1886		struct filter_sockopt_cb_args cb_args = {
  1887			.ctx = &ctx,
  1888			.buf = &buf,
  1889		};
  1890		int ret, max_optlen;
  1891	
  1892		max_optlen = *optlen;
  1893		ctx.optlen = *optlen;
> 1894		ctx.optval = optval;
> 1895		ctx.optval_end = optval + *optlen;
  1896		ctx.flags = BPF_SOCKOPT_FLAG_OPTVAL_USER;
  1897	
  1898		lock_sock(sk);
  1899		ret = bpf_prog_run_array_cg_cb(&cgrp->bpf, CGROUP_SETSOCKOPT,
  1900					       &ctx, bpf_prog_run, 0, NULL,
  1901					       filter_setsockopt_progs_cb, &cb_args);
  1902		release_sock(sk);
  1903	
  1904		if (ret)
  1905			goto out;
  1906	
  1907		if (ctx.optlen == -1) {
  1908			/* optlen set to -1, bypass kernel */
  1909			ret = 1;
  1910		} else if (ctx.optlen > (ctx.optval_end - ctx.optval) ||
  1911			   ctx.optlen < -1) {
  1912			/* optlen is out of bounds */
  1913			if (*optlen > PAGE_SIZE && ctx.optlen >= 0) {
  1914				pr_info_once("bpf setsockopt: ignoring program buffer with optlen=%d (max_optlen=%d)\n",
  1915					     ctx.optlen, max_optlen);
  1916				ret = 0;
  1917				goto out;
  1918			}
  1919			ret = -EFAULT;
  1920		} else {
  1921			/* optlen within bounds, run kernel handler */
  1922			ret = 0;
  1923	
  1924			/* export any potential modifications */
  1925			*level = ctx.level;
  1926			*optname = ctx.optname;
  1927	
  1928			/* optlen == 0 from BPF indicates that we should
  1929			 * use original userspace data.
  1930			 */
  1931			if (ctx.optlen != 0) {
  1932				*optlen = ctx.optlen;
  1933				if (ctx.flags & BPF_SOCKOPT_FLAG_OPTVAL_USER)
  1934					return 0;
  1935				/* We've used bpf_sockopt_kern->buf as an intermediary
  1936				 * storage, but the BPF program indicates that we need
  1937				 * to pass this data to the kernel setsockopt handler.
  1938				 * No way to export on-stack buf, have to allocate a
  1939				 * new buffer.
  1940				 */
  1941				if (!sockopt_buf_allocated(&ctx, &buf)) {
  1942					void *p = kmalloc(ctx.optlen, GFP_USER);
  1943	
  1944					if (!p) {
  1945						ret = -ENOMEM;
  1946						goto out;
  1947					}
  1948					memcpy(p, ctx.optval, ctx.optlen);
  1949					*kernel_optval = p;
  1950				} else {
  1951					*kernel_optval = ctx.optval;
  1952				}
  1953				/* export and don't free sockopt buf */
  1954				return 0;
  1955			}
  1956		}
  1957	
  1958	out:
  1959		sockopt_free_buf(&ctx, &buf);
  1960		return ret;
  1961	}
  1962	
  1963	static int filter_getsockopt_progs_cb(void *arg,
  1964					      const struct bpf_prog_array *progs)
  1965	{
  1966		struct filter_sockopt_cb_args *cb_args = arg;
  1967		struct bpf_sockopt_kern *ctx = cb_args->ctx;
  1968		int max_optlen;
  1969		char *optval;
  1970	
  1971		if (!(progs->flags & BPF_PROG_ARRAY_F_NON_SLEEPABLE))
  1972			return 0;
  1973	
  1974		optval = ctx->optval;
  1975		max_optlen = sockopt_alloc_buf(ctx, cb_args->max_optlen,
  1976					       cb_args->buf, false);
  1977		if (max_optlen < 0)
  1978			return max_optlen;
  1979	
> 1980		if (copy_from_user(ctx->optval, optval,
  1981				   min(ctx->optlen, max_optlen)) != 0)
  1982			return -EFAULT;
  1983	
  1984		ctx->flags = 0;
  1985		cb_args->max_optlen = max_optlen;
  1986	
  1987		return 0;
  1988	}
  1989	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC bpf-next v3 4/5] bpf: Add a new dynptr type for CGRUP_SOCKOPT.
  2023-08-15 17:47 ` [RFC bpf-next v3 4/5] bpf: Add a new dynptr type for CGRUP_SOCKOPT thinker.li
@ 2023-08-16 14:19   ` kernel test robot
  2023-08-17  1:25   ` Alexei Starovoitov
  1 sibling, 0 replies; 28+ messages in thread
From: kernel test robot @ 2023-08-16 14:19 UTC (permalink / raw)
  To: thinker.li; +Cc: oe-kbuild-all

Hi,

[This is a private test report for your RFC patch.]
kernel test robot noticed the following build warnings:

[auto build test WARNING on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/thinker-li-gmail-com/bpf-enable-sleepable-BPF-programs-attached-to-cgroup-get-set-sockopt/20230816-014936
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link:    https://lore.kernel.org/r/20230815174712.660956-5-thinker.li%40gmail.com
patch subject: [RFC bpf-next v3 4/5] bpf: Add a new dynptr type for CGRUP_SOCKOPT.
config: x86_64-buildonly-randconfig-r001-20230816 (https://download.01.org/0day-ci/archive/20230816/202308162201.NR58Gd8V-lkp@intel.com/config)
compiler: gcc-7 (Ubuntu 7.5.0-6ubuntu2) 7.5.0
reproduce: (https://download.01.org/0day-ci/archive/20230816/202308162201.NR58Gd8V-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202308162201.NR58Gd8V-lkp@intel.com/

All warnings (new ones prefixed by >>):

   kernel/bpf/helpers.c:1892:19: warning: no previous declaration for 'bpf_obj_new_impl' [-Wmissing-declarations]
    __bpf_kfunc void *bpf_obj_new_impl(u64 local_type_id__k, void *meta__ign)
                      ^~~~~~~~~~~~~~~~
   kernel/bpf/helpers.c:1922:18: warning: no previous declaration for 'bpf_obj_drop_impl' [-Wmissing-declarations]
    __bpf_kfunc void bpf_obj_drop_impl(void *p__alloc, void *meta__ign)
                     ^~~~~~~~~~~~~~~~~
   kernel/bpf/helpers.c:1930:19: warning: no previous declaration for 'bpf_refcount_acquire_impl' [-Wmissing-declarations]
    __bpf_kfunc void *bpf_refcount_acquire_impl(void *p__refcounted_kptr, void *meta__ign)
                      ^~~~~~~~~~~~~~~~~~~~~~~~~
   kernel/bpf/helpers.c:1975:17: warning: no previous declaration for 'bpf_list_push_front_impl' [-Wmissing-declarations]
    __bpf_kfunc int bpf_list_push_front_impl(struct bpf_list_head *head,
                    ^~~~~~~~~~~~~~~~~~~~~~~~
   kernel/bpf/helpers.c:1985:17: warning: no previous declaration for 'bpf_list_push_back_impl' [-Wmissing-declarations]
    __bpf_kfunc int bpf_list_push_back_impl(struct bpf_list_head *head,
                    ^~~~~~~~~~~~~~~~~~~~~~~
   kernel/bpf/helpers.c:2018:35: warning: no previous declaration for 'bpf_list_pop_front' [-Wmissing-declarations]
    __bpf_kfunc struct bpf_list_node *bpf_list_pop_front(struct bpf_list_head *head)
                                      ^~~~~~~~~~~~~~~~~~
   kernel/bpf/helpers.c:2023:35: warning: no previous declaration for 'bpf_list_pop_back' [-Wmissing-declarations]
    __bpf_kfunc struct bpf_list_node *bpf_list_pop_back(struct bpf_list_head *head)
                                      ^~~~~~~~~~~~~~~~~
   kernel/bpf/helpers.c:2028:33: warning: no previous declaration for 'bpf_rbtree_remove' [-Wmissing-declarations]
    __bpf_kfunc struct bpf_rb_node *bpf_rbtree_remove(struct bpf_rb_root *root,
                                    ^~~~~~~~~~~~~~~~~
   kernel/bpf/helpers.c:2084:17: warning: no previous declaration for 'bpf_rbtree_add_impl' [-Wmissing-declarations]
    __bpf_kfunc int bpf_rbtree_add_impl(struct bpf_rb_root *root, struct bpf_rb_node *node,
                    ^~~~~~~~~~~~~~~~~~~
   kernel/bpf/helpers.c:2094:33: warning: no previous declaration for 'bpf_rbtree_first' [-Wmissing-declarations]
    __bpf_kfunc struct bpf_rb_node *bpf_rbtree_first(struct bpf_rb_root *root)
                                    ^~~~~~~~~~~~~~~~
   kernel/bpf/helpers.c:2107:33: warning: no previous declaration for 'bpf_task_acquire' [-Wmissing-declarations]
    __bpf_kfunc struct task_struct *bpf_task_acquire(struct task_struct *p)
                                    ^~~~~~~~~~~~~~~~
   kernel/bpf/helpers.c:2118:18: warning: no previous declaration for 'bpf_task_release' [-Wmissing-declarations]
    __bpf_kfunc void bpf_task_release(struct task_struct *p)
                     ^~~~~~~~~~~~~~~~
   kernel/bpf/helpers.c:2207:33: warning: no previous declaration for 'bpf_task_from_pid' [-Wmissing-declarations]
    __bpf_kfunc struct task_struct *bpf_task_from_pid(s32 pid)
                                    ^~~~~~~~~~~~~~~~~
   kernel/bpf/helpers.c:2248:19: warning: no previous declaration for 'bpf_dynptr_slice' [-Wmissing-declarations]
    __bpf_kfunc void *bpf_dynptr_slice(const struct bpf_dynptr_kern *ptr, u32 offset,
                      ^~~~~~~~~~~~~~~~
   kernel/bpf/helpers.c:2333:19: warning: no previous declaration for 'bpf_dynptr_slice_rdwr' [-Wmissing-declarations]
    __bpf_kfunc void *bpf_dynptr_slice_rdwr(const struct bpf_dynptr_kern *ptr, u32 offset,
                      ^~~~~~~~~~~~~~~~~~~~~
   kernel/bpf/helpers.c:2364:17: warning: no previous declaration for 'bpf_dynptr_adjust' [-Wmissing-declarations]
    __bpf_kfunc int bpf_dynptr_adjust(struct bpf_dynptr_kern *ptr, u32 start, u32 end)
                    ^~~~~~~~~~~~~~~~~
   kernel/bpf/helpers.c:2382:18: warning: no previous declaration for 'bpf_dynptr_is_null' [-Wmissing-declarations]
    __bpf_kfunc bool bpf_dynptr_is_null(struct bpf_dynptr_kern *ptr)
                     ^~~~~~~~~~~~~~~~~~
   kernel/bpf/helpers.c:2387:18: warning: no previous declaration for 'bpf_dynptr_is_rdonly' [-Wmissing-declarations]
    __bpf_kfunc bool bpf_dynptr_is_rdonly(struct bpf_dynptr_kern *ptr)
                     ^~~~~~~~~~~~~~~~~~~~
   kernel/bpf/helpers.c:2395:19: warning: no previous declaration for 'bpf_dynptr_size' [-Wmissing-declarations]
    __bpf_kfunc __u32 bpf_dynptr_size(const struct bpf_dynptr_kern *ptr)
                      ^~~~~~~~~~~~~~~
   kernel/bpf/helpers.c:2403:17: warning: no previous declaration for 'bpf_dynptr_clone' [-Wmissing-declarations]
    __bpf_kfunc int bpf_dynptr_clone(struct bpf_dynptr_kern *ptr,
                    ^~~~~~~~~~~~~~~~
   kernel/bpf/helpers.c:2416:19: warning: no previous declaration for 'bpf_cast_to_kern_ctx' [-Wmissing-declarations]
    __bpf_kfunc void *bpf_cast_to_kern_ctx(void *obj)
                      ^~~~~~~~~~~~~~~~~~~~
   kernel/bpf/helpers.c:2421:19: warning: no previous declaration for 'bpf_rdonly_cast' [-Wmissing-declarations]
    __bpf_kfunc void *bpf_rdonly_cast(void *obj__ign, u32 btf_id__k)
                      ^~~~~~~~~~~~~~~
   kernel/bpf/helpers.c:2426:18: warning: no previous declaration for 'bpf_rcu_read_lock' [-Wmissing-declarations]
    __bpf_kfunc void bpf_rcu_read_lock(void)
                     ^~~~~~~~~~~~~~~~~
   kernel/bpf/helpers.c:2431:18: warning: no previous declaration for 'bpf_rcu_read_unlock' [-Wmissing-declarations]
    __bpf_kfunc void bpf_rcu_read_unlock(void)
                     ^~~~~~~~~~~~~~~~~~~
>> kernel/bpf/helpers.c:2441:17: warning: no previous declaration for 'bpf_sockopt_dynptr_alloc' [-Wmissing-declarations]
    __bpf_kfunc int bpf_sockopt_dynptr_alloc(struct bpf_sockopt *sopt, int size,
                    ^~~~~~~~~~~~~~~~~~~~~~~~
>> kernel/bpf/helpers.c:2469:17: warning: no previous declaration for 'bpf_sockopt_dynptr_install' [-Wmissing-declarations]
    __bpf_kfunc int bpf_sockopt_dynptr_install(struct bpf_sockopt *sopt,
                    ^~~~~~~~~~~~~~~~~~~~~~~~~~
>> kernel/bpf/helpers.c:2506:17: warning: no previous declaration for 'bpf_sockopt_dynptr_release' [-Wmissing-declarations]
    __bpf_kfunc int bpf_sockopt_dynptr_release(struct bpf_sockopt *sopt,
                    ^~~~~~~~~~~~~~~~~~~~~~~~~~
>> kernel/bpf/helpers.c:2545:17: warning: no previous declaration for 'bpf_sockopt_dynptr_from' [-Wmissing-declarations]
    __bpf_kfunc int bpf_sockopt_dynptr_from(struct bpf_sockopt *sopt,
                    ^~~~~~~~~~~~~~~~~~~~~~~
>> kernel/bpf/helpers.c:2590:17: warning: no previous declaration for 'bpf_sockopt_dynptr_copy_to' [-Wmissing-declarations]
    __bpf_kfunc int bpf_sockopt_dynptr_copy_to(struct bpf_sockopt *sopt,
                    ^~~~~~~~~~~~~~~~~~~~~~~~~~


vim +/bpf_sockopt_dynptr_alloc +2441 kernel/bpf/helpers.c

  2435	
  2436	/* Create a buffer of the given size for a {set,get}sockopt BPF filter.
  2437	 *
  2438	 * This kfunc is only available for sleepable contexts. The dynptr should be
  2439	 * released by bpf_sockopt_dynptr_install() or bpf_sockopt_dynptr_release().
  2440	 */
> 2441	__bpf_kfunc int bpf_sockopt_dynptr_alloc(struct bpf_sockopt *sopt, int size,
  2442						 struct bpf_dynptr_kern *ptr__uninit)
  2443	{
  2444		void *optval;
  2445		int err;
  2446	
  2447		bpf_dynptr_set_null(ptr__uninit);
  2448	
  2449		err = bpf_dynptr_check_size(size);
  2450		if (err)
  2451			return err;
  2452	
  2453		optval = kzalloc(size, GFP_KERNEL);
  2454		if (!optval)
  2455			return -ENOMEM;
  2456	
  2457		bpf_dynptr_init(ptr__uninit, optval,
  2458				BPF_DYNPTR_TYPE_CGROUP_SOCKOPT, 0, size);
  2459	
  2460		return size;
  2461	}
  2462	
  2463	/* Install the buffer of the dynptr into the sockopt context.
  2464	 *
  2465	 * This kfunc is only available for sleepable contexts. The dynptr should be
  2466	 * allocated by bpf_sockopt_dynptr_alloc().  The dynptr is invalid after
  2467	 * returning from this function successfully.
  2468	 */
> 2469	__bpf_kfunc int bpf_sockopt_dynptr_install(struct bpf_sockopt *sopt,
  2470						   struct bpf_dynptr_kern *ptr)
  2471	{
  2472		struct bpf_sockopt_kern *sopt_kern = (struct bpf_sockopt_kern *)sopt;
  2473	
  2474		if (!(sopt_kern->flags & BPF_SOCKOPT_FLAG_OPTVAL_REPLACE) ||
  2475		    bpf_dynptr_get_type(ptr) != BPF_DYNPTR_TYPE_CGROUP_SOCKOPT ||
  2476		    !ptr->data)
  2477			return -EINVAL;
  2478	
  2479		if (sopt_kern->optval == ptr->data &&
  2480		    !(sopt_kern->flags & BPF_SOCKOPT_FLAG_OPTVAL_USER)) {
  2481			/* This dynptr is initialized by bpf_sockopt_dynptr_from()
  2482			 * and the optval is not overwritten by
  2483			 * bpf_sockopt_dynptr_install() yet.
  2484			 */
  2485			bpf_dynptr_set_null(ptr);
  2486			sopt_kern->flags &= ~BPF_SOCKOPT_FLAG_OPTVAL_DYNPTR;
  2487			return 0;
  2488		}
  2489	
  2490		if (sopt_kern->optval &&
  2491		    !(sopt_kern->flags & (BPF_SOCKOPT_FLAG_OPTVAL_USER |
  2492					  BPF_SOCKOPT_FLAG_OPTVAL_DYNPTR)))
  2493			kfree(sopt_kern->optval);
  2494	
  2495		sopt_kern->optval = ptr->data;
  2496		sopt_kern->optval_end = ptr->data + __bpf_dynptr_size(ptr);
  2497		sopt_kern->optlen = __bpf_dynptr_size(ptr);
  2498		sopt_kern->flags &= ~(BPF_SOCKOPT_FLAG_OPTVAL_USER |
  2499				      BPF_SOCKOPT_FLAG_OPTVAL_DYNPTR);
  2500	
  2501		bpf_dynptr_set_null(ptr);
  2502	
  2503		return 0;
  2504	}
  2505	
> 2506	__bpf_kfunc int bpf_sockopt_dynptr_release(struct bpf_sockopt *sopt,
  2507						   struct bpf_dynptr_kern *ptr)
  2508	{
  2509		struct bpf_sockopt_kern *sopt_kern = (struct bpf_sockopt_kern *)sopt;
  2510	
  2511		if (bpf_dynptr_get_type(ptr) != BPF_DYNPTR_TYPE_CGROUP_SOCKOPT ||
  2512		    !ptr->data)
  2513			return -EINVAL;
  2514	
  2515		if (sopt_kern->optval == ptr->data &&
  2516		    !(sopt_kern->flags & BPF_SOCKOPT_FLAG_OPTVAL_USER))
  2517			/* This dynptr is initialized by bpf_sockopt_dynptr_from()
  2518			 * and the optval is not overwritten by
  2519			 * bpf_sockopt_dynptr_install() yet.
  2520			 */
  2521			sopt_kern->flags &= ~BPF_SOCKOPT_FLAG_OPTVAL_DYNPTR;
  2522		else
  2523			kfree(ptr->data);
  2524		bpf_dynptr_set_null(ptr);
  2525	
  2526		return 0;
  2527	}
  2528	
  2529	/* Initialize a sockopt dynptr from a user or installed optval pointer.
  2530	 *
  2531	 * sopt->optval can be a user pointer or a kernel pointer. A kernel pointer
  2532	 * can be a buffer allocated by the caller of the BPF program or a buffer
  2533	 * installed by other BPF programs through bpf_sockopt_dynptr_install().
  2534	 *
  2535	 * At most one dynptr may be created by this function at any moment, or
  2536	 * it will return -EINVAL. You can create another dynptr with this
  2537	 * function after releasing the previous one with bpf_sockopt_dynptr_release().
  2538	 *
  2539	 * A dynptr that is initialized when optval is a user pointer is an
  2540	 * exception. In this case, the dynptr will point to a kernel buffer with
  2541	 * the same content as the user buffer. To simplify the code, users should
  2542	 * always make sure that only one dynptr initialized by this function
  2543	 * exists at any moment.
  2544	 */
> 2545	__bpf_kfunc int bpf_sockopt_dynptr_from(struct bpf_sockopt *sopt,
  2546						struct bpf_dynptr_kern *ptr__uninit,
  2547						unsigned int size)
  2548	{
  2549		struct bpf_sockopt_kern *sopt_kern = (struct bpf_sockopt_kern *)sopt;
  2550		int err;
  2551	
  2552		bpf_dynptr_set_null(ptr__uninit);
  2553	
  2554		if (size > (sopt_kern->optval_end - sopt_kern->optval))
  2555			return -EINVAL;
  2556	
  2557		if (size == 0)
  2558			size = min(sopt_kern->optlen,
  2559				   (int)(sopt_kern->optval_end - sopt_kern->optval));
  2560	
  2561		if (sopt_kern->flags & BPF_SOCKOPT_FLAG_OPTVAL_DYNPTR)
  2562			return -EINVAL;
  2563	
  2564		if (sopt_kern->flags & BPF_SOCKOPT_FLAG_OPTVAL_USER) {
  2565			err = bpf_sockopt_dynptr_alloc(sopt, sopt_kern->optlen,
  2566						       ptr__uninit);
  2567			if (err >= 0)
  2568				err = copy_from_user(ptr__uninit->data,
  2569						     sopt_kern->optval,
  2570						     size);
  2571			return err;
  2572		}
  2573	
  2574		bpf_dynptr_init(ptr__uninit, sopt_kern->optval,
  2575				BPF_DYNPTR_TYPE_CGROUP_SOCKOPT, 0,
  2576				size);
  2577		sopt_kern->flags |= BPF_SOCKOPT_FLAG_OPTVAL_DYNPTR;
  2578	
  2579		return size;
  2580	}
  2581	
  2582	/**
  2583	 * int bpf_sockopt_dynptr_copy_to(struct bpf_sockopt *sopt,
  2584	 *				  struct bpf_dynptr_kern *ptr)
  2585	 *     Description
  2586	 *             Copy data from *ptr* to *sopt->optval*.
  2587	 *     Return
  2588	 *             >= 0 on success, or a negative error in case of failure.
  2589	 */
> 2590	__bpf_kfunc int bpf_sockopt_dynptr_copy_to(struct bpf_sockopt *sopt,
  2591						   struct bpf_dynptr_kern *ptr)
  2592	{
  2593		__u32 size = bpf_dynptr_size(ptr);
  2594	
  2595		struct bpf_sockopt_kern *sopt_kern = (struct bpf_sockopt_kern *)sopt;
  2596		int ret;
  2597	
  2598		if (size > (sopt_kern->optval_end - sopt_kern->optval))
  2599			return -EINVAL;
  2600	
  2601		if (sopt_kern->flags & BPF_SOCKOPT_FLAG_OPTVAL_USER) {
  2602			ret = copy_to_user(sopt_kern->optval, ptr->data,
  2603					   size);
  2604			if (unlikely(ret))
  2605				return -EFAULT;
  2606		} else {
  2607			/* Use memmove() in case of optval & ptr overlap. */
  2608			memmove(sopt_kern->optval, ptr->data, size);
  2609			ret = size;
  2610		}
  2611	
  2612		return ret;
  2613	}
  2614	

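One detail worth noting in bpf_sockopt_dynptr_copy_to() above is the memmove() in the kernel-destination path. A minimal userspace sketch (a hypothetical helper, not kernel code) of why an overlap-safe copy is needed there:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* When the dynptr was created from the sockopt buffer itself (as
 * bpf_sockopt_dynptr_from() does for kernel optval), source and
 * destination of the copy can overlap, and memcpy() on overlapping
 * regions is undefined behavior; memmove() handles it correctly.
 */
static void shift_left(char *buf, size_t len, size_t by)
{
	/* overlap-safe: copies buf[by..len) onto buf[0..len-by) */
	memmove(buf, buf + by, len - by);
}
```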
-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC bpf-next v3 3/5] bpf: Prevent BPF programs from access the buffer pointed by user_optval.
  2023-08-15 17:47 ` [RFC bpf-next v3 3/5] bpf: Prevent BPF programs from access the buffer pointed by user_optval thinker.li
@ 2023-08-17  0:55   ` Martin KaFai Lau
  2023-08-17 18:10     ` Kui-Feng Lee
  2023-08-17  1:17   ` Alexei Starovoitov
  1 sibling, 1 reply; 28+ messages in thread
From: Martin KaFai Lau @ 2023-08-17  0:55 UTC (permalink / raw)
  To: thinker.li
  Cc: sinquersw, kuifeng, bpf, ast, song, kernel-team, andrii, sdf,
	yonghong.song

On 8/15/23 10:47 AM, thinker.li@gmail.com wrote:
> From: Kui-Feng Lee <thinker.li@gmail.com>
> 
> Since the buffer pointed by ctx->user_optval is in user space, BPF programs
> in kernel space should not access it directly.  They should use
> bpf_copy_from_user() and bpf_copy_to_user() to move data between user and
> kernel space.
> 
> Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
> ---
>   kernel/bpf/cgroup.c   | 16 +++++++++--
>   kernel/bpf/verifier.c | 66 +++++++++++++++++++++----------------------
>   2 files changed, 47 insertions(+), 35 deletions(-)
> 
> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> index b977768a28e5..425094e071ba 100644
> --- a/kernel/bpf/cgroup.c
> +++ b/kernel/bpf/cgroup.c
> @@ -2494,12 +2494,24 @@ static bool cg_sockopt_is_valid_access(int off, int size,
>   	case offsetof(struct bpf_sockopt, optval):
>   		if (size != sizeof(__u64))
>   			return false;
> -		info->reg_type = PTR_TO_PACKET;
> +		if (prog->aux->sleepable)
> +			/* Prohibit access to the memory pointed by optval
> +			 * in sleepable programs.
> +			 */
> +			info->reg_type = PTR_TO_PACKET | MEM_USER;

Is MEM_USER needed to call bpf_copy_from_user()?

Also, from looking at patch 4, the optval could be changed from user memory to 
kernel memory during runtime. Is it useful to check MEM_USER during the verifier 
load time?

How about just returning false here to disallow sleepable progs from using 
->optval and ->optval_end? That would enforce that sleepable progs stay with the 
bpf_dynptr_read/write API and avoid needing the optval + len > optval_end check 
in the sleepable bpf prog. WDYT?

Regarding ->optlen, do you think the sleepable prog can stay with 
bpf_dynptr_size() and bpf_dynptr_adjust(), so that there is no need to expose 
optlen to the sleepable prog either?

> +		else
> +			info->reg_type = PTR_TO_PACKET;
>   		break;
>   	case offsetof(struct bpf_sockopt, optval_end):
>   		if (size != sizeof(__u64))
>   			return false;
> -		info->reg_type = PTR_TO_PACKET_END;
> +		if (prog->aux->sleepable)
> +			/* Prohibit access to the memory pointed by
> +			 * optval_end in sleepable programs.
> +			 */
> +			info->reg_type = PTR_TO_PACKET_END | MEM_USER;
> +		else
> +			info->reg_type = PTR_TO_PACKET_END;
>   		break;

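A rough userspace mock of the accessor-only model suggested above, where a sleepable program never touches optval/optval_end/optlen directly and all bounds checking lives in the accessors (mock names loosely mirror bpf_dynptr_read/bpf_dynptr_write/bpf_dynptr_size; the real kfunc signatures differ):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for bpf_dynptr; not the kernel type. */
struct mock_dynptr { unsigned char *data; size_t size; };

static size_t mock_dynptr_size(const struct mock_dynptr *p)
{
	return p->size;
}

/* Reads fail instead of walking past the buffer, so the program
 * needs no "optval + len > optval_end" check of its own.
 */
static int mock_dynptr_read(void *dst, size_t len,
			    const struct mock_dynptr *p, size_t off)
{
	if (off + len > p->size)
		return -1; /* out of bounds */
	memcpy(dst, p->data + off, len);
	return 0;
}

static int mock_dynptr_write(struct mock_dynptr *p, size_t off,
			     const void *src, size_t len)
{
	if (off + len > p->size)
		return -1; /* out of bounds */
	memcpy(p->data + off, src, len);
	return 0;
}
```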

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC bpf-next v3 5/5] selftests/bpf: Add test cases for sleepable BPF programs of the CGROUP_SOCKOPT type
  2023-08-16  0:03       ` Kui-Feng Lee
@ 2023-08-17  1:13         ` Martin KaFai Lau
  2023-08-17 18:36           ` Kui-Feng Lee
  0 siblings, 1 reply; 28+ messages in thread
From: Martin KaFai Lau @ 2023-08-17  1:13 UTC (permalink / raw)
  To: Kui-Feng Lee
  Cc: bpf, ast, song, kernel-team, andrii, yonghong.song, kuifeng,
	thinker.li, Stanislav Fomichev

On 8/15/23 5:03 PM, Kui-Feng Lee wrote:
>>>> +SEC("cgroup/getsockopt.s")
>>>> +int _getsockopt_s(struct bpf_sockopt *ctx)
>>>> +{
>>>> +    struct tcp_zerocopy_receive *zcvr;
>>>> +    struct bpf_dynptr optval_dynptr;
>>>> +    struct sockopt_sk *storage;
>>>> +    __u8 *optval, *optval_end;
>>>> +    struct bpf_sock *sk;
>>>> +    char buf[1];
>>>> +    __u64 addr;
>>>> +    int ret;
>>>> +
>>>> +    if (skip_sleepable)
>>>> +        return 1;
>>>> +
>>>> +    /* Bypass AF_NETLINK. */
>>>> +    sk = ctx->sk;
>>>> +    if (sk && sk->family == AF_NETLINK)
>>>> +        return 1;
>>>> +
>>>> +    optval = ctx->optval;
>>>> +    optval_end = ctx->optval_end;
>>>> +
>>>> +    /* Make sure bpf_get_netns_cookie is callable.
>>>> +     */
>>>> +    if (bpf_get_netns_cookie(NULL) == 0)
>>>> +        return 0;
>>>> +
>>>> +    if (bpf_get_netns_cookie(ctx) == 0)
>>>> +        return 0;
>>>> +
>>>> +    if (ctx->level == SOL_IP && ctx->optname == IP_TOS) {
>>>> +        /* Not interested in SOL_IP:IP_TOS;
>>>> +         * let next BPF program in the cgroup chain or kernel
>>>> +         * handle it.
>>>> +         */
>>>> +        return 1;
>>>> +    }
>>>> +
>>>> +    if (ctx->level == SOL_SOCKET && ctx->optname == SO_SNDBUF) {
>>>> +        /* Not interested in SOL_SOCKET:SO_SNDBUF;
>>>> +         * let next BPF program in the cgroup chain or kernel
>>>> +         * handle it.
>>>> +         */
>>>> +        return 1;
>>>> +    }
>>>> +
>>>> +    if (ctx->level == SOL_TCP && ctx->optname == TCP_CONGESTION) {
>>>> +        /* Not interested in SOL_TCP:TCP_CONGESTION;
>>>> +         * let next BPF program in the cgroup chain or kernel
>>>> +         * handle it.
>>>> +         */
>>>> +        return 1;
>>>> +    }
>>>> +
>>>> +    if (ctx->level == SOL_TCP && ctx->optname == TCP_ZEROCOPY_RECEIVE) {
>>>> +        /* Verify that TCP_ZEROCOPY_RECEIVE triggers.
>>>> +         * It has a custom implementation for performance
>>>> +         * reasons.
>>>> +         */
>>>> +
>>>> +        bpf_sockopt_dynptr_from(ctx, &optval_dynptr, sizeof(*zcvr));
>>>> +        zcvr = bpf_dynptr_data(&optval_dynptr, 0, sizeof(*zcvr));
>>>> +        addr = zcvr ? zcvr->address : 0;
>>>> +        bpf_sockopt_dynptr_release(ctx, &optval_dynptr);
>>>
>>> This starts to look more usable, thank you for the changes!
>>> Let me poke the api a bit more, I'm not super familiar with the dynptrs.
>>>
>>> here: bpf_sockopt_dynptr_from should probably be called
>>> bpf_dynptr_from_sockopt to match bpf_dynptr_from_mem?
>>
>> agree!
>>
>>>
>>>> +
>>>> +        return addr != 0 ? 0 : 1;
>>>> +    }
>>>> +
>>>> +    if (ctx->level == SOL_IP && ctx->optname == IP_FREEBIND) {
>>>> +        if (optval + 1 > optval_end)
>>>> +            return 0; /* bounds check */
>>>> +
>>>> +        ctx->retval = 0; /* Reset system call return value to zero */
>>>> +
>>>> +        /* Always export 0x55 */
>>>> +        buf[0] = 0x55;
>>>> +        ret = bpf_sockopt_dynptr_alloc(ctx, 1, &optval_dynptr);
>>>> +        if (ret >= 0) {
>>>> +            bpf_dynptr_write(&optval_dynptr, 0, buf, 1, 0);
>>>> +            ret = bpf_sockopt_dynptr_copy_to(ctx, &optval_dynptr);
>>>> +        }
>>>> +        bpf_sockopt_dynptr_release(ctx, &optval_dynptr);
>>>
>>> Does bpf_sockopt_dynptr_alloc and bpf_sockopt_dynptr_release need to be
>>> sockopt specific? Seems like we should provide, instead, some generic
>>> bpf_dynptr_alloc/bpf_dynptr_release and make
>>> bpf_sockopt_dynptr_copy_to/install work with them? WDYT?
>>
>> I found that kmalloc cannot be called in NMI context or from the slab
>> or page-allocation paths. That is why we don't have generic functions
>> like bpf_dynptr_alloc/bpf_dynptr_release yet: someone would first need
>> to implement an allocator for BPF programs, and bpf_dynptr_alloc could
>> then be built upon it. (An implementation of bpf_dynptr_alloc() existed
>> in early versions of the dynptr patchset, but it was removed before
>> landing.)
>>
>>
>>>
>>>> +        if (ret < 0)
>>>> +            return 0;
>>>> +        ctx->optlen = 1;
>>>> +
>>>> +        /* Userspace buffer is PAGE_SIZE * 2, but BPF
>>>> +         * program can only see the first PAGE_SIZE
>>>> +         * bytes of data.
>>>> +         */
>>>> +        if (optval_end - optval != page_size && 0)
>>>> +            return 0; /* unexpected data size */
>>>> +
>>>> +        return 1;
>>>> +    }
>>>> +
>>>> +    if (ctx->level != SOL_CUSTOM)
>>>> +        return 0; /* deny everything except custom level */
>>>> +
>>>> +    if (optval + 1 > optval_end)
>>>> +        return 0; /* bounds check */
>>>> +
>>>> +    storage = bpf_sk_storage_get(&socket_storage_map, ctx->sk, 0,
>>>> +                     BPF_SK_STORAGE_GET_F_CREATE);
>>>> +    if (!storage)
>>>> +        return 0; /* couldn't get sk storage */
>>>> +
>>>> +    if (!ctx->retval)
>>>> +        return 0; /* kernel should not have handled
>>>> +               * SOL_CUSTOM, something is wrong!
>>>> +               */
>>>> +    ctx->retval = 0; /* Reset system call return value to zero */
>>>> +
>>>> +    buf[0] = storage->val;
>>>> +    ret = bpf_sockopt_dynptr_alloc(ctx, 1, &optval_dynptr);
>>>> +    if (ret >= 0) {
>>>> +        bpf_dynptr_write(&optval_dynptr, 0, buf, 1, 0);
>>>> +        ret = bpf_sockopt_dynptr_copy_to(ctx, &optval_dynptr);
>>>> +    }
>>>> +    bpf_sockopt_dynptr_release(ctx, &optval_dynptr);
>>>> +    if (ret < 0)
>>>> +        return 0;
>>>> +    ctx->optlen = 1;
>>>> +
>>>> +    return 1;
>>>> +}
>>>> +
>>>>   SEC("cgroup/setsockopt")
>>>>   int _setsockopt(struct bpf_sockopt *ctx)
>>>>   {
>>>> @@ -144,6 +281,9 @@ int _setsockopt(struct bpf_sockopt *ctx)
>>>>       struct sockopt_sk *storage;
>>>>       struct bpf_sock *sk;
>>>> +    if (skip_nonsleepable)
>>>> +        return 1;
>>>> +
>>>>       /* Bypass AF_NETLINK. */
>>>>       sk = ctx->sk;
>>>>       if (sk && sk->family == AF_NETLINK)
>>>> @@ -236,3 +376,120 @@ int _setsockopt(struct bpf_sockopt *ctx)
>>>>           ctx->optlen = 0;
>>>>       return 1;
>>>>   }
>>>> +
>>>> +SEC("cgroup/setsockopt.s")
>>>> +int _setsockopt_s(struct bpf_sockopt *ctx)
>>>> +{
>>>> +    struct bpf_dynptr optval_buf;
>>>> +    struct sockopt_sk *storage;
>>>> +    __u8 *optval, *optval_end;
>>>> +    struct bpf_sock *sk;
>>>> +    __u8 tmp_u8;
>>>> +    __u32 tmp;
>>>> +    int ret;
>>>> +
>>>> +    if (skip_sleepable)
>>>> +        return 1;
>>>> +
>>>> +    optval = ctx->optval;
>>>> +    optval_end = ctx->optval_end;
>>>> +
>>>> +    /* Bypass AF_NETLINK. */
>>>> +    sk = ctx->sk;
>>>> +    if (sk && sk->family == AF_NETLINK)
>>>> +        return -1;
>>>> +
>>>> +    /* Make sure bpf_get_netns_cookie is callable.
>>>> +     */
>>>> +    if (bpf_get_netns_cookie(NULL) == 0)
>>>> +        return 0;
>>>> +
>>>> +    if (bpf_get_netns_cookie(ctx) == 0)
>>>> +        return 0;
>>>> +
>>>> +    if (ctx->level == SOL_IP && ctx->optname == IP_TOS) {
>>>> +        /* Not interested in SOL_IP:IP_TOS;
>>>> +         * let next BPF program in the cgroup chain or kernel
>>>> +         * handle it.
>>>> +         */
>>>> +        ctx->optlen = 0; /* bypass optval>PAGE_SIZE */
>>>> +        return 1;
>>>> +    }
>>>> +
>>>> +    if (ctx->level == SOL_SOCKET && ctx->optname == SO_SNDBUF) {
>>>> +        /* Overwrite SO_SNDBUF value */
>>>> +
>>>> +        ret = bpf_sockopt_dynptr_alloc(ctx, sizeof(__u32),
>>>> +                           &optval_buf);
>>>> +        if (ret < 0)
>>>> +            bpf_sockopt_dynptr_release(ctx, &optval_buf);
>>>> +        else {
>>>> +            tmp = 0x55AA;
>>>> +            bpf_dynptr_write(&optval_buf, 0, &tmp, sizeof(tmp), 0);
>>>> +            ret = bpf_sockopt_dynptr_install(ctx, &optval_buf);
>>>
>>> One thing I'm still slightly confused about is
>>> bpf_sockopt_dynptr_install vs bpf_sockopt_dynptr_copy_to. I do
>>> understand that it comes from getsockopt vs setsockopt (and the ability,
>>> in setsockopt, to allocate larger buffers), but I wonder whether
>>> we can hide everything under single bpf_sockopt_dynptr_copy_to call?
>>>
>>> For getsockopt, it stays as is. For setsockopt, it would work like
>>> _install currently does. Would that work? From user perspective,
>>> if we provide a simple call that does the right thing, seems a bit
>>> more usable? The only problem is probably the fact that _install
>>> explicitly moves the ownership, but I don't see why copy_to can't
>>> have the same "consume" semantics?
> 
> Sorry for missing this part!
> This overloading is counterintuitive to me.
> *_copy_to() would no longer copy/overwrite the buffer, but replace the
> buffer instead, and its side effects would change depending on the
> context. I would prefer a different name over reusing *_copy_to().
> 
> We probably need a better name than copy_to if we want to merge these
> two functions.

It also took me some time to work out the alloc/install/copy_to/release usage.
IIUC, to change optval in getsockopt, it is alloc => write => copy_to => release;
      to change optval in setsockopt, it is alloc => write => install.
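For reference, the two call sequences above could be sketched in BPF-C like
this (kfunc names are taken from this RFC; the exact signatures are
assumptions and the API is still under discussion):

```c
/* Hedged sketch, not a final API: kfunc names from this RFC,
 * signatures assumed.
 */
SEC("cgroup/getsockopt.s")
int get_prog(struct bpf_sockopt *ctx)
{
	struct bpf_dynptr buf;
	__u32 val = 1;

	/* getsockopt: alloc => write => copy_to => release */
	if (bpf_sockopt_dynptr_alloc(ctx, sizeof(val), &buf) < 0)
		return 0;
	bpf_dynptr_write(&buf, 0, &val, sizeof(val), 0);
	bpf_sockopt_dynptr_copy_to(ctx, &buf);	/* copy out to optval */
	bpf_sockopt_dynptr_release(ctx, &buf);	/* prog keeps ownership */
	return 1;
}

SEC("cgroup/setsockopt.s")
int set_prog(struct bpf_sockopt *ctx)
{
	struct bpf_dynptr buf;
	__u32 val = 1;

	/* setsockopt: alloc => write => install */
	if (bpf_sockopt_dynptr_alloc(ctx, sizeof(val), &buf) < 0)
		return 0;
	bpf_dynptr_write(&buf, 0, &val, sizeof(val), 0);
	bpf_sockopt_dynptr_install(ctx, &buf);	/* ownership moves to ctx */
	return 1;
}
```

The asymmetry (release vs. install) is the ownership transfer being
questioned in this subthread.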

Can the alloc and user-or-kernel memory details be handled inside
bpf_dynptr_write() instead, such that both BPF_CGROUP_GETSOCKOPT and
BPF_CGROUP_SETSOCKOPT programs can just call one bpf_dynptr_write() to update
the optval? I mean bpf_dynptr_write() should have the needed context to decide
whether it can write directly to the __user ptr or needs to write to kernel
memory, and whether kmalloc is needed/allowed.

Same for reading: bpf_dynptr_read() should know whether to read from user or
kernel memory. bpf_dynptr_data() may not work well if the underlying sockopt
memory is still backed by user memory, so probably don't support
bpf_dynptr_data() for sockopt. Support bpf_dynptr_slice() and
bpf_dynptr_slice_rdwr() instead. Take a look at how the "buffer__opt" arg is
used by the xdp dynptr: if the underlying sockopt is still backed by user
memory, bpf_dynptr_slice() can do a copy_from_user() into the "buffer__opt".

bpf_dynptr_from_cgrp_sockopt() should be the only function to initialize a
dynptr from the 'struct bpf_sockopt *ctx'. It probably doesn't need to
allocate any memory at init time.
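A rough sketch of how a program might look under this proposal
(bpf_dynptr_from_cgrp_sockopt() and its signature are assumptions taken from
this discussion, not an existing API):

```c
/* Hedged sketch of the proposed single-init interface. */
SEC("cgroup/setsockopt.s")
int set_prog(struct bpf_sockopt *ctx)
{
	struct bpf_dynptr ptr;
	__u32 buf, *val;

	/* Single init point; no allocation happens here. */
	if (bpf_dynptr_from_cgrp_sockopt(ctx, &ptr))
		return 0;

	/* bpf_dynptr_slice() decides internally whether optval is kernel
	 * memory (return a direct pointer) or user memory (copy_from_user
	 * into the caller-provided buffer__opt, i.e. &buf).
	 */
	val = bpf_dynptr_slice(&ptr, 0, &buf, sizeof(buf));
	if (!val)
		return 0;

	/* Writes likewise go through bpf_dynptr_write(), which can target
	 * either user or kernel memory based on the ctx.
	 */
	buf = *val | 0x1;
	bpf_dynptr_write(&ptr, 0, &buf, sizeof(buf), 0);
	return 1;
}
```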

Is there something specific to cgrp sockopt that makes this difficult, or any
downside to the above?

WDYT?


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC bpf-next v3 3/5] bpf: Prevent BPF programs from access the buffer pointed by user_optval.
  2023-08-15 17:47 ` [RFC bpf-next v3 3/5] bpf: Prevent BPF programs from access the buffer pointed by user_optval thinker.li
  2023-08-17  0:55   ` Martin KaFai Lau
@ 2023-08-17  1:17   ` Alexei Starovoitov
  2023-08-17 18:12     ` Kui-Feng Lee
  1 sibling, 1 reply; 28+ messages in thread
From: Alexei Starovoitov @ 2023-08-17  1:17 UTC (permalink / raw)
  To: thinker.li
  Cc: bpf, ast, martin.lau, song, kernel-team, andrii, sdf,
	yonghong.song, sinquersw, kuifeng

On Tue, Aug 15, 2023 at 10:47:10AM -0700, thinker.li@gmail.com wrote:
> From: Kui-Feng Lee <thinker.li@gmail.com>
> 
> Since the buffer pointed by ctx->user_optval is in user space, BPF programs
> in kernel space should not access it directly.  They should use
> bpf_copy_from_user() and bpf_copy_to_user() to move data between user and
> kernel space.
> 
> Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
> ---
>  kernel/bpf/cgroup.c   | 16 +++++++++--
>  kernel/bpf/verifier.c | 66 +++++++++++++++++++++----------------------
>  2 files changed, 47 insertions(+), 35 deletions(-)
> 
> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> index b977768a28e5..425094e071ba 100644
> --- a/kernel/bpf/cgroup.c
> +++ b/kernel/bpf/cgroup.c
> @@ -2494,12 +2494,24 @@ static bool cg_sockopt_is_valid_access(int off, int size,
>  	case offsetof(struct bpf_sockopt, optval):
>  		if (size != sizeof(__u64))
>  			return false;
> -		info->reg_type = PTR_TO_PACKET;
> +		if (prog->aux->sleepable)
> +			/* Prohibit access to the memory pointed by optval
> +			 * in sleepable programs.
> +			 */
> +			info->reg_type = PTR_TO_PACKET | MEM_USER;
> +		else
> +			info->reg_type = PTR_TO_PACKET;
>  		break;
>  	case offsetof(struct bpf_sockopt, optval_end):
>  		if (size != sizeof(__u64))
>  			return false;
> -		info->reg_type = PTR_TO_PACKET_END;
> +		if (prog->aux->sleepable)
> +			/* Prohibit access to the memory pointed by
> +			 * optval_end in sleepable programs.
> +			 */
> +			info->reg_type = PTR_TO_PACKET_END | MEM_USER;

This doesn't look correct.
Packet memory and user memory are non-overlapping address spaces.
There cannot be packet memory that is also user memory.
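One way to resolve this comment would be a dedicated base register type for
user-space optval instead of tagging PTR_TO_PACKET with MEM_USER. The sketch
below is only illustrative: PTR_TO_SOCKOPT_USER is a hypothetical name, not
an existing verifier type.

```c
/* Hedged sketch: a distinct verifier type for user-space optval. */
case offsetof(struct bpf_sockopt, optval):
	if (size != sizeof(__u64))
		return false;
	if (prog->aux->sleepable)
		/* Not packet memory; the verifier rejects direct
		 * loads/stores through this hypothetical type, so the
		 * pointer can only be used for bounds arithmetic.
		 */
		info->reg_type = PTR_TO_SOCKOPT_USER;
	else
		info->reg_type = PTR_TO_PACKET;
	break;
```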

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC bpf-next v3 4/5] bpf: Add a new dynptr type for CGRUP_SOCKOPT.
  2023-08-15 17:47 ` [RFC bpf-next v3 4/5] bpf: Add a new dynptr type for CGRUP_SOCKOPT thinker.li
  2023-08-16 14:19   ` kernel test robot
@ 2023-08-17  1:25   ` Alexei Starovoitov
  2023-08-17 19:00     ` Kui-Feng Lee
  2023-08-17 20:41     ` Martin KaFai Lau
  1 sibling, 2 replies; 28+ messages in thread
From: Alexei Starovoitov @ 2023-08-17  1:25 UTC (permalink / raw)
  To: thinker.li
  Cc: bpf, ast, martin.lau, song, kernel-team, andrii, sdf,
	yonghong.song, sinquersw, kuifeng

On Tue, Aug 15, 2023 at 10:47:11AM -0700, thinker.li@gmail.com wrote:
>  
> +BTF_SET8_START(cgroup_common_btf_ids)
> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_copy_to, KF_SLEEPABLE)
> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_alloc, KF_SLEEPABLE)
> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_install, KF_SLEEPABLE)
> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_release, KF_SLEEPABLE)
> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_from, KF_SLEEPABLE)
> +BTF_SET8_END(cgroup_common_btf_ids)

These shouldn't be sockopt specific.
If we want a dynptr to represent a pointer to contiguous user memory,
we should use generic kfuncs that do so.

I suspect a single new kfunc, bpf_dynptr_from_user_mem(), would do.
The new dynptr type can be hidden in the kernel, and all existing
dynptr kfuncs (dynptr_slice, dynptr_data, dynptr_write) could be made to work
with user memory.
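To illustrate the suggestion: bpf_dynptr_from_user_mem() does not exist; the
name comes from this review comment and the signature below is purely
hypothetical.

```c
/* Hedged sketch of the suggested generic user-memory dynptr. */
SEC("cgroup/setsockopt.s")
int set_prog(struct bpf_sockopt *ctx)
{
	struct bpf_dynptr uptr;
	__u32 buf, *val;

	/* Wrap the user-space optval buffer in a dynptr (hypothetical
	 * kfunc; the dynptr type itself stays hidden in the kernel).
	 */
	if (bpf_dynptr_from_user_mem(ctx->user_optval, ctx->optlen, &uptr))
		return 0;

	/* Existing dynptr kfuncs then work transparently: slice copies
	 * from user memory into the bounce buffer when needed ...
	 */
	val = bpf_dynptr_slice(&uptr, 0, &buf, sizeof(buf));
	if (!val)
		return 0;

	/* ... and write copies back out to user memory. */
	buf = *val + 1;
	bpf_dynptr_write(&uptr, 0, &buf, sizeof(buf), 0);
	return 1;
}
```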

But I think we have to step back. Why do we need this whole thing in the first place?
_Why_ do sockopt bpf progs need to read and write user memory?

Yes, there is the one-page limit, but what is the use case for actually reading
and writing beyond that? iptables sockopt was mentioned, but I don't think a bpf
prog can do anything useful with iptables binary blobs. They are hard enough for
the kernel to parse.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC bpf-next v3 3/5] bpf: Prevent BPF programs from access the buffer pointed by user_optval.
  2023-08-17  0:55   ` Martin KaFai Lau
@ 2023-08-17 18:10     ` Kui-Feng Lee
  0 siblings, 0 replies; 28+ messages in thread
From: Kui-Feng Lee @ 2023-08-17 18:10 UTC (permalink / raw)
  To: Martin KaFai Lau, thinker.li
  Cc: kuifeng, bpf, ast, song, kernel-team, andrii, sdf, yonghong.song



On 8/16/23 17:55, Martin KaFai Lau wrote:
> On 8/15/23 10:47 AM, thinker.li@gmail.com wrote:
>> From: Kui-Feng Lee <thinker.li@gmail.com>
>>
>> Since the buffer pointed by ctx->user_optval is in user space, BPF 
>> programs
>> in kernel space should not access it directly.  They should use
>> bpf_copy_from_user() and bpf_copy_to_user() to move data between user and
>> kernel space.
>>
>> Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
>> ---
>>   kernel/bpf/cgroup.c   | 16 +++++++++--
>>   kernel/bpf/verifier.c | 66 +++++++++++++++++++++----------------------
>>   2 files changed, 47 insertions(+), 35 deletions(-)
>>
>> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
>> index b977768a28e5..425094e071ba 100644
>> --- a/kernel/bpf/cgroup.c
>> +++ b/kernel/bpf/cgroup.c
>> @@ -2494,12 +2494,24 @@ static bool cg_sockopt_is_valid_access(int 
>> off, int size,
>>       case offsetof(struct bpf_sockopt, optval):
>>           if (size != sizeof(__u64))
>>               return false;
>> -        info->reg_type = PTR_TO_PACKET;
>> +        if (prog->aux->sleepable)
>> +            /* Prohibit access to the memory pointed by optval
>> +             * in sleepable programs.
>> +             */
>> +            info->reg_type = PTR_TO_PACKET | MEM_USER;
> 
> Is MEM_USER needed to call bpf_copy_from_user()?
> 
> Also, from looking at patch 4, the optval could be changed from user 
> memory to kernel memory during runtime. Is it useful to check MEM_USER 
> during the verifier load time?

It has been checked.
optval & optval_end are exported and can be used to compute the size
of the user space buffer. However, they cannot be used to read the
contents of the user space buffer.

To be specific, __check_mem_access() will fail because of the MEM_USER
flag in reg->type. However, this is implicit. I will make it explicit if
necessary.

> 
> How about just return false here to disallow sleepable prog to use 
> ->optval and ->optval_end. Enforce sleepable prog to stay with the 
> bpf_dynptr_read/write API and avoid needing the optval + len > 
> optval_end check in the sleepable bpf prog. WDYT?

Then we would need to export another variable to get the size of the buffer
pointed to by optval. In that case, I would prefer a new context type,
instead of struct bpf_sockopt, for the sleepable programs. With the new
type, we can remove optval & optval_end completely from the view of
sleepable programs, and accesses to them would fail as early as compile
time.

> 
> Regarding ->optlen, do you think the sleepable prog can stay with the 
> bpf_dynptr_size() and bpf_dynptr_adjust() such that no need to expose 
> optlen to the sleepable prog also.
> 
>> +        else
>> +            info->reg_type = PTR_TO_PACKET;
>>           break;
>>       case offsetof(struct bpf_sockopt, optval_end):
>>           if (size != sizeof(__u64))
>>               return false;
>> -        info->reg_type = PTR_TO_PACKET_END;
>> +        if (prog->aux->sleepable)
>> +            /* Prohibit access to the memory pointed by
>> +             * optval_end in sleepable programs.
>> +             */
>> +            info->reg_type = PTR_TO_PACKET_END | MEM_USER;
>> +        else
>> +            info->reg_type = PTR_TO_PACKET_END;
>>           break;
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC bpf-next v3 3/5] bpf: Prevent BPF programs from access the buffer pointed by user_optval.
  2023-08-17  1:17   ` Alexei Starovoitov
@ 2023-08-17 18:12     ` Kui-Feng Lee
  0 siblings, 0 replies; 28+ messages in thread
From: Kui-Feng Lee @ 2023-08-17 18:12 UTC (permalink / raw)
  To: Alexei Starovoitov, thinker.li
  Cc: bpf, ast, martin.lau, song, kernel-team, andrii, sdf,
	yonghong.song, kuifeng



On 8/16/23 18:17, Alexei Starovoitov wrote:
> On Tue, Aug 15, 2023 at 10:47:10AM -0700, thinker.li@gmail.com wrote:
>> From: Kui-Feng Lee <thinker.li@gmail.com>
>>
>> Since the buffer pointed by ctx->user_optval is in user space, BPF programs
>> in kernel space should not access it directly.  They should use
>> bpf_copy_from_user() and bpf_copy_to_user() to move data between user and
>> kernel space.
>>
>> Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
>> ---
>>   kernel/bpf/cgroup.c   | 16 +++++++++--
>>   kernel/bpf/verifier.c | 66 +++++++++++++++++++++----------------------
>>   2 files changed, 47 insertions(+), 35 deletions(-)
>>
>> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
>> index b977768a28e5..425094e071ba 100644
>> --- a/kernel/bpf/cgroup.c
>> +++ b/kernel/bpf/cgroup.c
>> @@ -2494,12 +2494,24 @@ static bool cg_sockopt_is_valid_access(int off, int size,
>>   	case offsetof(struct bpf_sockopt, optval):
>>   		if (size != sizeof(__u64))
>>   			return false;
>> -		info->reg_type = PTR_TO_PACKET;
>> +		if (prog->aux->sleepable)
>> +			/* Prohibit access to the memory pointed by optval
>> +			 * in sleepable programs.
>> +			 */
>> +			info->reg_type = PTR_TO_PACKET | MEM_USER;
>> +		else
>> +			info->reg_type = PTR_TO_PACKET;
>>   		break;
>>   	case offsetof(struct bpf_sockopt, optval_end):
>>   		if (size != sizeof(__u64))
>>   			return false;
>> -		info->reg_type = PTR_TO_PACKET_END;
>> +		if (prog->aux->sleepable)
>> +			/* Prohibit access to the memory pointed by
>> +			 * optval_end in sleepable programs.
>> +			 */
>> +			info->reg_type = PTR_TO_PACKET_END | MEM_USER;
> 
> This doesn't look correct.
> Packet memory and user memory are non-overlapping address spaces.
> There cannot be packet memory that is also user memory.

Got it! I will find a new type to replace this one.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC bpf-next v3 5/5] selftests/bpf: Add test cases for sleepable BPF programs of the CGROUP_SOCKOPT type
  2023-08-17  1:13         ` Martin KaFai Lau
@ 2023-08-17 18:36           ` Kui-Feng Lee
  0 siblings, 0 replies; 28+ messages in thread
From: Kui-Feng Lee @ 2023-08-17 18:36 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, ast, song, kernel-team, andrii, yonghong.song, kuifeng,
	thinker.li, Stanislav Fomichev



On 8/16/23 18:13, Martin KaFai Lau wrote:
> On 8/15/23 5:03 PM, Kui-Feng Lee wrote:
>>>>> [... full quote of the selftest code and the earlier review exchange
>>>>> snipped; it is identical to the quote earlier in this thread ...]
>>
>> Sorry for missing this part!
>> This overloading is counterintuitive to me.
>> *_copy_to() would no longer copy/overwrite the buffer, but replace the
>> buffer instead, and its side effects would change depending on the
>> context. I would prefer a different name over reusing *_copy_to().
>>
>> We probably need a better name than copy_to if we want to merge these
>> two functions.
> 
> It also took me some time to realize the alloc/install/copy_to/release 
> usages. iiuc, to change optval in getsockopt, it is 
> alloc=>write=>copy_to=>release.
>        to change optval in setsockopt, it is alloc=>write=>install.
> 
> Can this alloc and user-or-kernel memory details be done in the 
> bpf_dynptr_write() instead such that both BPF_CGROUP_GETSOCKOPT and 
> BPF_CGROUP_SETSOCKOPT program can just call one bpf_dynptr_write() to 
> update the optval? I meant bpf_dynptr_write() should have the needed 
> context to decide if it can directly write to the __user ptr or it needs 
> to write to a kernel memory and if kmalloc is needed/allowed.

For v3, you can just call bpf_dynptr_write() to update the buffer
without installing it if optval is already an in-kernel buffer.
But you can still call *_install() even when it is unnecessary.
The code in the test case tries to keep things simple, so
it always calls *_install() without any check.
> 
> Same for reading. bpf_dynptr_read() should know it should read from user 
> or kernel memory. bpf_dynptr_data() may not work well if the underlying 
> sockopt memory is still backed by user memory, so probably don't support 
> bpf_dynptr_data() for sockopt. Support the bpf_dynptr_slice() and 
> bpf_dynptr_slice_rdwr() instead. Take a look at how the "buffer__opt" 
> arg is used by the xdp dynptr. If the underlying sockopt is still backed 
> by user memory, the bpf_dynptr_slice() can do a copy_from_user to the 
> "buffer__opt".

> 
> bpf_dynptr_from_cgrp_sockopt() should be the only function to initialize 
> a dynptr from the 'struct bpf_sockopt *ctx'. Probably don't need to 
> allocate any memory at the init time.
> 
> Is there something specific to cgrp sockopt that is difficult and any 
> downside of the above?
> 
> WDYT?
> 

This is doable.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC bpf-next v3 4/5] bpf: Add a new dynptr type for CGRUP_SOCKOPT.
  2023-08-17  1:25   ` Alexei Starovoitov
@ 2023-08-17 19:00     ` Kui-Feng Lee
  2023-08-17 19:43       ` Alexei Starovoitov
  2023-08-17 20:41     ` Martin KaFai Lau
  1 sibling, 1 reply; 28+ messages in thread
From: Kui-Feng Lee @ 2023-08-17 19:00 UTC (permalink / raw)
  To: Alexei Starovoitov, thinker.li
  Cc: bpf, ast, martin.lau, song, kernel-team, andrii, sdf,
	yonghong.song, kuifeng



On 8/16/23 18:25, Alexei Starovoitov wrote:
> On Tue, Aug 15, 2023 at 10:47:11AM -0700, thinker.li@gmail.com wrote:
>>   
>> +BTF_SET8_START(cgroup_common_btf_ids)
>> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_copy_to, KF_SLEEPABLE)
>> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_alloc, KF_SLEEPABLE)
>> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_install, KF_SLEEPABLE)
>> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_release, KF_SLEEPABLE)
>> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_from, KF_SLEEPABLE)
>> +BTF_SET8_END(cgroup_common_btf_ids)
> 
> These shouldn't be sockopt specific.
> If we want dynptr to represent a pointer to a user contiguous user memory
> we should use generic kfunc that do so.
> 
> I suspect a single new kfunc: bpf_dynptr_from_user_mem() would do.
> New dynptr type can be hidden in the kernel and all existing
> kfuncs dynptr_slice, dynptr_data, dynptr_write could be made to work
> with user memory.
> 
> But I think we have to step back. Why do we need this whole thing in the first place?
> _why_ sockopt bpf progs needs to read and write user memory?
> 
> Yes there is one page limit, but what is the use case to actually read and write
> beyond that? iptables sockopt was mentioned, but I don't think bpf prog can do
> anything useful with iptables binary blobs. They are hard enough for kernel to parse.

The idea behind the design is to let the developers of filters decide
when to replace the existing buffer, and to access the contents of
buffers just like raw pointers. However, it seems almost everyone
prefers to use *_read() & *_write(). I will move in that direction.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC bpf-next v3 4/5] bpf: Add a new dynptr type for CGRUP_SOCKOPT.
  2023-08-17 19:00     ` Kui-Feng Lee
@ 2023-08-17 19:43       ` Alexei Starovoitov
  2023-08-18  0:14         ` Kui-Feng Lee
  0 siblings, 1 reply; 28+ messages in thread
From: Alexei Starovoitov @ 2023-08-17 19:43 UTC (permalink / raw)
  To: Kui-Feng Lee
  Cc: Kui-Feng Lee, bpf, Alexei Starovoitov, Martin KaFai Lau, Song Liu,
	Kernel Team, Andrii Nakryiko, Stanislav Fomichev, Yonghong Song,
	Kui-Feng Lee

On Thu, Aug 17, 2023 at 12:00 PM Kui-Feng Lee <sinquersw@gmail.com> wrote:
>
>
>
> On 8/16/23 18:25, Alexei Starovoitov wrote:
> > On Tue, Aug 15, 2023 at 10:47:11AM -0700, thinker.li@gmail.com wrote:
> >>
> >> +BTF_SET8_START(cgroup_common_btf_ids)
> >> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_copy_to, KF_SLEEPABLE)
> >> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_alloc, KF_SLEEPABLE)
> >> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_install, KF_SLEEPABLE)
> >> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_release, KF_SLEEPABLE)
> >> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_from, KF_SLEEPABLE)
> >> +BTF_SET8_END(cgroup_common_btf_ids)
> >
> > These shouldn't be sockopt specific.
> > If we want dynptr to represent a pointer to a user contiguous user memory
> > we should use generic kfunc that do so.
> >
> > I suspect a single new kfunc: bpf_dynptr_from_user_mem() would do.
> > New dynptr type can be hidden in the kernel and all existing
> > kfuncs dynptr_slice, dynptr_data, dynptr_write could be made to work
> > with user memory.
> >
> > But I think we have to step back. Why do we need this whole thing in the first place?
> > _why_ sockopt bpf progs needs to read and write user memory?
> >
> > Yes there is one page limit, but what is the use case to actually read and write
> > beyond that? iptables sockopt was mentioned, but I don't think bpf prog can do
> > anything useful with iptables binary blobs. They are hard enough for kernel to parse.
>
> The ideal behind the design is let the developers of filters to decide
> when to replace the existing buffer.  And, access the content of
> buffers just like accessing raw pointers. However, seems almost everyone
> love to use *_read() & *_write(). I will move to that direction.

That doesn't answer my question about the use case.
What kind of filters want to parse more than 4k of sockopt data?

* Re: [RFC bpf-next v3 4/5] bpf: Add a new dynptr type for CGRUP_SOCKOPT.
  2023-08-17  1:25   ` Alexei Starovoitov
  2023-08-17 19:00     ` Kui-Feng Lee
@ 2023-08-17 20:41     ` Martin KaFai Lau
  2023-08-17 21:37       ` Yonghong Song
  2023-08-17 21:46       ` Alexei Starovoitov
  1 sibling, 2 replies; 28+ messages in thread
From: Martin KaFai Lau @ 2023-08-17 20:41 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, ast, song, kernel-team, andrii, sdf, yonghong.song,
	sinquersw, kuifeng, thinker.li

On 8/16/23 6:25 PM, Alexei Starovoitov wrote:
> But I think we have to step back. Why do we need this whole thing in the first place?
> _why_  sockopt bpf progs needs to read and write user memory?
> 
> Yes there is one page limit, but what is the use case to actually read and write
> beyond that? iptables sockopt was mentioned, but I don't think bpf prog can do
> anything useful with iptables binary blobs. They are hard enough for kernel to parse.

Usually the bpf prog is only interested in a very small number of optnames and 
has no need to read the optval at all in most cases. The max size for our use cases 
is 16 bytes. The kernel currently does kind of the opposite: it always 
assumes the bpf prog needs to use the optval, so it allocates kernel memory and 
does copy_from_user such that the non-sleepable bpf program can read/write it.

The bpf prog usually checks the optname and then just returns for most cases:

SEC("cgroup/getsockopt")
int get_internal_sockopt(struct bpf_sockopt *ctx)
{
	if (ctx->optname != MY_INTERNAL_SOCKOPT)
		return 1;

	/* change the ctx->optval and return to user space ... */
}

When the optlen is > PAGE_SIZE, the kernel only allocates PAGE_SIZE memory and 
copies the first PAGE_SIZE bytes of data from the user optval. We used to ask the bpf 
prog to explicitly set the optlen to 0 for the > PAGE_SIZE case even if it had not 
looked at the optval. Otherwise, the kernel would conclude that the bpf prog had set 
an invalid optlen (optlen larger than optval_end - optval) and incorrectly 
return -EFAULT to the end-user.

The bpf prog started doing this > PAGE_SIZE check and set optlen = 0 due to an 
internal kernel PAGE_SIZE limitation:

SEC("cgroup/getsockopt")
int get_internal_sockopt(struct bpf_sockopt *ctx)
{
	if (ctx->optname != MY_INTERNAL_SOCKOPT) {
		/* only do this for ctx->optlen > PAGE_SIZE.
		 * otherwise, the following cgroup bpf prog will
		 * not be able to use the optval that it may
		 * be interested in.
		 */
		if (ctx->optlen > PAGE_SIZE)
			ctx->optlen = 0;
		return 1;
	}

	/* change the ctx->optval and return to user space ... */
}

The above has been worked around in commit 29ebbba7d461 ("bpf: Don't EFAULT for 
{g,s}setsockopt with wrong optlen").

Later, it was reported that for one optname (NETLINK_LIST_MEMBERSHIPS) the 
kernel allows a user to pass a NULL optval and use the optlen returned by 
getsockopt to learn the buffer space required. The bpf prog then needs to remove 
the optlen > PAGE_SIZE check and set optlen to 0 for _all_ optnames that it is 
not interested in, while risking that the following cgroup prog may not be able 
to use some of the optval:

SEC("cgroup/getsockopt")
int get_internal_sockopt(struct bpf_sockopt *ctx)
{
	if (ctx->optname != MY_INTERNAL_SOCKOPT) {

		/* Do this for all optnames that you are not interested in.
		 * The latter cgroup bpf prog will not be able to use the optval.
		 */
		ctx->optlen = 0;
		return 1;
	}

	/* change the ctx->optval and return to user space ... */
}

The above case has been addressed in commit 00e74ae08638 ("bpf: Don't EFAULT for 
getsockopt with optval=NULL").

To avoid other potential optname cases that may run into a similar situation and 
require the bpf prog to work around it again like the above, we need a way to 
track whether a bpf prog has changed the optval at runtime. If it is not 
changed, use the result from the kernel set/getsockopt. If reading/writing to 
optval is done through a kfunc, this can be tracked. The kfunc can also directly 
read/write the user memory in optval, avoiding the pre-allocated kernel memory 
and the PAGE_SIZE limit, but this is a minor point.

* Re: [RFC bpf-next v3 4/5] bpf: Add a new dynptr type for CGRUP_SOCKOPT.
  2023-08-17 20:41     ` Martin KaFai Lau
@ 2023-08-17 21:37       ` Yonghong Song
  2023-08-17 22:56         ` Martin KaFai Lau
  2023-08-17 21:46       ` Alexei Starovoitov
  1 sibling, 1 reply; 28+ messages in thread
From: Yonghong Song @ 2023-08-17 21:37 UTC (permalink / raw)
  To: Martin KaFai Lau, Alexei Starovoitov
  Cc: bpf, ast, song, kernel-team, andrii, sdf, sinquersw, kuifeng,
	thinker.li



On 8/17/23 1:41 PM, Martin KaFai Lau wrote:
> On 8/16/23 6:25 PM, Alexei Starovoitov wrote:
>> But I think we have to step back. Why do we need this whole thing in 
>> the first place?
>> _why_  sockopt bpf progs needs to read and write user memory?
>>
>> Yes there is one page limit, but what is the use case to actually read 
>> and write
>> beyond that? iptables sockopt was mentioned, but I don't think bpf 
>> prog can do
>> anything useful with iptables binary blobs. They are hard enough for 
>> kernel to parse.
> 
> Usually the bpf prog is only interested in a very small number of 
> optnames and no need to read the optval at all for most cases. The max 
> size for our use cases is 16 bytes. The kernel currently is kind of 
> doing it the opposite and always assumes the bpf prog needing to use the 
> optval, allocate kernel memory and copy_from_user such that the 
> non-sleepable bpf program can read/write it.
> 
> The bpf prog usually checks the optname and then just returns for most 
> cases:
> 
> SEC("cgroup/getsockopt")
> int get_internal_sockopt(struct bpf_sockopt *ctx)
> {
>      if (ctx->optname != MY_INTERNAL_SOCKOPT)
>          return 1;
> 
>      /* change the ctx->optval and return to user space ... */
> }
> 
> When the optlen is > PAGE_SIZE, the kernel only allocates PAGE_SIZE 
> memory and copy the first PAGE_SIZE data from the user optval. We used 
> to ask the bpf prog to explicitly set the optlen to 0 for > PAGE_SIZE 
> case even it has not looked at the optval. Otherwise, the kernel used to 
> conclude that the bpf prog had set an invalid optlen because optlen is 
> larger than the optval_end - optval and returned -EFAULT incorrectly to 
> the end-user.
> 
> The bpf prog started doing this > PAGE_SIZE check and set optlen = 0 due 
> to an internal kernel PAGE_SIZE limitation:
> 
> SEC("cgroup/getsockopt")
> int get_internal_sockopt(struct bpf_sockopt *ctx)
> {
>      if (ctx->optname != MY_INTERNAL_SOCKOPT) {
>          /* only do that for ctx->optlen > PAGE_SIZE.
>           * otherwise, the following cgroup bpf prog will
>           * not be able to use the optval that it may
>           * be interested.
>           */
>          if (ctx->optlen > PAGE_SIZE)
>              ctx->optlen = 0;
>          return 1;
>      }
> 
>      /* change the ctx->optval and return to user space ... */
> }
> 
> The above has been worked around in commit 29ebbba7d461 ("bpf: Don't 
> EFAULT for {g,s}setsockopt with wrong optlen").
> 
> Later, it was reported that an optname (NETLINK_LIST_MEMBERSHIPS) that 
> the kernel allows a user passing NULL optval and using the optlen 
> returned by getsockopt to learn the buffer space required. The bpf prog 
> then needs to remove the optlen > PAGE_SIZE check and set optlen to 0 
> for _all_ optnames that it is not interested while risking the following 
> cgroup prog may not be able to use some of the optval:
> 
> SEC("cgroup/getsockopt")
> int get_internal_sockopt(struct bpf_sockopt *ctx)
> {
>      if (ctx->optname != MY_INTERNAL_SOCKOPT) {
> 
>          /* Do that for all optname that you are not interested.
>           * The latter cgroup bpf will not be able to use the optval.
>           */
>           ctx->optlen = 0;
>          return 1;
>      }
> 
>      /* chage the ctx->optval and return to user space ... */
> }
> 
> The above case has been addressed in commit 00e74ae08638 ("bpf: Don't 
> EFAULT for getsockopt with optval=NULL").
> 
> To avoid other potential optname cases that may run into similar 
> situation and requires the bpf prog work around it again like the above, 
> it needs a way to track whether a bpf prog has changed the optval in 
> runtime. If it is not changed, use the result from the kernel 

Can we add a field to the bpf_sockopt uapi struct so the bpf prog can set it
when optval is changed?

struct bpf_sockopt {
         __bpf_md_ptr(struct bpf_sock *, sk);
         __bpf_md_ptr(void *, optval);
         __bpf_md_ptr(void *, optval_end);

         __s32   level;
         __s32   optname;
         __s32   optlen;
         __s32   retval;
};

> set/getsockopt. If reading/writing to optval is done through a kfunc, 
> this can be tracked. The kfunc can also directly read/write the user 
> memory in optval, avoid the pre-alloc kernel memory and the PAGE_SIZE 
> limit but this is a minor point.

* Re: [RFC bpf-next v3 4/5] bpf: Add a new dynptr type for CGRUP_SOCKOPT.
  2023-08-17 20:41     ` Martin KaFai Lau
  2023-08-17 21:37       ` Yonghong Song
@ 2023-08-17 21:46       ` Alexei Starovoitov
  2023-08-17 22:45         ` Martin KaFai Lau
  1 sibling, 1 reply; 28+ messages in thread
From: Alexei Starovoitov @ 2023-08-17 21:46 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, Alexei Starovoitov, Song Liu, Kernel Team, Andrii Nakryiko,
	Stanislav Fomichev, Yonghong Song, Kui-Feng Lee, Kui-Feng Lee,
	Kui-Feng Lee

On Thu, Aug 17, 2023 at 1:41 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 8/16/23 6:25 PM, Alexei Starovoitov wrote:
> > But I think we have to step back. Why do we need this whole thing in the first place?
> > _why_  sockopt bpf progs needs to read and write user memory?
> >
> > Yes there is one page limit, but what is the use case to actually read and write
> > beyond that? iptables sockopt was mentioned, but I don't think bpf prog can do
> > anything useful with iptables binary blobs. They are hard enough for kernel to parse.
>
> Usually the bpf prog is only interested in a very small number of optnames and
> no need to read the optval at all for most cases. The max size for our use cases
> is 16 bytes. The kernel currently is kind of doing it the opposite and always
> assumes the bpf prog needing to use the optval, allocate kernel memory and
> copy_from_user such that the non-sleepable bpf program can read/write it.
>
> The bpf prog usually checks the optname and then just returns for most cases:
>
> SEC("cgroup/getsockopt")
> int get_internal_sockopt(struct bpf_sockopt *ctx)
> {
>         if (ctx->optname != MY_INTERNAL_SOCKOPT)
>                 return 1;
>
>         /* change the ctx->optval and return to user space ... */
> }
>
> When the optlen is > PAGE_SIZE, the kernel only allocates PAGE_SIZE memory and
> copy the first PAGE_SIZE data from the user optval. We used to ask the bpf prog
> to explicitly set the optlen to 0 for > PAGE_SIZE case even it has not looked at
> the optval. Otherwise, the kernel used to conclude that the bpf prog had set an
> invalid optlen because optlen is larger than the optval_end - optval and
> returned -EFAULT incorrectly to the end-user.
>
> The bpf prog started doing this > PAGE_SIZE check and set optlen = 0 due to an
> internal kernel PAGE_SIZE limitation:
>
> SEC("cgroup/getsockopt")
> int get_internal_sockopt(struct bpf_sockopt *ctx)
> {
>         if (ctx->optname != MY_INTERNAL_SOCKOPT) {
>                 /* only do that for ctx->optlen > PAGE_SIZE.
>                  * otherwise, the following cgroup bpf prog will
>                  * not be able to use the optval that it may
>                  * be interested.
>                  */
>                 if (ctx->optlen > PAGE_SIZE)
>                         ctx->optlen = 0;
>                 return 1;
>         }
>
>         /* change the ctx->optval and return to user space ... */
> }
>
> The above has been worked around in commit 29ebbba7d461 ("bpf: Don't EFAULT for
> {g,s}setsockopt with wrong optlen").
>
> Later, it was reported that an optname (NETLINK_LIST_MEMBERSHIPS) that the
> kernel allows a user passing NULL optval and using the optlen returned by
> getsockopt to learn the buffer space required. The bpf prog then needs to remove
> the optlen > PAGE_SIZE check and set optlen to 0 for _all_ optnames that it is
> not interested while risking the following cgroup prog may not be able to use
> some of the optval:
>
> SEC("cgroup/getsockopt")
> int get_internal_sockopt(struct bpf_sockopt *ctx)
> {
>         if (ctx->optname != MY_INTERNAL_SOCKOPT) {
>
>                 /* Do that for all optname that you are not interested.
>                  * The latter cgroup bpf will not be able to use the optval.
>                  */
>                 ctx->optlen = 0;
>                 return 1;
>         }
>
>         /* chage the ctx->optval and return to user space ... */
> }
>
> The above case has been addressed in commit 00e74ae08638 ("bpf: Don't EFAULT for
> getsockopt with optval=NULL").

Agree with all of the above.
The existing bpf sockopt interface was problematic,
but with these workarounds we fixed all known issues.

> To avoid other potential optname cases that may run into similar situation and
> requires the bpf prog work around it again like the above, it needs a way to
> track whether a bpf prog has changed the optval in runtime. If it is not
> changed, use the result from the kernel set/getsockopt. If reading/writing to
> optval is done through a kfunc, this can be tracked. The kfunc can also directly
> read/write the user memory in optval, avoid the pre-alloc kernel memory and the
> PAGE_SIZE limit but this is a minor point.

so I'm still not following how sleepable progs that can access everything
would help the existing situation.
I agree that sleepable bpf sockopt should be free from old mistakes,
but people might still write old-style non-sleepable bpf sockopt progs and
might repeat the same mistakes.
I'm missing the value of the new interface. It's better, sure, but why?
Do we expect to rewrite existing sockopt progs in a sleepable way?
It might not be easy due to sleepable limitations like maps and whatever else.

* Re: [RFC bpf-next v3 4/5] bpf: Add a new dynptr type for CGRUP_SOCKOPT.
  2023-08-17 21:46       ` Alexei Starovoitov
@ 2023-08-17 22:45         ` Martin KaFai Lau
  0 siblings, 0 replies; 28+ messages in thread
From: Martin KaFai Lau @ 2023-08-17 22:45 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Song Liu, Kernel Team, Andrii Nakryiko,
	Stanislav Fomichev, Yonghong Song, Kui-Feng Lee, Kui-Feng Lee,
	Kui-Feng Lee

On 8/17/23 2:46 PM, Alexei Starovoitov wrote:
>> To avoid other potential optname cases that may run into similar situation and
>> requires the bpf prog work around it again like the above, it needs a way to
>> track whether a bpf prog has changed the optval in runtime. If it is not
>> changed, use the result from the kernel set/getsockopt. If reading/writing to
>> optval is done through a kfunc, this can be tracked. The kfunc can also directly
>> read/write the user memory in optval, avoid the pre-alloc kernel memory and the
>> PAGE_SIZE limit but this is a minor point.
> so I'm still not following what sleepable progs that can access everything
> would help the existing situation.
> I agree that sleepable bpf sockopt should be free from old mistakes,
> but people might still write old-style non-sleeptable bpf sockopt and
> might repeat the same mistakes.
> I'm missing the value of the new interface. It's better, sure, but why?
> Do we expect to rewrite existing sockopt progs in sleepable way?
> It might not be easy due to sleepable limitations like maps and whatever else.

So far our sockopt progs only use sk local storage, which can support 
sleepable programs, so we can all move to the sleepable way to avoid any 
future quirks.

Agree that others may have sockopt progs with a hard dependency on being 
non-sleepable. If existing non-sleepable and sleepable progs are interleaved 
together, it would end up hitting similar issues.

Let's drop the idea of this set.

* Re: [RFC bpf-next v3 4/5] bpf: Add a new dynptr type for CGRUP_SOCKOPT.
  2023-08-17 21:37       ` Yonghong Song
@ 2023-08-17 22:56         ` Martin KaFai Lau
  0 siblings, 0 replies; 28+ messages in thread
From: Martin KaFai Lau @ 2023-08-17 22:56 UTC (permalink / raw)
  To: yonghong.song
  Cc: bpf, ast, song, kernel-team, andrii, sdf, sinquersw, kuifeng,
	thinker.li, Alexei Starovoitov

On 8/17/23 2:37 PM, Yonghong Song wrote:
> 
> 
> On 8/17/23 1:41 PM, Martin KaFai Lau wrote:
>> On 8/16/23 6:25 PM, Alexei Starovoitov wrote:
>>> But I think we have to step back. Why do we need this whole thing in the 
>>> first place?
>>> _why_  sockopt bpf progs needs to read and write user memory?
>>>
>>> Yes there is one page limit, but what is the use case to actually read and write
>>> beyond that? iptables sockopt was mentioned, but I don't think bpf prog can do
>>> anything useful with iptables binary blobs. They are hard enough for kernel 
>>> to parse.
>>
>> Usually the bpf prog is only interested in a very small number of optnames and 
>> no need to read the optval at all for most cases. The max size for our use 
>> cases is 16 bytes. The kernel currently is kind of doing it the opposite and 
>> always assumes the bpf prog needing to use the optval, allocate kernel memory 
>> and copy_from_user such that the non-sleepable bpf program can read/write it.
>>
>> The bpf prog usually checks the optname and then just returns for most cases:
>>
>> SEC("cgroup/getsockopt")
>> int get_internal_sockopt(struct bpf_sockopt *ctx)
>> {
>>      if (ctx->optname != MY_INTERNAL_SOCKOPT)
>>          return 1;
>>
>>      /* change the ctx->optval and return to user space ... */
>> }
>>
>> When the optlen is > PAGE_SIZE, the kernel only allocates PAGE_SIZE memory and 
>> copy the first PAGE_SIZE data from the user optval. We used to ask the bpf 
>> prog to explicitly set the optlen to 0 for > PAGE_SIZE case even it has not 
>> looked at the optval. Otherwise, the kernel used to conclude that the bpf prog 
>> had set an invalid optlen because optlen is larger than the optval_end - 
>> optval and returned -EFAULT incorrectly to the end-user.
>>
>> The bpf prog started doing this > PAGE_SIZE check and set optlen = 0 due to an 
>> internal kernel PAGE_SIZE limitation:
>>
>> SEC("cgroup/getsockopt")
>> int get_internal_sockopt(struct bpf_sockopt *ctx)
>> {
>>      if (ctx->optname != MY_INTERNAL_SOCKOPT) {
>>          /* only do that for ctx->optlen > PAGE_SIZE.
>>           * otherwise, the following cgroup bpf prog will
>>           * not be able to use the optval that it may
>>           * be interested.
>>           */
>>          if (ctx->optlen > PAGE_SIZE)
>>              ctx->optlen = 0;
>>          return 1;
>>      }
>>
>>      /* change the ctx->optval and return to user space ... */
>> }
>>
>> The above has been worked around in commit 29ebbba7d461 ("bpf: Don't EFAULT 
>> for {g,s}setsockopt with wrong optlen").
>>
>> Later, it was reported that an optname (NETLINK_LIST_MEMBERSHIPS) that the 
>> kernel allows a user passing NULL optval and using the optlen returned by 
>> getsockopt to learn the buffer space required. The bpf prog then needs to 
>> remove the optlen > PAGE_SIZE check and set optlen to 0 for _all_ optnames 
>> that it is not interested while risking the following cgroup prog may not be 
>> able to use some of the optval:
>>
>> SEC("cgroup/getsockopt")
>> int get_internal_sockopt(struct bpf_sockopt *ctx)
>> {
>>      if (ctx->optname != MY_INTERNAL_SOCKOPT) {
>>
>>          /* Do that for all optname that you are not interested.
>>           * The latter cgroup bpf will not be able to use the optval.
>>           */
>>           ctx->optlen = 0;
>>          return 1;
>>      }
>>
>>      /* chage the ctx->optval and return to user space ... */
>> }
>>
>> The above case has been addressed in commit 00e74ae08638 ("bpf: Don't EFAULT 
>> for getsockopt with optval=NULL").
>>
>> To avoid other potential optname cases that may run into similar situation and 
>> requires the bpf prog work around it again like the above, it needs a way to 
>> track whether a bpf prog has changed the optval in runtime. If it is not 
>> changed, use the result from the kernel 
> 
> Can we add a field in bpf_sockopt uapi struct so bpf_prog can set it
> if optval is changed?

This new interface should work. If there is an old existing prog staying with 
the old interface (didn't set this bool but changed the optval) in the cgroup 
prog array, it would probably end up not improving anything either?

Or the verifier could enforce setting this bool at runtime when writing to 
optval? I do not know how demanding the verifier change would be, and I am not 
sure if this would be overkill for the verifier.

> 
> struct bpf_sockopt {
>          __bpf_md_ptr(struct bpf_sock *, sk);
>          __bpf_md_ptr(void *, optval);
>          __bpf_md_ptr(void *, optval_end);
> 
>          __s32   level;
>          __s32   optname;
>          __s32   optlen;
>          __s32   retval;
> };
> 
>> set/getsockopt. If reading/writing to optval is done through a kfunc, this can 
>> be tracked. The kfunc can also directly read/write the user memory in optval, 
>> avoid the pre-alloc kernel memory and the PAGE_SIZE limit but this is a minor 
>> point.


* Re: [RFC bpf-next v3 4/5] bpf: Add a new dynptr type for CGRUP_SOCKOPT.
  2023-08-17 19:43       ` Alexei Starovoitov
@ 2023-08-18  0:14         ` Kui-Feng Lee
  0 siblings, 0 replies; 28+ messages in thread
From: Kui-Feng Lee @ 2023-08-18  0:14 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Kui-Feng Lee, bpf, Alexei Starovoitov, Martin KaFai Lau, Song Liu,
	Kernel Team, Andrii Nakryiko, Stanislav Fomichev, Yonghong Song,
	Kui-Feng Lee



On 8/17/23 12:43, Alexei Starovoitov wrote:
> On Thu, Aug 17, 2023 at 12:00 PM Kui-Feng Lee <sinquersw@gmail.com> wrote:
>>
>>
>>
>> On 8/16/23 18:25, Alexei Starovoitov wrote:
>>> On Tue, Aug 15, 2023 at 10:47:11AM -0700, thinker.li@gmail.com wrote:
>>>>
>>>> +BTF_SET8_START(cgroup_common_btf_ids)
>>>> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_copy_to, KF_SLEEPABLE)
>>>> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_alloc, KF_SLEEPABLE)
>>>> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_install, KF_SLEEPABLE)
>>>> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_release, KF_SLEEPABLE)
>>>> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_from, KF_SLEEPABLE)
>>>> +BTF_SET8_END(cgroup_common_btf_ids)
>>>
>>> These shouldn't be sockopt specific.
>>> If we want dynptr to represent a pointer to a user contiguous user memory
>>> we should use generic kfunc that do so.
>>>
>>> I suspect a single new kfunc: bpf_dynptr_from_user_mem() would do.
>>> New dynptr type can be hidden in the kernel and all existing
>>> kfuncs dynptr_slice, dynptr_data, dynptr_write could be made to work
>>> with user memory.
>>>
>>> But I think we have to step back. Why do we need this whole thing in the first place?
>>> _why_ sockopt bpf progs needs to read and write user memory?
>>>
>>> Yes there is one page limit, but what is the use case to actually read and write
>>> beyond that? iptables sockopt was mentioned, but I don't think bpf prog can do
>>> anything useful with iptables binary blobs. They are hard enough for kernel to parse.
>>
>> The ideal behind the design is let the developers of filters to decide
>> when to replace the existing buffer.  And, access the content of
>> buffers just like accessing raw pointers. However, seems almost everyone
>> love to use *_read() & *_write(). I will move to that direction.
> 
> That doesn't answer my question about the use case.
> What kind of filters want to parse more than 4k of sockopt data?

I don't have any existing use case for this. It is just something that
can be allowed to happen.

end of thread, other threads:[~2023-08-18  0:15 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-08-15 17:47 [RFC bpf-next v3 0/5] Sleepable BPF programs on cgroup {get,set}sockopt thinker.li
2023-08-15 17:47 ` [RFC bpf-next v3 1/5] bpf: enable sleepable BPF programs attached to cgroup/{get,set}sockopt thinker.li
2023-08-15 20:58   ` Stanislav Fomichev
2023-08-15 21:04     ` Kui-Feng Lee
2023-08-16  0:42   ` kernel test robot
2023-08-15 17:47 ` [RFC bpf-next v3 2/5] libbpf: add sleepable sections for {get,set}sockopt() thinker.li
2023-08-15 17:47 ` [RFC bpf-next v3 3/5] bpf: Prevent BPF programs from access the buffer pointed by user_optval thinker.li
2023-08-17  0:55   ` Martin KaFai Lau
2023-08-17 18:10     ` Kui-Feng Lee
2023-08-17  1:17   ` Alexei Starovoitov
2023-08-17 18:12     ` Kui-Feng Lee
2023-08-15 17:47 ` [RFC bpf-next v3 4/5] bpf: Add a new dynptr type for CGRUP_SOCKOPT thinker.li
2023-08-16 14:19   ` kernel test robot
2023-08-17  1:25   ` Alexei Starovoitov
2023-08-17 19:00     ` Kui-Feng Lee
2023-08-17 19:43       ` Alexei Starovoitov
2023-08-18  0:14         ` Kui-Feng Lee
2023-08-17 20:41     ` Martin KaFai Lau
2023-08-17 21:37       ` Yonghong Song
2023-08-17 22:56         ` Martin KaFai Lau
2023-08-17 21:46       ` Alexei Starovoitov
2023-08-17 22:45         ` Martin KaFai Lau
2023-08-15 17:47 ` [RFC bpf-next v3 5/5] selftests/bpf: Add test cases for sleepable BPF programs of the CGROUP_SOCKOPT type thinker.li
2023-08-15 20:57   ` Stanislav Fomichev
2023-08-15 23:37     ` Kui-Feng Lee
2023-08-16  0:03       ` Kui-Feng Lee
2023-08-17  1:13         ` Martin KaFai Lau
2023-08-17 18:36           ` Kui-Feng Lee
