* [PATCH bpf-next 0/7] Transit between BPF TCP congestion controls.
@ 2023-02-14 22:17 Kui-Feng Lee
2023-02-14 22:17 ` [PATCH bpf-next 1/7] bpf: Create links for BPF struct_ops maps Kui-Feng Lee
` (6 more replies)
0 siblings, 7 replies; 43+ messages in thread
From: Kui-Feng Lee @ 2023-02-14 22:17 UTC (permalink / raw)
To: bpf, ast, martin.lau, song, kernel-team, andrii; +Cc: Kui-Feng Lee
Major changes:
- Create bpf_links in the kernel for BPF struct_ops maps to register
and unregister them.
- Enable switching between implementations of a bpf-tcp-cc under the
same name instantly by replacing the backing struct_ops map of a
bpf_link.
Previously, a BPF struct_ops did not go away even when the user
program that created it terminated, because none of these maps were
ever pinned. For instance, the TCP congestion control subsystem
indirectly maintains a reference count on the struct_ops of any
registered BPF-implemented algorithm, so the algorithm stays active
until someone deliberately unregisters it. For consistency with other
BPF programs, bpf_links are now created to work in coordination with
struct_ops maps: the registration and unregistration of a map are
carried out at the creation and release of its bpf_link.
We also faced complications when attempting to replace an existing
TCP congestion control algorithm with a new implementation on the fly.
A struct_ops map registers a TCP congestion control algorithm under a
unique name. We had to either register the alternative implementation
under a new name and move connections over, or unregister the current
one before re-registering under the same name. To fix this, we add an
option to migrate the registration of the algorithm from struct_ops
maps to bpf_links. By modifying the backing map of a bpf_link, it
becomes possible to replace an existing TCP congestion control
algorithm in a single step.
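The on-the-fly replacement described above boils down to an atomic swap of an entry in a registry list keyed by name. Below is a userspace toy model of that operation (all `toy_*` names are invented for illustration; the kernel version additionally holds tcp_cong_list_lock and uses RCU-safe list helpers):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy registry keyed by name; toy_update() swaps an implementation in
 * place, mirroring the list_del_rcu()/list_add_tail_rcu() pair in
 * tcp_update_congestion_control(). */
struct toy_ops {
	const char *name;
	int version;
	struct toy_ops *next;
};

static struct toy_ops *registry;

static struct toy_ops *toy_find(const char *name)
{
	struct toy_ops *ops;

	for (ops = registry; ops; ops = ops->next)
		if (!strcmp(ops->name, name))
			return ops;
	return NULL;
}

static int toy_register(struct toy_ops *ops)
{
	if (toy_find(ops->name))
		return -1;	/* name already taken */
	ops->next = registry;
	registry = ops;
	return 0;
}

/* Replace old_ops with new_ops under the same name. */
static int toy_update(struct toy_ops *new_ops, struct toy_ops *old_ops)
{
	struct toy_ops **p;

	if (strcmp(new_ops->name, old_ops->name))
		return -1;	/* must keep the same name */
	for (p = &registry; *p; p = &(*p)->next) {
		if (*p == old_ops) {
			new_ops->next = old_ops->next;
			*p = new_ops;	/* the actual swap */
			return 0;
		}
	}
	return -1;		/* old one is not registered */
}
```

Existing users of the name keep whatever pointer they already resolved; only new lookups see the replacement, which matches the "new connections apply the new algorithm" behavior of patch 2.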
Kui-Feng Lee (7):
bpf: Create links for BPF struct_ops maps.
net: Update an existing TCP congestion control algorithm.
bpf: Register and unregister a struct_ops by their bpf_links.
libbpf: Create a bpf_link in bpf_map__attach_struct_ops().
bpf: Update the struct_ops of a bpf_link.
libbpf: Update a bpf_link with another struct_ops.
selftests/bpf: Test switching TCP Congestion Control algorithms.
include/linux/bpf.h | 8 +-
include/net/tcp.h | 2 +
include/uapi/linux/bpf.h | 15 +-
kernel/bpf/bpf_struct_ops.c | 176 +++++++++++++++++-
kernel/bpf/syscall.c | 60 +++++-
net/bpf/bpf_dummy_struct_ops.c | 6 +
net/ipv4/bpf_tcp_ca.c | 8 +-
net/ipv4/tcp_cong.c | 39 ++++
tools/include/uapi/linux/bpf.h | 7 +
tools/lib/bpf/bpf.c | 2 +
tools/lib/bpf/libbpf.c | 109 +++++++++--
tools/lib/bpf/libbpf.h | 1 +
tools/lib/bpf/libbpf.map | 1 +
.../selftests/bpf/prog_tests/bpf_tcp_ca.c | 48 +++++
.../selftests/bpf/progs/tcp_ca_update.c | 75 ++++++++
15 files changed, 526 insertions(+), 31 deletions(-)
create mode 100644 tools/testing/selftests/bpf/progs/tcp_ca_update.c
--
2.30.2
^ permalink raw reply [flat|nested] 43+ messages in thread
* [PATCH bpf-next 1/7] bpf: Create links for BPF struct_ops maps.
2023-02-14 22:17 [PATCH bpf-next 0/7] Transit between BPF TCP congestion controls Kui-Feng Lee
@ 2023-02-14 22:17 ` Kui-Feng Lee
2023-02-15 0:26 ` kernel test robot
` (2 more replies)
2023-02-14 22:17 ` [PATCH bpf-next 2/7] net: Update an existing TCP congestion control algorithm Kui-Feng Lee
` (5 subsequent siblings)
6 siblings, 3 replies; 43+ messages in thread
From: Kui-Feng Lee @ 2023-02-14 22:17 UTC (permalink / raw)
To: bpf, ast, martin.lau, song, kernel-team, andrii; +Cc: Kui-Feng Lee
BPF struct_ops maps are employed directly to register TCP congestion
control algorithms. Unlike other BPF programs, which terminate when
their links are gone, a struct_ops program drops its refcount only
when its FD is destroyed. Backing a BPF struct_ops map with a link
provides a uniform experience akin to other types of BPF programs. In
the following patches, a TCP congestion control algorithm will be
unregistered when the link's FD is destroyed.
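The union this patch adds to struct bpf_link means a link now points at either a program or a map, with link->type as the discriminator. A minimal userspace model of that shape (simplified `toy_*` types, invented for illustration):

```c
#include <assert.h>
#include <stddef.h>

enum toy_link_type { TOY_LINK_TYPE_PROG, TOY_LINK_TYPE_STRUCT_OPS };

struct toy_prog { int id; };
struct toy_map  { int id; };

/* Mirrors the union added to struct bpf_link: exactly one of the two
 * pointers is meaningful at a time, and link->type says which. */
struct toy_link {
	enum toy_link_type type;
	union {
		struct toy_prog *prog;
		struct toy_map *map;
	};
};

static int toy_link_target_id(const struct toy_link *link)
{
	return link->type == TOY_LINK_TYPE_STRUCT_OPS ? link->map->id
						      : link->prog->id;
}
```

This is why the syscall paths in this patch must test link->type before touching link->prog: for a struct_ops link the same storage holds a map pointer.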
Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
---
include/linux/bpf.h | 6 +++-
include/uapi/linux/bpf.h | 4 +++
kernel/bpf/bpf_struct_ops.c | 66 ++++++++++++++++++++++++++++++++++
kernel/bpf/syscall.c | 29 ++++++++++-----
tools/include/uapi/linux/bpf.h | 4 +++
tools/lib/bpf/bpf.c | 2 ++
tools/lib/bpf/libbpf.c | 1 +
7 files changed, 102 insertions(+), 10 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 8b5d0b4c4ada..13683584b071 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1391,7 +1391,11 @@ struct bpf_link {
u32 id;
enum bpf_link_type type;
const struct bpf_link_ops *ops;
- struct bpf_prog *prog;
+ union {
+ struct bpf_prog *prog;
+ /* Backed by a struct_ops (type == BPF_LINK_TYPE_STRUCT_OPS) */
+ struct bpf_map *map;
+ };
struct work_struct work;
};
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 17afd2b35ee5..1e6cdd0f355d 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1033,6 +1033,7 @@ enum bpf_attach_type {
BPF_PERF_EVENT,
BPF_TRACE_KPROBE_MULTI,
BPF_LSM_CGROUP,
+ BPF_STRUCT_OPS_MAP,
__MAX_BPF_ATTACH_TYPE
};
@@ -6354,6 +6355,9 @@ struct bpf_link_info {
struct {
__u32 ifindex;
} xdp;
+ struct {
+ __u32 map_id;
+ } struct_ops_map;
};
} __attribute__((aligned(8)));
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index ece9870cab68..621c8e24481a 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -698,3 +698,69 @@ void bpf_struct_ops_put(const void *kdata)
call_rcu(&st_map->rcu, bpf_struct_ops_put_rcu);
}
}
+
+static void bpf_struct_ops_map_link_release(struct bpf_link *link)
+{
+ if (link->map) {
+ bpf_map_put(link->map);
+ link->map = NULL;
+ }
+}
+
+static void bpf_struct_ops_map_link_dealloc(struct bpf_link *link)
+{
+ kfree(link);
+}
+
+static void bpf_struct_ops_map_link_show_fdinfo(const struct bpf_link *link,
+ struct seq_file *seq)
+{
+ seq_printf(seq, "map_id:\t%d\n",
+ link->map->id);
+}
+
+static int bpf_struct_ops_map_link_fill_link_info(const struct bpf_link *link,
+ struct bpf_link_info *info)
+{
+ info->struct_ops_map.map_id = link->map->id;
+ return 0;
+}
+
+static const struct bpf_link_ops bpf_struct_ops_map_lops = {
+ .release = bpf_struct_ops_map_link_release,
+ .dealloc = bpf_struct_ops_map_link_dealloc,
+ .show_fdinfo = bpf_struct_ops_map_link_show_fdinfo,
+ .fill_link_info = bpf_struct_ops_map_link_fill_link_info,
+};
+
+int link_create_struct_ops_map(union bpf_attr *attr, bpfptr_t uattr)
+{
+ struct bpf_link_primer link_primer;
+ struct bpf_map *map;
+ struct bpf_link *link = NULL;
+ int err;
+
+ map = bpf_map_get(attr->link_create.prog_fd);
+ if (IS_ERR(map))
+ return PTR_ERR(map);
+ if (map->map_type != BPF_MAP_TYPE_STRUCT_OPS)
+ return -EINVAL;
+
+ link = kzalloc(sizeof(*link), GFP_USER);
+ if (!link) {
+ err = -ENOMEM;
+ goto err_out;
+ }
+ bpf_link_init(link, BPF_LINK_TYPE_STRUCT_OPS, &bpf_struct_ops_map_lops, NULL);
+ link->map = map;
+
+ err = bpf_link_prime(link, &link_primer);
+ if (err)
+ goto err_out;
+
+ return bpf_link_settle(&link_primer);
+
+err_out:
+ bpf_map_put(map);
+ kfree(link);
+ return err;
+}
+
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index cda8d00f3762..54e172d8f5d1 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2738,7 +2738,9 @@ static void bpf_link_free(struct bpf_link *link)
if (link->prog) {
/* detach BPF program, clean up used resources */
link->ops->release(link);
- bpf_prog_put(link->prog);
+ if (link->type != BPF_LINK_TYPE_STRUCT_OPS)
+ bpf_prog_put(link->prog);
+ /* The struct_ops links clean up the map by themselves. */
}
/* free bpf_link and its containing memory */
link->ops->dealloc(link);
@@ -2794,16 +2796,19 @@ static void bpf_link_show_fdinfo(struct seq_file *m, struct file *filp)
const struct bpf_prog *prog = link->prog;
char prog_tag[sizeof(prog->tag) * 2 + 1] = { };
- bin2hex(prog_tag, prog->tag, sizeof(prog->tag));
seq_printf(m,
"link_type:\t%s\n"
- "link_id:\t%u\n"
- "prog_tag:\t%s\n"
- "prog_id:\t%u\n",
+ "link_id:\t%u\n",
bpf_link_type_strs[link->type],
- link->id,
- prog_tag,
- prog->aux->id);
+ link->id);
+ if (link->type != BPF_LINK_TYPE_STRUCT_OPS) {
+ bin2hex(prog_tag, prog->tag, sizeof(prog->tag));
+ seq_printf(m,
+ "prog_tag:\t%s\n"
+ "prog_id:\t%u\n",
+ prog_tag,
+ prog->aux->id);
+ }
if (link->ops->show_fdinfo)
link->ops->show_fdinfo(link, m);
}
@@ -4278,7 +4283,8 @@ static int bpf_link_get_info_by_fd(struct file *file,
info.type = link->type;
info.id = link->id;
- info.prog_id = link->prog->aux->id;
+ if (link->type != BPF_LINK_TYPE_STRUCT_OPS)
+ info.prog_id = link->prog->aux->id;
if (link->ops->fill_link_info) {
err = link->ops->fill_link_info(link, &info);
@@ -4531,6 +4537,8 @@ static int bpf_map_do_batch(const union bpf_attr *attr,
return err;
}
+extern int link_create_struct_ops_map(union bpf_attr *attr, bpfptr_t uattr);
+
#define BPF_LINK_CREATE_LAST_FIELD link_create.kprobe_multi.cookies
static int link_create(union bpf_attr *attr, bpfptr_t uattr)
{
@@ -4541,6 +4549,9 @@ static int link_create(union bpf_attr *attr, bpfptr_t uattr)
if (CHECK_ATTR(BPF_LINK_CREATE))
return -EINVAL;
+ if (attr->link_create.attach_type == BPF_STRUCT_OPS_MAP)
+ return link_create_struct_ops_map(attr, uattr);
+
prog = bpf_prog_get(attr->link_create.prog_fd);
if (IS_ERR(prog))
return PTR_ERR(prog);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 17afd2b35ee5..1e6cdd0f355d 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1033,6 +1033,7 @@ enum bpf_attach_type {
BPF_PERF_EVENT,
BPF_TRACE_KPROBE_MULTI,
BPF_LSM_CGROUP,
+ BPF_STRUCT_OPS_MAP,
__MAX_BPF_ATTACH_TYPE
};
@@ -6354,6 +6355,9 @@ struct bpf_link_info {
struct {
__u32 ifindex;
} xdp;
+ struct {
+ __u32 map_id;
+ } struct_ops_map;
};
} __attribute__((aligned(8)));
diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 9aff98f42a3d..e44d49f17c86 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -731,6 +731,8 @@ int bpf_link_create(int prog_fd, int target_fd,
if (!OPTS_ZEROED(opts, tracing))
return libbpf_err(-EINVAL);
break;
+ case BPF_STRUCT_OPS_MAP:
+ break;
default:
if (!OPTS_ZEROED(opts, flags))
return libbpf_err(-EINVAL);
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 35a698eb825d..75ed95b7e455 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -115,6 +115,7 @@ static const char * const attach_type_name[] = {
[BPF_SK_REUSEPORT_SELECT_OR_MIGRATE] = "sk_reuseport_select_or_migrate",
[BPF_PERF_EVENT] = "perf_event",
[BPF_TRACE_KPROBE_MULTI] = "trace_kprobe_multi",
+ [BPF_STRUCT_OPS_MAP] = "struct_ops_map",
};
static const char * const link_type_name[] = {
--
2.30.2
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH bpf-next 2/7] net: Update an existing TCP congestion control algorithm.
2023-02-14 22:17 [PATCH bpf-next 0/7] Transit between BPF TCP congestion controls Kui-Feng Lee
2023-02-14 22:17 ` [PATCH bpf-next 1/7] bpf: Create links for BPF struct_ops maps Kui-Feng Lee
@ 2023-02-14 22:17 ` Kui-Feng Lee
2023-02-15 2:43 ` Stanislav Fomichev
2023-02-14 22:17 ` [PATCH bpf-next 3/7] bpf: Register and unregister a struct_ops by their bpf_links Kui-Feng Lee
` (4 subsequent siblings)
6 siblings, 1 reply; 43+ messages in thread
From: Kui-Feng Lee @ 2023-02-14 22:17 UTC (permalink / raw)
To: bpf, ast, martin.lau, song, kernel-team, andrii; +Cc: Kui-Feng Lee
Add tcp_update_congestion_control() to immediately replace a
congestion control algorithm with another implementation of the same
name. Once a name is updated, new connections will use the new
algorithm.
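Before swapping anything, tcp_update_congestion_control() validates that the replacement implements the mandatory callbacks. A standalone sketch of just that check (toy struct with no-arg function pointers standing in for the real tcp_congestion_ops signatures):

```c
#include <assert.h>
#include <stddef.h>

/* Simplified view of the precondition in tcp_update_congestion_control():
 * every algorithm must provide ssthresh and undo_cwnd, plus at least one
 * of cong_avoid/cong_control. */
struct toy_cong_ops {
	void (*ssthresh)(void);
	void (*undo_cwnd)(void);
	void (*cong_avoid)(void);
	void (*cong_control)(void);
};

static void toy_stub(void) {}

static int toy_ops_valid(const struct toy_cong_ops *ca)
{
	return ca->ssthresh && ca->undo_cwnd &&
	       (ca->cong_avoid || ca->cong_control);
}
```

The same check already gates tcp_register_congestion_control(), so an update can never downgrade a registered name to an incomplete implementation.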
Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
---
include/linux/bpf.h | 1 +
include/net/tcp.h | 2 ++
net/bpf/bpf_dummy_struct_ops.c | 6 ++++++
net/ipv4/bpf_tcp_ca.c | 6 ++++++
net/ipv4/tcp_cong.c | 39 ++++++++++++++++++++++++++++++++++
5 files changed, 54 insertions(+)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 13683584b071..5fe39f56a760 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1450,6 +1450,7 @@ struct bpf_struct_ops {
void *kdata, const void *udata);
int (*reg)(void *kdata);
void (*unreg)(void *kdata);
+ int (*update)(void *kdata, void *old_kdata);
const struct btf_type *type;
const struct btf_type *value_type;
const char *name;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index db9f828e9d1e..239cc0e2639c 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1117,6 +1117,8 @@ struct tcp_congestion_ops {
int tcp_register_congestion_control(struct tcp_congestion_ops *type);
void tcp_unregister_congestion_control(struct tcp_congestion_ops *type);
+int tcp_update_congestion_control(struct tcp_congestion_ops *type,
+ struct tcp_congestion_ops *old_type);
void tcp_assign_congestion_control(struct sock *sk);
void tcp_init_congestion_control(struct sock *sk);
diff --git a/net/bpf/bpf_dummy_struct_ops.c b/net/bpf/bpf_dummy_struct_ops.c
index ff4f89a2b02a..158f14e240d0 100644
--- a/net/bpf/bpf_dummy_struct_ops.c
+++ b/net/bpf/bpf_dummy_struct_ops.c
@@ -222,12 +222,18 @@ static void bpf_dummy_unreg(void *kdata)
{
}
+static int bpf_dummy_update(void *kdata, void *old_kdata)
+{
+ return -EOPNOTSUPP;
+}
+
struct bpf_struct_ops bpf_bpf_dummy_ops = {
.verifier_ops = &bpf_dummy_verifier_ops,
.init = bpf_dummy_init,
.check_member = bpf_dummy_ops_check_member,
.init_member = bpf_dummy_init_member,
.reg = bpf_dummy_reg,
+ .update = bpf_dummy_update,
.unreg = bpf_dummy_unreg,
.name = "bpf_dummy_ops",
};
diff --git a/net/ipv4/bpf_tcp_ca.c b/net/ipv4/bpf_tcp_ca.c
index 13fc0c185cd9..66ce5fadfe42 100644
--- a/net/ipv4/bpf_tcp_ca.c
+++ b/net/ipv4/bpf_tcp_ca.c
@@ -266,10 +266,16 @@ static void bpf_tcp_ca_unreg(void *kdata)
tcp_unregister_congestion_control(kdata);
}
+static int bpf_tcp_ca_update(void *kdata, void *old_kdata)
+{
+ return tcp_update_congestion_control(kdata, old_kdata);
+}
+
struct bpf_struct_ops bpf_tcp_congestion_ops = {
.verifier_ops = &bpf_tcp_ca_verifier_ops,
.reg = bpf_tcp_ca_reg,
.unreg = bpf_tcp_ca_unreg,
+ .update = bpf_tcp_ca_update,
.check_member = bpf_tcp_ca_check_member,
.init_member = bpf_tcp_ca_init_member,
.init = bpf_tcp_ca_init,
diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index db8b4b488c31..22fd7c12360e 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -130,6 +130,45 @@ void tcp_unregister_congestion_control(struct tcp_congestion_ops *ca)
}
EXPORT_SYMBOL_GPL(tcp_unregister_congestion_control);
+/* Replace a registered old ca with a new one.
+ *
+ * The new ca must have the same name as the old, already
+ * registered, one.
+ */
+int tcp_update_congestion_control(struct tcp_congestion_ops *ca, struct tcp_congestion_ops *old_ca)
+{
+ struct tcp_congestion_ops *existing;
+ int ret = 0;
+
+ /* all algorithms must implement these */
+ if (!ca->ssthresh || !ca->undo_cwnd ||
+ !(ca->cong_avoid || ca->cong_control)) {
+ pr_err("%s does not implement required ops\n", old_ca->name);
+ return -EINVAL;
+ }
+
+ ca->key = jhash(ca->name, sizeof(ca->name), strlen(ca->name));
+
+ spin_lock(&tcp_cong_list_lock);
+ existing = tcp_ca_find_key(ca->key);
+ if (ca->key == TCP_CA_UNSPEC || !existing || strcmp(existing->name, ca->name)) {
+ pr_notice("%s not registered or non-unique key\n",
+ ca->name);
+ ret = -EINVAL;
+ } else if (existing != old_ca) {
+ pr_notice("invalid old congestion control algorithm to replace\n");
+ ret = -EINVAL;
+ } else {
+ list_del_rcu(&existing->list);
+ list_add_tail_rcu(&ca->list, &tcp_cong_list);
+ pr_debug("%s updated\n", ca->name);
+ }
+ spin_unlock(&tcp_cong_list_lock);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(tcp_update_congestion_control);
+
u32 tcp_ca_get_key_by_name(struct net *net, const char *name, bool *ecn_ca)
{
const struct tcp_congestion_ops *ca;
--
2.30.2
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH bpf-next 3/7] bpf: Register and unregister a struct_ops by their bpf_links.
2023-02-14 22:17 [PATCH bpf-next 0/7] Transit between BPF TCP congestion controls Kui-Feng Lee
2023-02-14 22:17 ` [PATCH bpf-next 1/7] bpf: Create links for BPF struct_ops maps Kui-Feng Lee
2023-02-14 22:17 ` [PATCH bpf-next 2/7] net: Update an existing TCP congestion control algorithm Kui-Feng Lee
@ 2023-02-14 22:17 ` Kui-Feng Lee
2023-02-15 2:53 ` Stanislav Fomichev
2023-02-16 0:37 ` Martin KaFai Lau
2023-02-14 22:17 ` [PATCH bpf-next 4/7] libbpf: Create a bpf_link in bpf_map__attach_struct_ops() Kui-Feng Lee
` (3 subsequent siblings)
6 siblings, 2 replies; 43+ messages in thread
From: Kui-Feng Lee @ 2023-02-14 22:17 UTC (permalink / raw)
To: bpf, ast, martin.lau, song, kernel-team, andrii; +Cc: Kui-Feng Lee
Registration via bpf_links ensures uniform behavior with other BPF
programs. Previously, BPF struct_ops were registered and unregistered
when their values were updated and deleted. Now, only struct_ops maps
created with the BPF_F_LINK flag are allowed to back a bpf_link, and
registration happens when the link is created.
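The checks in link_create_struct_ops_map() form a small state machine: the value must already be updated (INUSE) and must not back another link (refcnt == 0) before a link may take ownership. A userspace toy model of that gate (`toy_*` names invented; kernel error codes noted in comments):

```c
#include <assert.h>

enum toy_state { TOY_STATE_INIT, TOY_STATE_INUSE };

/* Model of the kvalue checks when creating a struct_ops link: the value
 * must already be INUSE (updated) and not yet owned by a link. */
struct toy_kvalue {
	enum toy_state state;
	int refcnt;
};

static int toy_link_take(struct toy_kvalue *kv)
{
	if (kv->state != TOY_STATE_INUSE || kv->refcnt != 0)
		return -1;	/* -EINVAL in the kernel */
	kv->refcnt = 1;		/* the link now owns the registration */
	return 0;
}

static void toy_link_release(struct toy_kvalue *kv)
{
	kv->refcnt = 0;		/* kernel also calls ->unreg() here */
}
```

This is why the update path in this patch rejects values with a nonzero refcnt: at most one link can drive a given struct_ops registration at a time.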
Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
---
include/uapi/linux/bpf.h | 3 ++
kernel/bpf/bpf_struct_ops.c | 59 +++++++++++++++++++++++++++++++---
tools/include/uapi/linux/bpf.h | 3 ++
3 files changed, 61 insertions(+), 4 deletions(-)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 1e6cdd0f355d..48d8b3058aa1 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1267,6 +1267,9 @@ enum {
/* Create a map that is suitable to be an inner map with dynamic max entries */
BPF_F_INNER_MAP = (1U << 12),
+
+/* Create a map that will be registered/unregistered by the backing bpf_link */
+ BPF_F_LINK = (1U << 13),
};
/* Flags for BPF_PROG_QUERY. */
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index 621c8e24481a..d16ca06cf09a 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -390,7 +390,7 @@ static int bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
mutex_lock(&st_map->lock);
- if (kvalue->state != BPF_STRUCT_OPS_STATE_INIT) {
+ if (kvalue->state != BPF_STRUCT_OPS_STATE_INIT || refcount_read(&kvalue->refcnt)) {
err = -EBUSY;
goto unlock;
}
@@ -491,6 +491,12 @@ static int bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
*(unsigned long *)(udata + moff) = prog->aux->id;
}
+ if (st_map->map.map_flags & BPF_F_LINK) {
+ /* Let bpf_link handle registration & unregistration. */
+ smp_store_release(&kvalue->state, BPF_STRUCT_OPS_STATE_INUSE);
+ goto unlock;
+ }
+
refcount_set(&kvalue->refcnt, 1);
bpf_map_inc(map);
@@ -522,6 +528,7 @@ static int bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
kfree(tlinks);
mutex_unlock(&st_map->lock);
return err;
+
}
static int bpf_struct_ops_map_delete_elem(struct bpf_map *map, void *key)
@@ -535,6 +542,8 @@ static int bpf_struct_ops_map_delete_elem(struct bpf_map *map, void *key)
BPF_STRUCT_OPS_STATE_TOBEFREE);
switch (prev_state) {
case BPF_STRUCT_OPS_STATE_INUSE:
+ if (st_map->map.map_flags & BPF_F_LINK)
+ return 0;
st_map->st_ops->unreg(&st_map->kvalue.data);
if (refcount_dec_and_test(&st_map->kvalue.refcnt))
bpf_map_put(map);
@@ -585,7 +594,7 @@ static void bpf_struct_ops_map_free(struct bpf_map *map)
static int bpf_struct_ops_map_alloc_check(union bpf_attr *attr)
{
if (attr->key_size != sizeof(unsigned int) || attr->max_entries != 1 ||
- attr->map_flags || !attr->btf_vmlinux_value_type_id)
+ (attr->map_flags & ~BPF_F_LINK) || !attr->btf_vmlinux_value_type_id)
return -EINVAL;
return 0;
}
@@ -638,6 +647,8 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr)
set_vm_flush_reset_perms(st_map->image);
bpf_map_init_from_attr(map, attr);
+ map->map_flags |= attr->map_flags & BPF_F_LINK;
+
return map;
}
@@ -699,10 +710,25 @@ void bpf_struct_ops_put(const void *kdata)
}
}
+static void bpf_struct_ops_kvalue_put(struct bpf_struct_ops_value *kvalue)
+{
+ struct bpf_struct_ops_map *st_map;
+
+ if (refcount_dec_and_test(&kvalue->refcnt)) {
+ st_map = container_of(kvalue, struct bpf_struct_ops_map,
+ kvalue);
+ bpf_map_put(&st_map->map);
+ }
+}
+
static void bpf_struct_ops_map_link_release(struct bpf_link *link)
{
+ struct bpf_struct_ops_map *st_map;
+
if (link->map) {
- bpf_map_put(link->map);
+ st_map = (struct bpf_struct_ops_map *)link->map;
+ st_map->st_ops->unreg(&st_map->kvalue.data);
+ bpf_struct_ops_kvalue_put(&st_map->kvalue);
link->map = NULL;
}
}
@@ -735,13 +761,15 @@ static const struct bpf_link_ops bpf_struct_ops_map_lops = {
int link_create_struct_ops_map(union bpf_attr *attr, bpfptr_t uattr)
{
+ struct bpf_struct_ops_map *st_map;
struct bpf_link_primer link_primer;
+ struct bpf_struct_ops_value *kvalue;
struct bpf_map *map;
struct bpf_link *link = NULL;
int err;
map = bpf_map_get(attr->link_create.prog_fd);
- if (map->map_type != BPF_MAP_TYPE_STRUCT_OPS)
+ if (map->map_type != BPF_MAP_TYPE_STRUCT_OPS || !(map->map_flags & BPF_F_LINK))
return -EINVAL;
link = kzalloc(sizeof(*link), GFP_USER);
@@ -752,6 +780,29 @@ int link_create_struct_ops_map(union bpf_attr *attr, bpfptr_t uattr)
bpf_link_init(link, BPF_LINK_TYPE_STRUCT_OPS, &bpf_struct_ops_map_lops, NULL);
link->map = map;
+ if (map->map_flags & BPF_F_LINK) {
+ st_map = (struct bpf_struct_ops_map *)map;
+ kvalue = (struct bpf_struct_ops_value *)&st_map->kvalue;
+
+ if (kvalue->state != BPF_STRUCT_OPS_STATE_INUSE ||
+ refcount_read(&kvalue->refcnt) != 0) {
+ err = -EINVAL;
+ goto err_out;
+ }
+
+ refcount_set(&kvalue->refcnt, 1);
+
+ set_memory_rox((long)st_map->image, 1);
+ err = st_map->st_ops->reg(kvalue->data);
+ if (err) {
+ refcount_set(&kvalue->refcnt, 0);
+
+ set_memory_nx((long)st_map->image, 1);
+ set_memory_rw((long)st_map->image, 1);
+ goto err_out;
+ }
+ }
+
err = bpf_link_prime(link, &link_primer);
if (err)
goto err_out;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 1e6cdd0f355d..48d8b3058aa1 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1267,6 +1267,9 @@ enum {
/* Create a map that is suitable to be an inner map with dynamic max entries */
BPF_F_INNER_MAP = (1U << 12),
+
+/* Create a map that will be registered/unregistered by the backing bpf_link */
+ BPF_F_LINK = (1U << 13),
};
/* Flags for BPF_PROG_QUERY. */
--
2.30.2
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH bpf-next 4/7] libbpf: Create a bpf_link in bpf_map__attach_struct_ops().
2023-02-14 22:17 [PATCH bpf-next 0/7] Transit between BPF TCP congestion controls Kui-Feng Lee
` (2 preceding siblings ...)
2023-02-14 22:17 ` [PATCH bpf-next 3/7] bpf: Register and unregister a struct_ops by their bpf_links Kui-Feng Lee
@ 2023-02-14 22:17 ` Kui-Feng Lee
2023-02-15 2:58 ` Stanislav Fomichev
2023-02-16 22:40 ` Andrii Nakryiko
2023-02-14 22:17 ` [PATCH bpf-next 5/7] bpf: Update the struct_ops of a bpf_link Kui-Feng Lee
` (2 subsequent siblings)
6 siblings, 2 replies; 43+ messages in thread
From: Kui-Feng Lee @ 2023-02-14 22:17 UTC (permalink / raw)
To: bpf, ast, martin.lau, song, kernel-team, andrii; +Cc: Kui-Feng Lee
bpf_map__attach_struct_ops() used to create a dummy bpf_link as a
placeholder; now it constructs a real one by calling bpf_link_create()
if the map has the BPF_F_LINK flag. A struct_ops map can be flagged
with BPF_F_LINK by calling bpf_map__set_map_flags().
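The attach path therefore produces one of two link shapes: a "fake" link whose fd is really the map fd (legacy behavior), or a real link fd from bpf_link_create() with map_fd kept alongside. A userspace toy model of just that branching (`toy_*` names invented; field names follow the patch):

```c
#include <assert.h>

/* Model of the two shapes bpf_map__attach_struct_ops() can return:
 * a "fake" link that only remembers the map fd, or a real one whose fd
 * came from bpf_link_create(); map_fd == -1 marks the fake case. */
struct toy_link_struct_ops_map {
	int link_fd;
	int map_fd;
};

static void toy_attach(struct toy_link_struct_ops_map *l, int map_fd,
		       int has_bpf_f_link, int real_link_fd)
{
	if (!has_bpf_f_link) {
		l->link_fd = map_fd;	/* fake link: fd is the map itself */
		l->map_fd = -1;
	} else {
		l->link_fd = real_link_fd;
		l->map_fd = map_fd;
	}
}

static int toy_is_fake(const struct toy_link_struct_ops_map *l)
{
	return l->map_fd < 0;
}
```

Detach then keys off the same marker, which is exactly what the new bpf_link__detach_struct_ops() does with st_link->map_fd.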
Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
---
tools/lib/bpf/libbpf.c | 73 +++++++++++++++++++++++++++++++++---------
1 file changed, 58 insertions(+), 15 deletions(-)
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 75ed95b7e455..1eff6a03ddd9 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -11430,29 +11430,41 @@ struct bpf_link *bpf_program__attach(const struct bpf_program *prog)
return link;
}
+struct bpf_link_struct_ops_map {
+ struct bpf_link link;
+ int map_fd;
+};
+
static int bpf_link__detach_struct_ops(struct bpf_link *link)
{
+ struct bpf_link_struct_ops_map *st_link;
__u32 zero = 0;
- if (bpf_map_delete_elem(link->fd, &zero))
+ st_link = container_of(link, struct bpf_link_struct_ops_map, link);
+
+ if (st_link->map_fd < 0) {
+ /* Fake bpf_link */
+ if (bpf_map_delete_elem(link->fd, &zero))
+ return -errno;
+ return 0;
+ }
+
+ if (bpf_map_delete_elem(st_link->map_fd, &zero))
+ return -errno;
+
+ if (close(link->fd))
return -errno;
return 0;
}
-struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
+/*
+ * Update the map with the prepared vdata.
+ */
+static int bpf_map__update_vdata(const struct bpf_map *map)
{
struct bpf_struct_ops *st_ops;
- struct bpf_link *link;
__u32 i, zero = 0;
- int err;
-
- if (!bpf_map__is_struct_ops(map) || map->fd == -1)
- return libbpf_err_ptr(-EINVAL);
-
- link = calloc(1, sizeof(*link));
- if (!link)
- return libbpf_err_ptr(-EINVAL);
st_ops = map->st_ops;
for (i = 0; i < btf_vlen(st_ops->type); i++) {
@@ -11468,17 +11480,48 @@ struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
*(unsigned long *)kern_data = prog_fd;
}
- err = bpf_map_update_elem(map->fd, &zero, st_ops->kern_vdata, 0);
+ return bpf_map_update_elem(map->fd, &zero, st_ops->kern_vdata, 0);
+}
+
+struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
+{
+ struct bpf_link_struct_ops_map *link;
+ int err, fd;
+
+ if (!bpf_map__is_struct_ops(map) || map->fd == -1)
+ return libbpf_err_ptr(-EINVAL);
+
+ link = calloc(1, sizeof(*link));
+ if (!link)
+ return libbpf_err_ptr(-ENOMEM);
+
+ err = bpf_map__update_vdata(map);
if (err) {
err = -errno;
free(link);
return libbpf_err_ptr(err);
}
- link->detach = bpf_link__detach_struct_ops;
- link->fd = map->fd;
+ link->link.detach = bpf_link__detach_struct_ops;
- return link;
+ if (!(map->def.map_flags & BPF_F_LINK)) {
+ /* Fake bpf_link */
+ link->link.fd = map->fd;
+ link->map_fd = -1;
+ return &link->link;
+ }
+
+ fd = bpf_link_create(map->fd, -1, BPF_STRUCT_OPS_MAP, NULL);
+ if (fd < 0) {
+ err = -errno;
+ free(link);
+ return libbpf_err_ptr(err);
+ }
+
+ link->link.fd = fd;
+ link->map_fd = map->fd;
+
+ return &link->link;
}
typedef enum bpf_perf_event_ret (*bpf_perf_event_print_t)(struct perf_event_header *hdr,
--
2.30.2
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH bpf-next 5/7] bpf: Update the struct_ops of a bpf_link.
2023-02-14 22:17 [PATCH bpf-next 0/7] Transit between BPF TCP congestion controls Kui-Feng Lee
` (3 preceding siblings ...)
2023-02-14 22:17 ` [PATCH bpf-next 4/7] libbpf: Create a bpf_link in bpf_map__attach_struct_ops() Kui-Feng Lee
@ 2023-02-14 22:17 ` Kui-Feng Lee
2023-02-16 1:02 ` Martin KaFai Lau
2023-02-14 22:17 ` [PATCH bpf-next 6/7] libbpf: Update a bpf_link with another struct_ops Kui-Feng Lee
2023-02-14 22:17 ` [PATCH bpf-next 7/7] selftests/bpf: Test switching TCP Congestion Control algorithms Kui-Feng Lee
6 siblings, 1 reply; 43+ messages in thread
From: Kui-Feng Lee @ 2023-02-14 22:17 UTC (permalink / raw)
To: bpf, ast, martin.lau, song, kernel-team, andrii; +Cc: Kui-Feng Lee
Extend the BPF_LINK_UPDATE command of bpf() to allow switching between
different struct_ops on a single bpf_link, enabling a smooth
transition from one struct_ops to another.
The struct_ops map passed along with BPF_LINK_UPDATE must have the
BPF_F_LINK flag.
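The semantics of the new update path are: updating with the same map is a no-op, a map of a different st_ops type is rejected, and otherwise the backing map is swapped. A userspace toy model of that decision logic (`toy_*` names invented; the kernel additionally moves refcounts and calls ->update()):

```c
#include <assert.h>
#include <string.h>

/* Model of bpf_struct_ops_map_link_update(): same map is a no-op, a map
 * of a different st_ops type is -EINVAL, otherwise swap the backing map. */
struct toy_map { const char *st_ops_type; };

struct toy_link { struct toy_map *map; };

static int toy_link_update(struct toy_link *link, struct toy_map *new_map)
{
	if (link->map == new_map)
		return 0;	/* nothing to do */
	if (strcmp(link->map->st_ops_type, new_map->st_ops_type))
		return -1;	/* -EINVAL: type mismatch */
	link->map = new_map;	/* kernel also transfers refcounts here */
	return 0;
}
```

Because the swap is observed through the link, a TCP congestion control algorithm registered under a name can be replaced without a window where the name is unregistered.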
Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
---
include/linux/bpf.h | 1 +
include/uapi/linux/bpf.h | 8 ++++--
kernel/bpf/bpf_struct_ops.c | 55 +++++++++++++++++++++++++++++++++++++
kernel/bpf/syscall.c | 31 +++++++++++++++++++++
net/ipv4/bpf_tcp_ca.c | 2 --
5 files changed, 93 insertions(+), 4 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 5fe39f56a760..03a15dc26d7a 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1408,6 +1408,7 @@ struct bpf_link_ops {
void (*show_fdinfo)(const struct bpf_link *link, struct seq_file *seq);
int (*fill_link_info)(const struct bpf_link *link,
struct bpf_link_info *info);
+ int (*update_struct_ops)(struct bpf_link *link, struct bpf_map *new_map);
};
struct bpf_tramp_link {
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 48d8b3058aa1..7c009ac859c8 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1552,8 +1552,12 @@ union bpf_attr {
struct { /* struct used by BPF_LINK_UPDATE command */
__u32 link_fd; /* link fd */
- /* new program fd to update link with */
- __u32 new_prog_fd;
+ union {
+ /* new program fd to update link with */
+ __u32 new_prog_fd;
+ /* new struct_ops map fd to update link with */
+ __u32 new_map_fd;
+ };
__u32 flags; /* extra flags */
/* expected link's program fd; is specified only if
* BPF_F_REPLACE flag is set in flags */
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index d16ca06cf09a..d329621fc721 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -752,11 +752,66 @@ static int bpf_struct_ops_map_link_fill_link_info(const struct bpf_link *link,
return 0;
}
+static int bpf_struct_ops_map_link_update(struct bpf_link *link, struct bpf_map *new_map)
+{
+ struct bpf_struct_ops_value *kvalue;
+ struct bpf_struct_ops_map *st_map, *old_st_map;
+ struct bpf_map *old_map;
+ int err;
+
+ if (new_map->map_type != BPF_MAP_TYPE_STRUCT_OPS || !(new_map->map_flags & BPF_F_LINK))
+ return -EINVAL;
+
+ old_map = link->map;
+
+ /* It is a no-op if the new map is the same as the old one. A
+ * struct_ops map that backs a bpf_link can not be updated in
+ * place, since that would modify its kvalue and cause
+ * inconsistencies.
+ */
+ if (old_map == new_map)
+ return 0;
+
+ /* The new and old struct_ops must be the same type. */
+ st_map = (struct bpf_struct_ops_map *)new_map;
+ old_st_map = (struct bpf_struct_ops_map *)old_map;
+ if (st_map->st_ops != old_st_map->st_ops)
+ return -EINVAL;
+
+ /* Assure the struct_ops is updated (has value) and not
+ * backing any other link.
+ */
+ kvalue = &st_map->kvalue;
+ if (kvalue->state != BPF_STRUCT_OPS_STATE_INUSE ||
+ refcount_read(&kvalue->refcnt) != 0)
+ return -EINVAL;
+
+ bpf_map_inc(new_map);
+ refcount_set(&kvalue->refcnt, 1);
+
+ set_memory_rox((long)st_map->image, 1);
+ err = st_map->st_ops->update(kvalue->data, old_st_map->kvalue.data);
+ if (err) {
+ refcount_set(&kvalue->refcnt, 0);
+
+ set_memory_nx((long)st_map->image, 1);
+ set_memory_rw((long)st_map->image, 1);
+ bpf_map_put(new_map);
+ return err;
+ }
+
+ link->map = new_map;
+
+ bpf_struct_ops_kvalue_put(&old_st_map->kvalue);
+
+ return 0;
+}
+
static const struct bpf_link_ops bpf_struct_ops_map_lops = {
.release = bpf_struct_ops_map_link_release,
.dealloc = bpf_struct_ops_map_link_dealloc,
.show_fdinfo = bpf_struct_ops_map_link_show_fdinfo,
.fill_link_info = bpf_struct_ops_map_link_fill_link_info,
+ .update_struct_ops = bpf_struct_ops_map_link_update,
};
int link_create_struct_ops_map(union bpf_attr *attr, bpfptr_t uattr)
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 54e172d8f5d1..1341634863b5 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -4650,6 +4650,32 @@ static int link_create(union bpf_attr *attr, bpfptr_t uattr)
return ret;
}
+#define BPF_LINK_UPDATE_STRUCT_OPS_LAST_FIELD link_update_struct_ops.new_map_fd
+
+static int link_update_struct_ops(struct bpf_link *link, union bpf_attr *attr)
+{
+ struct bpf_map *new_map;
+ int ret = 0;
+
+ new_map = bpf_map_get(attr->link_update.new_map_fd);
+ if (IS_ERR(new_map))
+ return -EINVAL;
+
+ if (new_map->map_type != BPF_MAP_TYPE_STRUCT_OPS) {
+ ret = -EINVAL;
+ goto out_put_map;
+ }
+
+ if (link->ops->update_struct_ops)
+ ret = link->ops->update_struct_ops(link, new_map);
+ else
+ ret = -EINVAL;
+
+out_put_map:
+ bpf_map_put(new_map);
+ return ret;
+}
+
#define BPF_LINK_UPDATE_LAST_FIELD link_update.old_prog_fd
static int link_update(union bpf_attr *attr)
@@ -4670,6 +4696,11 @@ static int link_update(union bpf_attr *attr)
if (IS_ERR(link))
return PTR_ERR(link);
+ if (link->type == BPF_LINK_TYPE_STRUCT_OPS) {
+ ret = link_update_struct_ops(link, attr);
+ goto out_put_link;
+ }
+
new_prog = bpf_prog_get(attr->link_update.new_prog_fd);
if (IS_ERR(new_prog)) {
ret = PTR_ERR(new_prog);
diff --git a/net/ipv4/bpf_tcp_ca.c b/net/ipv4/bpf_tcp_ca.c
index 66ce5fadfe42..558b01d5250f 100644
--- a/net/ipv4/bpf_tcp_ca.c
+++ b/net/ipv4/bpf_tcp_ca.c
@@ -239,8 +239,6 @@ static int bpf_tcp_ca_init_member(const struct btf_type *t,
if (bpf_obj_name_cpy(tcp_ca->name, utcp_ca->name,
sizeof(tcp_ca->name)) <= 0)
return -EINVAL;
- if (tcp_ca_find(utcp_ca->name))
- return -EEXIST;
return 1;
}
--
2.30.2
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH bpf-next 6/7] libbpf: Update a bpf_link with another struct_ops.
2023-02-14 22:17 [PATCH bpf-next 0/7] Transit between BPF TCP congestion controls Kui-Feng Lee
` (4 preceding siblings ...)
2023-02-14 22:17 ` [PATCH bpf-next 5/7] bpf: Update the struct_ops of a bpf_link Kui-Feng Lee
@ 2023-02-14 22:17 ` Kui-Feng Lee
2023-02-16 22:48 ` Andrii Nakryiko
2023-02-14 22:17 ` [PATCH bpf-next 7/7] selftests/bpf: Test switching TCP Congestion Control algorithms Kui-Feng Lee
6 siblings, 1 reply; 43+ messages in thread
From: Kui-Feng Lee @ 2023-02-14 22:17 UTC (permalink / raw)
To: bpf, ast, martin.lau, song, kernel-team, andrii; +Cc: Kui-Feng Lee
Introduce bpf_link__update_struct_ops(), which allows replacing the
struct_ops map backing a given bpf_link with an alternative
implementation.
Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
---
tools/lib/bpf/libbpf.c | 35 +++++++++++++++++++++++++++++++++++
tools/lib/bpf/libbpf.h | 1 +
tools/lib/bpf/libbpf.map | 1 +
3 files changed, 37 insertions(+)
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 1eff6a03ddd9..6f7c72e312d4 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -11524,6 +11524,41 @@ struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
return &link->link;
}
+/*
+ * Swap the backing struct_ops map of a link with a new struct_ops map.
+ */
+int bpf_link__update_struct_ops(struct bpf_link *link, const struct bpf_map *map)
+{
+ struct bpf_link_struct_ops_map *st_ops_link;
+ int err, fd;
+
+ if (!bpf_map__is_struct_ops(map) || map->fd == -1)
+ return -EINVAL;
+
+ /* Ensure the type of a link is correct */
+ if (link->detach != bpf_link__detach_struct_ops)
+ return -EINVAL;
+
+ err = bpf_map__update_vdata(map);
+ if (err) {
+ err = -errno;
+ free(link);
+ return err;
+ }
+
+ fd = bpf_link_update(link->fd, map->fd, NULL);
+ if (fd < 0) {
+ err = -errno;
+ free(link);
+ return err;
+ }
+
+ st_ops_link = container_of(link, struct bpf_link_struct_ops_map, link);
+ st_ops_link->map_fd = map->fd;
+
+ return 0;
+}
+
typedef enum bpf_perf_event_ret (*bpf_perf_event_print_t)(struct perf_event_header *hdr,
void *private_data);
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 2efd80f6f7b9..dd25cd6759d4 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -695,6 +695,7 @@ bpf_program__attach_freplace(const struct bpf_program *prog,
struct bpf_map;
LIBBPF_API struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map);
+LIBBPF_API int bpf_link__update_struct_ops(struct bpf_link *link, const struct bpf_map *map);
struct bpf_iter_attach_opts {
size_t sz; /* size of this struct for forward/backward compatibility */
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index 11c36a3c1a9f..ca6993c744b6 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -373,6 +373,7 @@ LIBBPF_1.1.0 {
global:
bpf_btf_get_fd_by_id_opts;
bpf_link_get_fd_by_id_opts;
+ bpf_link__update_struct_ops;
bpf_map_get_fd_by_id_opts;
bpf_prog_get_fd_by_id_opts;
user_ring_buffer__discard;
--
2.30.2
* [PATCH bpf-next 7/7] selftests/bpf: Test switching TCP Congestion Control algorithms.
2023-02-14 22:17 [PATCH bpf-next 0/7] Transit between BPF TCP congestion controls Kui-Feng Lee
` (5 preceding siblings ...)
2023-02-14 22:17 ` [PATCH bpf-next 6/7] libbpf: Update a bpf_link with another struct_ops Kui-Feng Lee
@ 2023-02-14 22:17 ` Kui-Feng Lee
2023-02-16 22:50 ` Andrii Nakryiko
6 siblings, 1 reply; 43+ messages in thread
From: Kui-Feng Lee @ 2023-02-14 22:17 UTC (permalink / raw)
To: bpf, ast, martin.lau, song, kernel-team, andrii; +Cc: Kui-Feng Lee
Create a pair of sockets that use the congestion control algorithm
registered under a particular name. Then switch this congestion
control algorithm to another implementation and check whether newly
created connections using the same cc name now run the new
implementation.
Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
---
.../selftests/bpf/prog_tests/bpf_tcp_ca.c | 48 ++++++++++++
.../selftests/bpf/progs/tcp_ca_update.c | 75 +++++++++++++++++++
2 files changed, 123 insertions(+)
create mode 100644 tools/testing/selftests/bpf/progs/tcp_ca_update.c
diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c b/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c
index e980188d4124..89477e4c9a24 100644
--- a/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c
+++ b/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c
@@ -8,6 +8,7 @@
#include "bpf_dctcp.skel.h"
#include "bpf_cubic.skel.h"
#include "bpf_tcp_nogpl.skel.h"
+#include "tcp_ca_update.skel.h"
#include "bpf_dctcp_release.skel.h"
#include "tcp_ca_write_sk_pacing.skel.h"
#include "tcp_ca_incompl_cong_ops.skel.h"
@@ -381,6 +382,51 @@ static void test_unsupp_cong_op(void)
libbpf_set_print(old_print_fn);
}
+static void test_update_ca(void)
+{
+ struct tcp_ca_update *skel;
+ struct bpf_link *link;
+ int saved_ca1_cnt;
+ int err;
+
+ skel = tcp_ca_update__open();
+ if (!ASSERT_OK_PTR(skel, "open"))
+ return;
+
+ err = bpf_map__set_map_flags(skel->maps.ca_update_1,
+ bpf_map__map_flags(skel->maps.ca_update_1) | BPF_F_LINK);
+ if (!ASSERT_OK(err, "set_map_flags_1"))
+ return;
+
+ err = bpf_map__set_map_flags(skel->maps.ca_update_2,
+ bpf_map__map_flags(skel->maps.ca_update_2) | BPF_F_LINK);
+ if (!ASSERT_OK(err, "set_map_flags_2"))
+ return;
+
+ err = tcp_ca_update__load(skel);
+ if (!ASSERT_OK(err, "load")) {
+ tcp_ca_update__destroy(skel);
+ return;
+ }
+
+ link = bpf_map__attach_struct_ops(skel->maps.ca_update_1);
+ ASSERT_OK_PTR(link, "attach_struct_ops");
+
+ do_test("tcp_ca_update", NULL);
+ saved_ca1_cnt = skel->bss->ca1_cnt;
+ ASSERT_GT(saved_ca1_cnt, 0, "ca1_ca1_cnt");
+
+ err = bpf_link__update_struct_ops(link, skel->maps.ca_update_2);
+ ASSERT_OK(err, "update_struct_ops");
+
+ do_test("tcp_ca_update", NULL);
+ ASSERT_EQ(skel->bss->ca1_cnt, saved_ca1_cnt, "ca2_ca1_cnt");
+ ASSERT_GT(skel->bss->ca2_cnt, 0, "ca2_ca2_cnt");
+
+ bpf_link__destroy(link);
+ tcp_ca_update__destroy(skel);
+}
+
void test_bpf_tcp_ca(void)
{
if (test__start_subtest("dctcp"))
@@ -399,4 +445,6 @@ void test_bpf_tcp_ca(void)
test_incompl_cong_ops();
if (test__start_subtest("unsupp_cong_op"))
test_unsupp_cong_op();
+ if (test__start_subtest("update_ca"))
+ test_update_ca();
}
diff --git a/tools/testing/selftests/bpf/progs/tcp_ca_update.c b/tools/testing/selftests/bpf/progs/tcp_ca_update.c
new file mode 100644
index 000000000000..cf51fe54ac01
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/tcp_ca_update.c
@@ -0,0 +1,75 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "vmlinux.h"
+
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+char _license[] SEC("license") = "GPL";
+
+int ca1_cnt = 0;
+int ca2_cnt = 0;
+
+#define USEC_PER_SEC 1000000UL
+
+#define min(a, b) ((a) < (b) ? (a) : (b))
+
+static inline struct tcp_sock *tcp_sk(const struct sock *sk)
+{
+ return (struct tcp_sock *)sk;
+}
+
+SEC("struct_ops/ca_update_init")
+void BPF_PROG(ca_update_init, struct sock *sk)
+{
+#ifdef ENABLE_ATOMICS_TESTS
+ __sync_bool_compare_and_swap(&sk->sk_pacing_status, SK_PACING_NONE,
+ SK_PACING_NEEDED);
+#else
+ sk->sk_pacing_status = SK_PACING_NEEDED;
+#endif
+}
+
+SEC("struct_ops/ca_update_1_cong_control")
+void BPF_PROG(ca_update_1_cong_control, struct sock *sk,
+ const struct rate_sample *rs)
+{
+ ca1_cnt++;
+}
+
+SEC("struct_ops/ca_update_2_cong_control")
+void BPF_PROG(ca_update_2_cong_control, struct sock *sk,
+ const struct rate_sample *rs)
+{
+ ca2_cnt++;
+}
+
+SEC("struct_ops/ca_update_ssthresh")
+__u32 BPF_PROG(ca_update_ssthresh, struct sock *sk)
+{
+ return tcp_sk(sk)->snd_ssthresh;
+}
+
+SEC("struct_ops/ca_update_undo_cwnd")
+__u32 BPF_PROG(ca_update_undo_cwnd, struct sock *sk)
+{
+ return tcp_sk(sk)->snd_cwnd;
+}
+
+SEC(".struct_ops")
+struct tcp_congestion_ops ca_update_1 = {
+ .init = (void *)ca_update_init,
+ .cong_control = (void *)ca_update_1_cong_control,
+ .ssthresh = (void *)ca_update_ssthresh,
+ .undo_cwnd = (void *)ca_update_undo_cwnd,
+ .name = "tcp_ca_update",
+};
+
+SEC(".struct_ops")
+struct tcp_congestion_ops ca_update_2 = {
+ .init = (void *)ca_update_init,
+ .cong_control = (void *)ca_update_2_cong_control,
+ .ssthresh = (void *)ca_update_ssthresh,
+ .undo_cwnd = (void *)ca_update_undo_cwnd,
+ .name = "tcp_ca_update",
+};
--
2.30.2
* Re: [PATCH bpf-next 1/7] bpf: Create links for BPF struct_ops maps.
2023-02-14 22:17 ` [PATCH bpf-next 1/7] bpf: Create links for BPF struct_ops maps Kui-Feng Lee
@ 2023-02-15 0:26 ` kernel test robot
2023-02-15 2:39 ` Stanislav Fomichev
2023-02-15 22:58 ` Martin KaFai Lau
2 siblings, 0 replies; 43+ messages in thread
From: kernel test robot @ 2023-02-15 0:26 UTC (permalink / raw)
To: Kui-Feng Lee, bpf, ast, martin.lau, song, kernel-team, andrii
Cc: oe-kbuild-all, Kui-Feng Lee
Hi Kui-Feng,
Thank you for the patch! Perhaps something to improve:
[auto build test WARNING on bpf-next/master]
url: https://github.com/intel-lab-lkp/linux/commits/Kui-Feng-Lee/bpf-Create-links-for-BPF-struct_ops-maps/20230215-061816
base: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link: https://lore.kernel.org/r/20230214221718.503964-2-kuifeng%40meta.com
patch subject: [PATCH bpf-next 1/7] bpf: Create links for BPF struct_ops maps.
config: powerpc-allyesconfig (https://download.01.org/0day-ci/archive/20230215/202302150809.TbWg3iN6-lkp@intel.com/config)
compiler: powerpc-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/c186fe88190559d4872279724d598b8d45ba3092
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Kui-Feng-Lee/bpf-Create-links-for-BPF-struct_ops-maps/20230215-061816
git checkout c186fe88190559d4872279724d598b8d45ba3092
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=powerpc olddefconfig
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=powerpc SHELL=/bin/bash kernel/bpf/
If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202302150809.TbWg3iN6-lkp@intel.com/
All warnings (new ones prefixed by >>):
>> kernel/bpf/bpf_struct_ops.c:736:5: warning: no previous prototype for 'link_create_struct_ops_map' [-Wmissing-prototypes]
736 | int link_create_struct_ops_map(union bpf_attr *attr, bpfptr_t uattr)
| ^~~~~~~~~~~~~~~~~~~~~~~~~~
vim +/link_create_struct_ops_map +736 kernel/bpf/bpf_struct_ops.c
735
> 736 int link_create_struct_ops_map(union bpf_attr *attr, bpfptr_t uattr)
737 {
738 struct bpf_link_primer link_primer;
739 struct bpf_map *map;
740 struct bpf_link *link = NULL;
741 int err;
742
743 map = bpf_map_get(attr->link_create.prog_fd);
744 if (map->map_type != BPF_MAP_TYPE_STRUCT_OPS)
745 return -EINVAL;
746
747 link = kzalloc(sizeof(*link), GFP_USER);
748 if (!link) {
749 err = -ENOMEM;
750 goto err_out;
751 }
752 bpf_link_init(link, BPF_LINK_TYPE_STRUCT_OPS, &bpf_struct_ops_map_lops, NULL);
753 link->map = map;
754
755 err = bpf_link_prime(link, &link_primer);
756 if (err)
757 goto err_out;
758
759 return bpf_link_settle(&link_primer);
760
761 err_out:
762 bpf_map_put(map);
763 kfree(link);
764 return err;
765 }
766
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests
* Re: [PATCH bpf-next 1/7] bpf: Create links for BPF struct_ops maps.
2023-02-14 22:17 ` [PATCH bpf-next 1/7] bpf: Create links for BPF struct_ops maps Kui-Feng Lee
2023-02-15 0:26 ` kernel test robot
@ 2023-02-15 2:39 ` Stanislav Fomichev
2023-02-15 18:04 ` Kui-Feng Lee
2023-02-15 22:58 ` Martin KaFai Lau
2 siblings, 1 reply; 43+ messages in thread
From: Stanislav Fomichev @ 2023-02-15 2:39 UTC (permalink / raw)
To: Kui-Feng Lee; +Cc: bpf, ast, martin.lau, song, kernel-team, andrii
On 02/14, Kui-Feng Lee wrote:
> BPF struct_ops maps are employed directly to register TCP Congestion
> Control algorithms. Unlike other BPF programs that terminate when
> their links gone, the struct_ops program reduces its refcount solely
> upon death of its FD. The link of a BPF struct_ops map provides a
> uniform experience akin to other types of BPF programs. A TCP
> Congestion Control algorithm will be unregistered upon destroying the
> FD in the following patches.
> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
> ---
> include/linux/bpf.h | 6 +++-
> include/uapi/linux/bpf.h | 4 +++
> kernel/bpf/bpf_struct_ops.c | 66 ++++++++++++++++++++++++++++++++++
> kernel/bpf/syscall.c | 29 ++++++++++-----
> tools/include/uapi/linux/bpf.h | 4 +++
> tools/lib/bpf/bpf.c | 2 ++
> tools/lib/bpf/libbpf.c | 1 +
> 7 files changed, 102 insertions(+), 10 deletions(-)
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 8b5d0b4c4ada..13683584b071 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1391,7 +1391,11 @@ struct bpf_link {
> u32 id;
> enum bpf_link_type type;
> const struct bpf_link_ops *ops;
> - struct bpf_prog *prog;
[..]
> + union {
> + struct bpf_prog *prog;
> + /* Backed by a struct_ops (type == BPF_LINK_UPDATE_STRUCT_OPS) */
> + struct bpf_map *map;
> + };
Any reason you're not using the approach that has been used for other
links where we create a wrapping structure + container_of?
struct bpf_struct_ops_link {
struct bpf_link link;
struct bpf_map *map;
};
> struct work_struct work;
> };
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 17afd2b35ee5..1e6cdd0f355d 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1033,6 +1033,7 @@ enum bpf_attach_type {
> BPF_PERF_EVENT,
> BPF_TRACE_KPROBE_MULTI,
> BPF_LSM_CGROUP,
> + BPF_STRUCT_OPS_MAP,
> __MAX_BPF_ATTACH_TYPE
> };
> @@ -6354,6 +6355,9 @@ struct bpf_link_info {
> struct {
> __u32 ifindex;
> } xdp;
> + struct {
> + __u32 map_id;
> + } struct_ops_map;
> };
> } __attribute__((aligned(8)));
> diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
> index ece9870cab68..621c8e24481a 100644
> --- a/kernel/bpf/bpf_struct_ops.c
> +++ b/kernel/bpf/bpf_struct_ops.c
> @@ -698,3 +698,69 @@ void bpf_struct_ops_put(const void *kdata)
> call_rcu(&st_map->rcu, bpf_struct_ops_put_rcu);
> }
> }
> +
> +static void bpf_struct_ops_map_link_release(struct bpf_link *link)
> +{
> + if (link->map) {
> + bpf_map_put(link->map);
> + link->map = NULL;
> + }
> +}
> +
> +static void bpf_struct_ops_map_link_dealloc(struct bpf_link *link)
> +{
> + kfree(link);
> +}
> +
> +static void bpf_struct_ops_map_link_show_fdinfo(const struct bpf_link
> *link,
> + struct seq_file *seq)
> +{
> + seq_printf(seq, "map_id:\t%d\n",
> + link->map->id);
> +}
> +
> +static int bpf_struct_ops_map_link_fill_link_info(const struct bpf_link
> *link,
> + struct bpf_link_info *info)
> +{
> + info->struct_ops_map.map_id = link->map->id;
> + return 0;
> +}
> +
> +static const struct bpf_link_ops bpf_struct_ops_map_lops = {
> + .release = bpf_struct_ops_map_link_release,
> + .dealloc = bpf_struct_ops_map_link_dealloc,
> + .show_fdinfo = bpf_struct_ops_map_link_show_fdinfo,
> + .fill_link_info = bpf_struct_ops_map_link_fill_link_info,
> +};
> +
> +int link_create_struct_ops_map(union bpf_attr *attr, bpfptr_t uattr)
> +{
[..]
> + struct bpf_link_primer link_primer;
> + struct bpf_map *map;
> + struct bpf_link *link = NULL;
Are we still trying to keep reverse Christmas tree ordering?
> + int err;
> +
> + map = bpf_map_get(attr->link_create.prog_fd);
bpf_map_get can fail here?
> + if (map->map_type != BPF_MAP_TYPE_STRUCT_OPS)
> + return -EINVAL;
> +
> + link = kzalloc(sizeof(*link), GFP_USER);
> + if (!link) {
> + err = -ENOMEM;
> + goto err_out;
> + }
> + bpf_link_init(link, BPF_LINK_TYPE_STRUCT_OPS, &bpf_struct_ops_map_lops,
> NULL);
> + link->map = map;
> +
> + err = bpf_link_prime(link, &link_primer);
> + if (err)
> + goto err_out;
> +
> + return bpf_link_settle(&link_primer);
> +
> +err_out:
> + bpf_map_put(map);
> + kfree(link);
> + return err;
> +}
> +
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index cda8d00f3762..54e172d8f5d1 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -2738,7 +2738,9 @@ static void bpf_link_free(struct bpf_link *link)
> if (link->prog) {
> /* detach BPF program, clean up used resources */
> link->ops->release(link);
> - bpf_prog_put(link->prog);
> + if (link->type != BPF_LINK_TYPE_STRUCT_OPS)
> + bpf_prog_put(link->prog);
> + /* The struct_ops links clean up map by them-selves. */
Why not more generic:
if (link->prog)
bpf_prog_put(link->prog);
?
> }
> /* free bpf_link and its containing memory */
> link->ops->dealloc(link);
> @@ -2794,16 +2796,19 @@ static void bpf_link_show_fdinfo(struct seq_file
> *m, struct file *filp)
> const struct bpf_prog *prog = link->prog;
> char prog_tag[sizeof(prog->tag) * 2 + 1] = { };
> - bin2hex(prog_tag, prog->tag, sizeof(prog->tag));
> seq_printf(m,
> "link_type:\t%s\n"
> - "link_id:\t%u\n"
> - "prog_tag:\t%s\n"
> - "prog_id:\t%u\n",
> + "link_id:\t%u\n",
> bpf_link_type_strs[link->type],
> - link->id,
> - prog_tag,
> - prog->aux->id);
> + link->id);
> + if (link->type != BPF_LINK_TYPE_STRUCT_OPS) {
> + bin2hex(prog_tag, prog->tag, sizeof(prog->tag));
> + seq_printf(m,
> + "prog_tag:\t%s\n"
> + "prog_id:\t%u\n",
> + prog_tag,
> + prog->aux->id);
> + }
> if (link->ops->show_fdinfo)
> link->ops->show_fdinfo(link, m);
> }
> @@ -4278,7 +4283,8 @@ static int bpf_link_get_info_by_fd(struct file
> *file,
> info.type = link->type;
> info.id = link->id;
> - info.prog_id = link->prog->aux->id;
> + if (link->type != BPF_LINK_TYPE_STRUCT_OPS)
> + info.prog_id = link->prog->aux->id;
Here as well: should we have "link->type != BPF_LINK_TYPE_STRUCT_OPS" vs
"link->prog != NULL" ?
> if (link->ops->fill_link_info) {
> err = link->ops->fill_link_info(link, &info);
> @@ -4531,6 +4537,8 @@ static int bpf_map_do_batch(const union bpf_attr
> *attr,
> return err;
> }
> +extern int link_create_struct_ops_map(union bpf_attr *attr, bpfptr_t
> uattr);
> +
> #define BPF_LINK_CREATE_LAST_FIELD link_create.kprobe_multi.cookies
> static int link_create(union bpf_attr *attr, bpfptr_t uattr)
> {
> @@ -4541,6 +4549,9 @@ static int link_create(union bpf_attr *attr,
> bpfptr_t uattr)
> if (CHECK_ATTR(BPF_LINK_CREATE))
> return -EINVAL;
> + if (attr->link_create.attach_type == BPF_STRUCT_OPS_MAP)
> + return link_create_struct_ops_map(attr, uattr);
> +
> prog = bpf_prog_get(attr->link_create.prog_fd);
> if (IS_ERR(prog))
> return PTR_ERR(prog);
> diff --git a/tools/include/uapi/linux/bpf.h
> b/tools/include/uapi/linux/bpf.h
> index 17afd2b35ee5..1e6cdd0f355d 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -1033,6 +1033,7 @@ enum bpf_attach_type {
> BPF_PERF_EVENT,
> BPF_TRACE_KPROBE_MULTI,
> BPF_LSM_CGROUP,
> + BPF_STRUCT_OPS_MAP,
> __MAX_BPF_ATTACH_TYPE
> };
> @@ -6354,6 +6355,9 @@ struct bpf_link_info {
> struct {
> __u32 ifindex;
> } xdp;
> + struct {
> + __u32 map_id;
> + } struct_ops_map;
> };
> } __attribute__((aligned(8)));
> diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
> index 9aff98f42a3d..e44d49f17c86 100644
> --- a/tools/lib/bpf/bpf.c
> +++ b/tools/lib/bpf/bpf.c
> @@ -731,6 +731,8 @@ int bpf_link_create(int prog_fd, int target_fd,
> if (!OPTS_ZEROED(opts, tracing))
> return libbpf_err(-EINVAL);
> break;
> + case BPF_STRUCT_OPS_MAP:
> + break;
> default:
> if (!OPTS_ZEROED(opts, flags))
> return libbpf_err(-EINVAL);
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 35a698eb825d..75ed95b7e455 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -115,6 +115,7 @@ static const char * const attach_type_name[] = {
> [BPF_SK_REUSEPORT_SELECT_OR_MIGRATE] = "sk_reuseport_select_or_migrate",
> [BPF_PERF_EVENT] = "perf_event",
> [BPF_TRACE_KPROBE_MULTI] = "trace_kprobe_multi",
> + [BPF_STRUCT_OPS_MAP] = "struct_ops_map",
> };
> static const char * const link_type_name[] = {
> --
> 2.30.2
* Re: [PATCH bpf-next 2/7] net: Update an existing TCP congestion control algorithm.
2023-02-14 22:17 ` [PATCH bpf-next 2/7] net: Update an existing TCP congestion control algorithm Kui-Feng Lee
@ 2023-02-15 2:43 ` Stanislav Fomichev
2023-02-15 18:15 ` Kui-Feng Lee
0 siblings, 1 reply; 43+ messages in thread
From: Stanislav Fomichev @ 2023-02-15 2:43 UTC (permalink / raw)
To: Kui-Feng Lee; +Cc: bpf, ast, martin.lau, song, kernel-team, andrii
On 02/14, Kui-Feng Lee wrote:
> This feature lets you immediately transition to another congestion
> control algorithm or implementation with the same name. Once a name
> is updated, new connections will apply this new algorithm.
> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
> ---
> include/linux/bpf.h | 1 +
> include/net/tcp.h | 2 ++
> net/bpf/bpf_dummy_struct_ops.c | 6 ++++++
> net/ipv4/bpf_tcp_ca.c | 6 ++++++
> net/ipv4/tcp_cong.c | 39 ++++++++++++++++++++++++++++++++++
> 5 files changed, 54 insertions(+)
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 13683584b071..5fe39f56a760 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1450,6 +1450,7 @@ struct bpf_struct_ops {
> void *kdata, const void *udata);
> int (*reg)(void *kdata);
> void (*unreg)(void *kdata);
> + int (*update)(void *kdata, void *old_kdata);
> const struct btf_type *type;
> const struct btf_type *value_type;
> const char *name;
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index db9f828e9d1e..239cc0e2639c 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -1117,6 +1117,8 @@ struct tcp_congestion_ops {
> int tcp_register_congestion_control(struct tcp_congestion_ops *type);
> void tcp_unregister_congestion_control(struct tcp_congestion_ops *type);
> +int tcp_update_congestion_control(struct tcp_congestion_ops *type,
> + struct tcp_congestion_ops *old_type);
> void tcp_assign_congestion_control(struct sock *sk);
> void tcp_init_congestion_control(struct sock *sk);
> diff --git a/net/bpf/bpf_dummy_struct_ops.c
> b/net/bpf/bpf_dummy_struct_ops.c
> index ff4f89a2b02a..158f14e240d0 100644
> --- a/net/bpf/bpf_dummy_struct_ops.c
> +++ b/net/bpf/bpf_dummy_struct_ops.c
> @@ -222,12 +222,18 @@ static void bpf_dummy_unreg(void *kdata)
> {
> }
> +static int bpf_dummy_update(void *kdata, void *old_kdata)
> +{
> + return -EOPNOTSUPP;
> +}
> +
> struct bpf_struct_ops bpf_bpf_dummy_ops = {
> .verifier_ops = &bpf_dummy_verifier_ops,
> .init = bpf_dummy_init,
> .check_member = bpf_dummy_ops_check_member,
> .init_member = bpf_dummy_init_member,
> .reg = bpf_dummy_reg,
> + .update = bpf_dummy_update,
> .unreg = bpf_dummy_unreg,
> .name = "bpf_dummy_ops",
> };
> diff --git a/net/ipv4/bpf_tcp_ca.c b/net/ipv4/bpf_tcp_ca.c
> index 13fc0c185cd9..66ce5fadfe42 100644
> --- a/net/ipv4/bpf_tcp_ca.c
> +++ b/net/ipv4/bpf_tcp_ca.c
> @@ -266,10 +266,16 @@ static void bpf_tcp_ca_unreg(void *kdata)
> tcp_unregister_congestion_control(kdata);
> }
> +static int bpf_tcp_ca_update(void *kdata, void *old_kdata)
> +{
> + return tcp_update_congestion_control(kdata, old_kdata);
> +}
> +
> struct bpf_struct_ops bpf_tcp_congestion_ops = {
> .verifier_ops = &bpf_tcp_ca_verifier_ops,
> .reg = bpf_tcp_ca_reg,
> .unreg = bpf_tcp_ca_unreg,
> + .update = bpf_tcp_ca_update,
> .check_member = bpf_tcp_ca_check_member,
> .init_member = bpf_tcp_ca_init_member,
> .init = bpf_tcp_ca_init,
> diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
> index db8b4b488c31..22fd7c12360e 100644
> --- a/net/ipv4/tcp_cong.c
> +++ b/net/ipv4/tcp_cong.c
> @@ -130,6 +130,45 @@ void tcp_unregister_congestion_control(struct
> tcp_congestion_ops *ca)
> }
> EXPORT_SYMBOL_GPL(tcp_unregister_congestion_control);
> +/* Replace a registered old ca with a new one.
> + *
> + * The new ca must have the same name as the old one, that has been
> + * registered.
> + */
> +int tcp_update_congestion_control(struct tcp_congestion_ops *ca, struct
> tcp_congestion_ops *old_ca)
> +{
> + struct tcp_congestion_ops *existing;
> + int ret = 0;
> +
[..]
> + /* all algorithms must implement these */
> + if (!ca->ssthresh || !ca->undo_cwnd ||
> + !(ca->cong_avoid || ca->cong_control)) {
> + pr_err("%s does not implement required ops\n", old_ca->name);
> + return -EINVAL;
> + }
> +
> + ca->key = jhash(ca->name, sizeof(ca->name), strlen(ca->name));
Can we have this as some common _validate method to avoid copy-paste
from tcp_register_congestion_control?
Or, even better, can a single function handle both cases?
tcp_register_congestion_control(ca, old_ca);
- when old_ca == NULL -> register
- when old_ca != NULL -> try to update
> +
> + spin_lock(&tcp_cong_list_lock);
> + existing = tcp_ca_find_key(ca->key);
> + if (ca->key == TCP_CA_UNSPEC || !existing || strcmp(existing->name,
> ca->name)) {
> + pr_notice("%s not registered or non-unique key\n",
> + ca->name);
> + ret = -EINVAL;
> + } else if (existing != old_ca) {
> + pr_notice("invalid old congestion control algorithm to replace\n");
> + ret = -EINVAL;
> + } else {
> + list_del_rcu(&existing->list);
> + list_add_tail_rcu(&ca->list, &tcp_cong_list);
> + pr_debug("%s updated\n", ca->name);
> + }
> + spin_unlock(&tcp_cong_list_lock);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(tcp_update_congestion_control);
> +
> u32 tcp_ca_get_key_by_name(struct net *net, const char *name, bool
> *ecn_ca)
> {
> const struct tcp_congestion_ops *ca;
> --
> 2.30.2
* Re: [PATCH bpf-next 3/7] bpf: Register and unregister a struct_ops by their bpf_links.
2023-02-14 22:17 ` [PATCH bpf-next 3/7] bpf: Register and unregister a struct_ops by their bpf_links Kui-Feng Lee
@ 2023-02-15 2:53 ` Stanislav Fomichev
2023-02-15 18:29 ` Kui-Feng Lee
2023-02-16 0:37 ` Martin KaFai Lau
1 sibling, 1 reply; 43+ messages in thread
From: Stanislav Fomichev @ 2023-02-15 2:53 UTC (permalink / raw)
To: Kui-Feng Lee; +Cc: bpf, ast, martin.lau, song, kernel-team, andrii
On 02/14, Kui-Feng Lee wrote:
> Registration via bpf_links ensures a uniform behavior, just like other
> BPF programs. BPF struct_ops were registered/unregistered when
> updating/deleting their values. Only the maps of struct_ops having
> the BPF_F_LINK flag are allowed to back a bpf_link.
> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
> ---
> include/uapi/linux/bpf.h | 3 ++
> kernel/bpf/bpf_struct_ops.c | 59 +++++++++++++++++++++++++++++++---
> tools/include/uapi/linux/bpf.h | 3 ++
> 3 files changed, 61 insertions(+), 4 deletions(-)
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 1e6cdd0f355d..48d8b3058aa1 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1267,6 +1267,9 @@ enum {
> /* Create a map that is suitable to be an inner map with dynamic max
> entries */
> BPF_F_INNER_MAP = (1U << 12),
> +
> +/* Create a map that will be registered/unregistered by the backing
> bpf_link */
> + BPF_F_LINK = (1U << 13),
> };
> /* Flags for BPF_PROG_QUERY. */
> diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
> index 621c8e24481a..d16ca06cf09a 100644
> --- a/kernel/bpf/bpf_struct_ops.c
> +++ b/kernel/bpf/bpf_struct_ops.c
> @@ -390,7 +390,7 @@ static int bpf_struct_ops_map_update_elem(struct
> bpf_map *map, void *key,
> mutex_lock(&st_map->lock);
> - if (kvalue->state != BPF_STRUCT_OPS_STATE_INIT) {
> + if (kvalue->state != BPF_STRUCT_OPS_STATE_INIT ||
> refcount_read(&kvalue->refcnt)) {
> err = -EBUSY;
> goto unlock;
> }
> @@ -491,6 +491,12 @@ static int bpf_struct_ops_map_update_elem(struct
> bpf_map *map, void *key,
> *(unsigned long *)(udata + moff) = prog->aux->id;
> }
> + if (st_map->map.map_flags & BPF_F_LINK) {
> + /* Let bpf_link handle registration & unregistration. */
> + smp_store_release(&kvalue->state, BPF_STRUCT_OPS_STATE_INUSE);
> + goto unlock;
> + }
> +
> refcount_set(&kvalue->refcnt, 1);
> bpf_map_inc(map);
[..]
> @@ -522,6 +528,7 @@ static int bpf_struct_ops_map_update_elem(struct
> bpf_map *map, void *key,
> kfree(tlinks);
> mutex_unlock(&st_map->lock);
> return err;
> +
> }
> static int bpf_struct_ops_map_delete_elem(struct bpf_map *map, void *key)
Seems like a left over hunk?
> @@ -535,6 +542,8 @@ static int bpf_struct_ops_map_delete_elem(struct
> bpf_map *map, void *key)
> BPF_STRUCT_OPS_STATE_TOBEFREE);
> switch (prev_state) {
> case BPF_STRUCT_OPS_STATE_INUSE:
> + if (st_map->map.map_flags & BPF_F_LINK)
> + return 0;
> st_map->st_ops->unreg(&st_map->kvalue.data);
> if (refcount_dec_and_test(&st_map->kvalue.refcnt))
> bpf_map_put(map);
> @@ -585,7 +594,7 @@ static void bpf_struct_ops_map_free(struct bpf_map
> *map)
> static int bpf_struct_ops_map_alloc_check(union bpf_attr *attr)
> {
> if (attr->key_size != sizeof(unsigned int) || attr->max_entries != 1 ||
> - attr->map_flags || !attr->btf_vmlinux_value_type_id)
> + (attr->map_flags & ~BPF_F_LINK) || !attr->btf_vmlinux_value_type_id)
> return -EINVAL;
> return 0;
> }
> @@ -638,6 +647,8 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union
> bpf_attr *attr)
> set_vm_flush_reset_perms(st_map->image);
> bpf_map_init_from_attr(map, attr);
[..]
> + map->map_flags |= attr->map_flags & BPF_F_LINK;
You seem to have the following check above:
if (.... (attr->map_flags & ~BPF_F_LINK) ...) return -EINVAL;
And here you do:
map->map_flags |= attr->map_flags & BPF_F_LINK;
So maybe we can simplify to:
map->map_flags |= attr->map_flags;
?
> +
> return map;
> }
> @@ -699,10 +710,25 @@ void bpf_struct_ops_put(const void *kdata)
> }
> }
> +static void bpf_struct_ops_kvalue_put(struct bpf_struct_ops_value
> *kvalue)
> +{
> + struct bpf_struct_ops_map *st_map;
> +
> + if (refcount_dec_and_test(&kvalue->refcnt)) {
> + st_map = container_of(kvalue, struct bpf_struct_ops_map,
> + kvalue);
> + bpf_map_put(&st_map->map);
> + }
> +}
> +
> static void bpf_struct_ops_map_link_release(struct bpf_link *link)
> {
> + struct bpf_struct_ops_map *st_map;
> +
> if (link->map) {
> - bpf_map_put(link->map);
> + st_map = (struct bpf_struct_ops_map *)link->map;
> + st_map->st_ops->unreg(&st_map->kvalue.data);
> + bpf_struct_ops_kvalue_put(&st_map->kvalue);
> link->map = NULL;
> }
> }
> @@ -735,13 +761,15 @@ static const struct bpf_link_ops
> bpf_struct_ops_map_lops = {
> int link_create_struct_ops_map(union bpf_attr *attr, bpfptr_t uattr)
> {
> + struct bpf_struct_ops_map *st_map;
> struct bpf_link_primer link_primer;
> + struct bpf_struct_ops_value *kvalue;
> struct bpf_map *map;
> struct bpf_link *link = NULL;
> int err;
> map = bpf_map_get(attr->link_create.prog_fd);
> - if (map->map_type != BPF_MAP_TYPE_STRUCT_OPS)
> + if (map->map_type != BPF_MAP_TYPE_STRUCT_OPS || !(map->map_flags &
> BPF_F_LINK))
> return -EINVAL;
> link = kzalloc(sizeof(*link), GFP_USER);
> @@ -752,6 +780,29 @@ int link_create_struct_ops_map(union bpf_attr *attr,
> bpfptr_t uattr)
> bpf_link_init(link, BPF_LINK_TYPE_STRUCT_OPS, &bpf_struct_ops_map_lops,
> NULL);
> link->map = map;
[..]
> + if (map->map_flags & BPF_F_LINK) {
We already bail out above when the BPF_F_LINK flag is not set:
if (map->map_type != BPF_MAP_TYPE_STRUCT_OPS || !(map->map_flags &
BPF_F_LINK))
return -EINVAL;
So why check this 'if (map->map_flags & BPF_F_LINK)' condition here?
> + st_map = (struct bpf_struct_ops_map *)map;
> + kvalue = (struct bpf_struct_ops_value *)&st_map->kvalue;
> +
> + if (kvalue->state != BPF_STRUCT_OPS_STATE_INUSE ||
> + refcount_read(&kvalue->refcnt) != 0) {
> + err = -EINVAL;
> + goto err_out;
> + }
> +
> + refcount_set(&kvalue->refcnt, 1);
> +
> + set_memory_rox((long)st_map->image, 1);
> + err = st_map->st_ops->reg(kvalue->data);
> + if (err) {
> + refcount_set(&kvalue->refcnt, 0);
> +
> + set_memory_nx((long)st_map->image, 1);
> + set_memory_rw((long)st_map->image, 1);
> + goto err_out;
> + }
> + }
> +
> err = bpf_link_prime(link, &link_primer);
> if (err)
> goto err_out;
> diff --git a/tools/include/uapi/linux/bpf.h
> b/tools/include/uapi/linux/bpf.h
> index 1e6cdd0f355d..48d8b3058aa1 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -1267,6 +1267,9 @@ enum {
> /* Create a map that is suitable to be an inner map with dynamic max
> entries */
> BPF_F_INNER_MAP = (1U << 12),
> +
> +/* Create a map that will be registered/unregistered by the backed
> bpf_link */
> + BPF_F_LINK = (1U << 13),
> };
> /* Flags for BPF_PROG_QUERY. */
> --
> 2.30.2
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH bpf-next 4/7] libbpf: Create a bpf_link in bpf_map__attach_struct_ops().
2023-02-14 22:17 ` [PATCH bpf-next 4/7] libbpf: Create a bpf_link in bpf_map__attach_struct_ops() Kui-Feng Lee
@ 2023-02-15 2:58 ` Stanislav Fomichev
2023-02-15 18:44 ` Kui-Feng Lee
2023-02-16 22:40 ` Andrii Nakryiko
1 sibling, 1 reply; 43+ messages in thread
From: Stanislav Fomichev @ 2023-02-15 2:58 UTC (permalink / raw)
To: Kui-Feng Lee; +Cc: bpf, ast, martin.lau, song, kernel-team, andrii
On 02/14, Kui-Feng Lee wrote:
> bpf_map__attach_struct_ops() was creating a dummy bpf_link as a
> placeholder, but now it is constructing an authentic one by calling
> bpf_link_create() if the map has the BPF_F_LINK flag.
> You can flag a struct_ops map with BPF_F_LINK by calling
> bpf_map__set_map_flags().
> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
> ---
> tools/lib/bpf/libbpf.c | 73 +++++++++++++++++++++++++++++++++---------
> 1 file changed, 58 insertions(+), 15 deletions(-)
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 75ed95b7e455..1eff6a03ddd9 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -11430,29 +11430,41 @@ struct bpf_link *bpf_program__attach(const
> struct bpf_program *prog)
> return link;
> }
> +struct bpf_link_struct_ops_map {
> + struct bpf_link link;
> + int map_fd;
> +};
Ah, ok, now you're adding bpf_link_struct_ops_map. I guess I'm now
confused why you haven't done it in the first patch :-/
And what are these fake bpf_links? Can you share more about it means?
> +
> static int bpf_link__detach_struct_ops(struct bpf_link *link)
> {
> + struct bpf_link_struct_ops_map *st_link;
> __u32 zero = 0;
> - if (bpf_map_delete_elem(link->fd, &zero))
> + st_link = container_of(link, struct bpf_link_struct_ops_map, link);
> +
> + if (st_link->map_fd < 0) {
> + /* Fake bpf_link */
> + if (bpf_map_delete_elem(link->fd, &zero))
> + return -errno;
> + return 0;
> + }
> +
> + if (bpf_map_delete_elem(st_link->map_fd, &zero))
> + return -errno;
> +
> + if (close(link->fd))
> return -errno;
> return 0;
> }
> -struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
> +/*
> + * Update the map with the prepared vdata.
> + */
> +static int bpf_map__update_vdata(const struct bpf_map *map)
> {
> struct bpf_struct_ops *st_ops;
> - struct bpf_link *link;
> __u32 i, zero = 0;
> - int err;
> -
> - if (!bpf_map__is_struct_ops(map) || map->fd == -1)
> - return libbpf_err_ptr(-EINVAL);
> -
> - link = calloc(1, sizeof(*link));
> - if (!link)
> - return libbpf_err_ptr(-EINVAL);
> st_ops = map->st_ops;
> for (i = 0; i < btf_vlen(st_ops->type); i++) {
> @@ -11468,17 +11480,48 @@ struct bpf_link
> *bpf_map__attach_struct_ops(const struct bpf_map *map)
> *(unsigned long *)kern_data = prog_fd;
> }
> - err = bpf_map_update_elem(map->fd, &zero, st_ops->kern_vdata, 0);
> + return bpf_map_update_elem(map->fd, &zero, st_ops->kern_vdata, 0);
> +}
> +
> +struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
> +{
> + struct bpf_link_struct_ops_map *link;
> + int err, fd;
> +
> + if (!bpf_map__is_struct_ops(map) || map->fd == -1)
> + return libbpf_err_ptr(-EINVAL);
> +
> + link = calloc(1, sizeof(*link));
> + if (!link)
> + return libbpf_err_ptr(-EINVAL);
> +
> + err = bpf_map__update_vdata(map);
> if (err) {
> err = -errno;
> free(link);
> return libbpf_err_ptr(err);
> }
> - link->detach = bpf_link__detach_struct_ops;
> - link->fd = map->fd;
> + link->link.detach = bpf_link__detach_struct_ops;
> - return link;
> + if (!(map->def.map_flags & BPF_F_LINK)) {
> + /* Fake bpf_link */
> + link->link.fd = map->fd;
> + link->map_fd = -1;
> + return &link->link;
> + }
> +
> + fd = bpf_link_create(map->fd, -1, BPF_STRUCT_OPS_MAP, NULL);
> + if (fd < 0) {
> + err = -errno;
> + free(link);
> + return libbpf_err_ptr(err);
> + }
> +
> + link->link.fd = fd;
> + link->map_fd = map->fd;
> +
> + return &link->link;
> }
> typedef enum bpf_perf_event_ret (*bpf_perf_event_print_t)(struct
> perf_event_header *hdr,
> --
> 2.30.2
* Re: [PATCH bpf-next 1/7] bpf: Create links for BPF struct_ops maps.
2023-02-15 2:39 ` Stanislav Fomichev
@ 2023-02-15 18:04 ` Kui-Feng Lee
2023-02-15 18:44 ` Stanislav Fomichev
2023-02-15 20:30 ` Martin KaFai Lau
0 siblings, 2 replies; 43+ messages in thread
From: Kui-Feng Lee @ 2023-02-15 18:04 UTC (permalink / raw)
To: Stanislav Fomichev, Kui-Feng Lee
Cc: bpf, ast, martin.lau, song, kernel-team, andrii
Thank you for the feedback.
On 2/14/23 18:39, Stanislav Fomichev wrote:
> On 02/14, Kui-Feng Lee wrote:
>> BPF struct_ops maps are employed directly to register TCP Congestion
>> Control algorithms. Unlike other BPF programs that terminate when
>> their links are gone, the struct_ops program reduces its refcount solely
>> upon death of its FD. The link of a BPF struct_ops map provides a
>> uniform experience akin to other types of BPF programs. A TCP
>> Congestion Control algorithm will be unregistered upon destroying the
>> FD in the following patches.
>
>> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
>> ---
>> include/linux/bpf.h | 6 +++-
>> include/uapi/linux/bpf.h | 4 +++
>> kernel/bpf/bpf_struct_ops.c | 66 ++++++++++++++++++++++++++++++++++
>> kernel/bpf/syscall.c | 29 ++++++++++-----
>> tools/include/uapi/linux/bpf.h | 4 +++
>> tools/lib/bpf/bpf.c | 2 ++
>> tools/lib/bpf/libbpf.c | 1 +
>> 7 files changed, 102 insertions(+), 10 deletions(-)
>
>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>> index 8b5d0b4c4ada..13683584b071 100644
>> --- a/include/linux/bpf.h
>> +++ b/include/linux/bpf.h
>> @@ -1391,7 +1391,11 @@ struct bpf_link {
>> u32 id;
>> enum bpf_link_type type;
>> const struct bpf_link_ops *ops;
>> - struct bpf_prog *prog;
>
> [..]
>
>> + union {
>> + struct bpf_prog *prog;
>> + /* Backed by a struct_ops (type ==
>> BPF_LINK_UPDATE_STRUCT_OPS) */
>> + struct bpf_map *map;
>> + };
>
> Any reason you're not using the approach that has been used for other
> links where we create a wrapping structure + container_of?
>
> struct bpt_struct_ops_link {
> struct bpf_link link;
> struct bpf_map *map;
> };
>
`map` and `prog` are mutually exclusive, which is exactly what a
`union` is designed for.
The `container_of` approach also works. Since both `container_of` and
`union` are common in the kernel, is there any factor that makes the
former a better choice than the latter?
>> struct work_struct work;
>> };
>
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 17afd2b35ee5..1e6cdd0f355d 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -1033,6 +1033,7 @@ enum bpf_attach_type {
>> BPF_PERF_EVENT,
>> BPF_TRACE_KPROBE_MULTI,
>> BPF_LSM_CGROUP,
>> + BPF_STRUCT_OPS_MAP,
>> __MAX_BPF_ATTACH_TYPE
>> };
>
>> @@ -6354,6 +6355,9 @@ struct bpf_link_info {
>> struct {
>> __u32 ifindex;
>> } xdp;
>> + struct {
>> + __u32 map_id;
>> + } struct_ops_map;
>> };
>> } __attribute__((aligned(8)));
>
>> diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
>> index ece9870cab68..621c8e24481a 100644
>> --- a/kernel/bpf/bpf_struct_ops.c
>> +++ b/kernel/bpf/bpf_struct_ops.c
>> @@ -698,3 +698,69 @@ void bpf_struct_ops_put(const void *kdata)
>> call_rcu(&st_map->rcu, bpf_struct_ops_put_rcu);
>> }
>> }
>> +
>> +static void bpf_struct_ops_map_link_release(struct bpf_link *link)
>> +{
>> + if (link->map) {
>> + bpf_map_put(link->map);
>> + link->map = NULL;
>> + }
>> +}
>> +
>> +static void bpf_struct_ops_map_link_dealloc(struct bpf_link *link)
>> +{
>> + kfree(link);
>> +}
>> +
>> +static void bpf_struct_ops_map_link_show_fdinfo(const struct
>> bpf_link *link,
>> + struct seq_file *seq)
>> +{
>> + seq_printf(seq, "map_id:\t%d\n",
>> + link->map->id);
>> +}
>> +
>> +static int bpf_struct_ops_map_link_fill_link_info(const struct
>> bpf_link *link,
>> + struct bpf_link_info *info)
>> +{
>> + info->struct_ops_map.map_id = link->map->id;
>> + return 0;
>> +}
>> +
>> +static const struct bpf_link_ops bpf_struct_ops_map_lops = {
>> + .release = bpf_struct_ops_map_link_release,
>> + .dealloc = bpf_struct_ops_map_link_dealloc,
>> + .show_fdinfo = bpf_struct_ops_map_link_show_fdinfo,
>> + .fill_link_info = bpf_struct_ops_map_link_fill_link_info,
>> +};
>> +
>> +int link_create_struct_ops_map(union bpf_attr *attr, bpfptr_t uattr)
>> +{
>
> [..]
>
>> + struct bpf_link_primer link_primer;
>> + struct bpf_map *map;
>> + struct bpf_link *link = NULL;
>
> Are we still trying to keep reverse christmas trees?
Yes, I will reorder them.
>
>> + int err;
>> +
>> + map = bpf_map_get(attr->link_create.prog_fd);
>
> bpf_map_get can fail here?
We have already verified the `attach_type` of the link before calling
this function, so an error should not occur here. If it did, something
truly unusual would be going on. To be safe against future changes, I
will add a check here as well.
>
>> + if (map->map_type != BPF_MAP_TYPE_STRUCT_OPS)
>> + return -EINVAL;
>> +
>> + link = kzalloc(sizeof(*link), GFP_USER);
>> + if (!link) {
>> + err = -ENOMEM;
>> + goto err_out;
>> + }
>> + bpf_link_init(link, BPF_LINK_TYPE_STRUCT_OPS,
>> &bpf_struct_ops_map_lops, NULL);
>> + link->map = map;
>> +
>> + err = bpf_link_prime(link, &link_primer);
>> + if (err)
>> + goto err_out;
>> +
>> + return bpf_link_settle(&link_primer);
>> +
>> +err_out:
>> + bpf_map_put(map);
>> + kfree(link);
>> + return err;
>> +}
>> +
>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> index cda8d00f3762..54e172d8f5d1 100644
>> --- a/kernel/bpf/syscall.c
>> +++ b/kernel/bpf/syscall.c
>> @@ -2738,7 +2738,9 @@ static void bpf_link_free(struct bpf_link *link)
>> if (link->prog) {
>> /* detach BPF program, clean up used resources */
>> link->ops->release(link);
>> - bpf_prog_put(link->prog);
>> + if (link->type != BPF_LINK_TYPE_STRUCT_OPS)
>> + bpf_prog_put(link->prog);
>> + /* The struct_ops links clean up the map by themselves. */
>
> Why not more generic:
>
> if (link->prog)
> bpf_prog_put(link->prog);
>
> ?
The `prog` and `map` pointers now occupy the same space in the union.
I'm afraid this check won't work.
>
>
>> }
>> /* free bpf_link and its containing memory */
>> link->ops->dealloc(link);
>> @@ -2794,16 +2796,19 @@ static void bpf_link_show_fdinfo(struct
>> seq_file *m, struct file *filp)
>> const struct bpf_prog *prog = link->prog;
>> char prog_tag[sizeof(prog->tag) * 2 + 1] = { };
>
>> - bin2hex(prog_tag, prog->tag, sizeof(prog->tag));
>> seq_printf(m,
>> "link_type:\t%s\n"
>> - "link_id:\t%u\n"
>> - "prog_tag:\t%s\n"
>> - "prog_id:\t%u\n",
>> + "link_id:\t%u\n",
>> bpf_link_type_strs[link->type],
>> - link->id,
>> - prog_tag,
>> - prog->aux->id);
>> + link->id);
>> + if (link->type != BPF_LINK_TYPE_STRUCT_OPS) {
>> + bin2hex(prog_tag, prog->tag, sizeof(prog->tag));
>> + seq_printf(m,
>> + "prog_tag:\t%s\n"
>> + "prog_id:\t%u\n",
>> + prog_tag,
>> + prog->aux->id);
>> + }
>> if (link->ops->show_fdinfo)
>> link->ops->show_fdinfo(link, m);
>> }
>> @@ -4278,7 +4283,8 @@ static int bpf_link_get_info_by_fd(struct file
>> *file,
>
>> info.type = link->type;
>> info.id = link->id;
>> - info.prog_id = link->prog->aux->id;
>> + if (link->type != BPF_LINK_TYPE_STRUCT_OPS)
>> + info.prog_id = link->prog->aux->id;
>
> Here as well: should we have "link->type != BPF_LINK_TYPE_STRUCT_OPS" vs
> "link->prog != NULL" ?
Same as above. `map` and `prog` share the same memory space.
>
>
>> if (link->ops->fill_link_info) {
>> err = link->ops->fill_link_info(link, &info);
>> @@ -4531,6 +4537,8 @@ static int bpf_map_do_batch(const union
>> bpf_attr *attr,
>> return err;
>> }
>
>> +extern int link_create_struct_ops_map(union bpf_attr *attr, bpfptr_t
>> uattr);
>> +
>> #define BPF_LINK_CREATE_LAST_FIELD link_create.kprobe_multi.cookies
>> static int link_create(union bpf_attr *attr, bpfptr_t uattr)
>> {
>> @@ -4541,6 +4549,9 @@ static int link_create(union bpf_attr *attr,
>> bpfptr_t uattr)
>> if (CHECK_ATTR(BPF_LINK_CREATE))
>> return -EINVAL;
>
>> + if (attr->link_create.attach_type == BPF_STRUCT_OPS_MAP)
>> + return link_create_struct_ops_map(attr, uattr);
>> +
>> prog = bpf_prog_get(attr->link_create.prog_fd);
>> if (IS_ERR(prog))
>> return PTR_ERR(prog);
>> diff --git a/tools/include/uapi/linux/bpf.h
>> b/tools/include/uapi/linux/bpf.h
>> index 17afd2b35ee5..1e6cdd0f355d 100644
>> --- a/tools/include/uapi/linux/bpf.h
>> +++ b/tools/include/uapi/linux/bpf.h
>> @@ -1033,6 +1033,7 @@ enum bpf_attach_type {
>> BPF_PERF_EVENT,
>> BPF_TRACE_KPROBE_MULTI,
>> BPF_LSM_CGROUP,
>> + BPF_STRUCT_OPS_MAP,
>> __MAX_BPF_ATTACH_TYPE
>> };
>
>> @@ -6354,6 +6355,9 @@ struct bpf_link_info {
>> struct {
>> __u32 ifindex;
>> } xdp;
>> + struct {
>> + __u32 map_id;
>> + } struct_ops_map;
>> };
>> } __attribute__((aligned(8)));
>
>> diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
>> index 9aff98f42a3d..e44d49f17c86 100644
>> --- a/tools/lib/bpf/bpf.c
>> +++ b/tools/lib/bpf/bpf.c
>> @@ -731,6 +731,8 @@ int bpf_link_create(int prog_fd, int target_fd,
>> if (!OPTS_ZEROED(opts, tracing))
>> return libbpf_err(-EINVAL);
>> break;
>> + case BPF_STRUCT_OPS_MAP:
>> + break;
>> default:
>> if (!OPTS_ZEROED(opts, flags))
>> return libbpf_err(-EINVAL);
>> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
>> index 35a698eb825d..75ed95b7e455 100644
>> --- a/tools/lib/bpf/libbpf.c
>> +++ b/tools/lib/bpf/libbpf.c
>> @@ -115,6 +115,7 @@ static const char * const attach_type_name[] = {
>> [BPF_SK_REUSEPORT_SELECT_OR_MIGRATE] =
>> "sk_reuseport_select_or_migrate",
>> [BPF_PERF_EVENT] = "perf_event",
>> [BPF_TRACE_KPROBE_MULTI] = "trace_kprobe_multi",
>> + [BPF_STRUCT_OPS_MAP] = "struct_ops_map",
>> };
>
>> static const char * const link_type_name[] = {
>> --
>> 2.30.2
>
* Re: [PATCH bpf-next 2/7] net: Update an existing TCP congestion control algorithm.
2023-02-15 2:43 ` Stanislav Fomichev
@ 2023-02-15 18:15 ` Kui-Feng Lee
0 siblings, 0 replies; 43+ messages in thread
From: Kui-Feng Lee @ 2023-02-15 18:15 UTC (permalink / raw)
To: Stanislav Fomichev, Kui-Feng Lee
Cc: bpf, ast, martin.lau, song, kernel-team, andrii
On 2/14/23 18:43, Stanislav Fomichev wrote:
> On 02/14, Kui-Feng Lee wrote:
>> This feature lets you immediately transition to another congestion
>> control algorithm or implementation with the same name. Once a name
>> is updated, new connections will apply this new algorithm.
>
>> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
>> ---
>> include/linux/bpf.h | 1 +
>> include/net/tcp.h | 2 ++
>> net/bpf/bpf_dummy_struct_ops.c | 6 ++++++
>> net/ipv4/bpf_tcp_ca.c | 6 ++++++
>> net/ipv4/tcp_cong.c | 39 ++++++++++++++++++++++++++++++++++
>> 5 files changed, 54 insertions(+)
>
>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>> index 13683584b071..5fe39f56a760 100644
>> --- a/include/linux/bpf.h
>> +++ b/include/linux/bpf.h
>> @@ -1450,6 +1450,7 @@ struct bpf_struct_ops {
>> void *kdata, const void *udata);
>> int (*reg)(void *kdata);
>> void (*unreg)(void *kdata);
>> + int (*update)(void *kdata, void *old_kdata);
>> const struct btf_type *type;
>> const struct btf_type *value_type;
>> const char *name;
>> diff --git a/include/net/tcp.h b/include/net/tcp.h
>> index db9f828e9d1e..239cc0e2639c 100644
>> --- a/include/net/tcp.h
>> +++ b/include/net/tcp.h
>> @@ -1117,6 +1117,8 @@ struct tcp_congestion_ops {
>
>> int tcp_register_congestion_control(struct tcp_congestion_ops *type);
>> void tcp_unregister_congestion_control(struct tcp_congestion_ops
>> *type);
>> +int tcp_update_congestion_control(struct tcp_congestion_ops *type,
>> + struct tcp_congestion_ops *old_type);
>
>> void tcp_assign_congestion_control(struct sock *sk);
>> void tcp_init_congestion_control(struct sock *sk);
>> diff --git a/net/bpf/bpf_dummy_struct_ops.c
>> b/net/bpf/bpf_dummy_struct_ops.c
>> index ff4f89a2b02a..158f14e240d0 100644
>> --- a/net/bpf/bpf_dummy_struct_ops.c
>> +++ b/net/bpf/bpf_dummy_struct_ops.c
>> @@ -222,12 +222,18 @@ static void bpf_dummy_unreg(void *kdata)
>> {
>> }
>
>> +static int bpf_dummy_update(void *kdata, void *old_kdata)
>> +{
>> + return -EOPNOTSUPP;
>> +}
>> +
>> struct bpf_struct_ops bpf_bpf_dummy_ops = {
>> .verifier_ops = &bpf_dummy_verifier_ops,
>> .init = bpf_dummy_init,
>> .check_member = bpf_dummy_ops_check_member,
>> .init_member = bpf_dummy_init_member,
>> .reg = bpf_dummy_reg,
>> + .update = bpf_dummy_update,
>> .unreg = bpf_dummy_unreg,
>> .name = "bpf_dummy_ops",
>> };
>> diff --git a/net/ipv4/bpf_tcp_ca.c b/net/ipv4/bpf_tcp_ca.c
>> index 13fc0c185cd9..66ce5fadfe42 100644
>> --- a/net/ipv4/bpf_tcp_ca.c
>> +++ b/net/ipv4/bpf_tcp_ca.c
>> @@ -266,10 +266,16 @@ static void bpf_tcp_ca_unreg(void *kdata)
>> tcp_unregister_congestion_control(kdata);
>> }
>
>> +static int bpf_tcp_ca_update(void *kdata, void *old_kdata)
>> +{
>> + return tcp_update_congestion_control(kdata, old_kdata);
>> +}
>> +
>> struct bpf_struct_ops bpf_tcp_congestion_ops = {
>> .verifier_ops = &bpf_tcp_ca_verifier_ops,
>> .reg = bpf_tcp_ca_reg,
>> .unreg = bpf_tcp_ca_unreg,
>> + .update = bpf_tcp_ca_update,
>> .check_member = bpf_tcp_ca_check_member,
>> .init_member = bpf_tcp_ca_init_member,
>> .init = bpf_tcp_ca_init,
>> diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
>> index db8b4b488c31..22fd7c12360e 100644
>> --- a/net/ipv4/tcp_cong.c
>> +++ b/net/ipv4/tcp_cong.c
>> @@ -130,6 +130,45 @@ void tcp_unregister_congestion_control(struct
>> tcp_congestion_ops *ca)
>> }
>> EXPORT_SYMBOL_GPL(tcp_unregister_congestion_control);
>
>> +/* Replace a registered old ca with a new one.
>> + *
>> + * The new ca must have the same name as the old one, that has been
>> + * registered.
>> + */
>> +int tcp_update_congestion_control(struct tcp_congestion_ops *ca,
>> struct tcp_congestion_ops *old_ca)
>> +{
>> + struct tcp_congestion_ops *existing;
>> + int ret = 0;
>> +
>
> [..]
>
>> + /* all algorithms must implement these */
>> + if (!ca->ssthresh || !ca->undo_cwnd ||
>> + !(ca->cong_avoid || ca->cong_control)) {
>> + pr_err("%s does not implement required ops\n", old_ca->name);
>> + return -EINVAL;
>> + }
>> +
>> + ca->key = jhash(ca->name, sizeof(ca->name), strlen(ca->name));
>
> Can we have this as some common _validate method to avoid copy-paste
> from tcp_register_congestion_control.
Sure!
>
> Or, even better, can we can since function handle both cases?
>
> tcp_register_congestion_control(ca, old_ca);
> - when old_ca == NULL -> register
> - when old_ca != NULL -> try to update
We use this function in a lot of different places. To make it easier, I
added a new function instead of changing the old one.
>
>> +
>> + spin_lock(&tcp_cong_list_lock);
>> + existing = tcp_ca_find_key(ca->key);
>> + if (ca->key == TCP_CA_UNSPEC || !existing ||
>> strcmp(existing->name, ca->name)) {
>> + pr_notice("%s not registered or non-unique key\n",
>> + ca->name);
>> + ret = -EINVAL;
>> + } else if (existing != old_ca) {
>> + pr_notice("invalid old congestion control algorithm to
>> replace\n");
>> + ret = -EINVAL;
>> + } else {
>> + list_del_rcu(&existing->list);
>> + list_add_tail_rcu(&ca->list, &tcp_cong_list);
>> + pr_debug("%s updated\n", ca->name);
>> + }
>> + spin_unlock(&tcp_cong_list_lock);
>> +
>> + return ret;
>> +}
>> +EXPORT_SYMBOL_GPL(tcp_update_congestion_control);
>> +
>> u32 tcp_ca_get_key_by_name(struct net *net, const char *name, bool
>> *ecn_ca)
>> {
>> const struct tcp_congestion_ops *ca;
>> --
>> 2.30.2
>
* Re: [PATCH bpf-next 3/7] bpf: Register and unregister a struct_ops by their bpf_links.
2023-02-15 2:53 ` Stanislav Fomichev
@ 2023-02-15 18:29 ` Kui-Feng Lee
0 siblings, 0 replies; 43+ messages in thread
From: Kui-Feng Lee @ 2023-02-15 18:29 UTC (permalink / raw)
To: Stanislav Fomichev, Kui-Feng Lee
Cc: bpf, ast, martin.lau, song, kernel-team, andrii
On 2/14/23 18:53, Stanislav Fomichev wrote:
> On 02/14, Kui-Feng Lee wrote:
>> Registration via bpf_links ensures a uniform behavior, just like other
>> BPF programs. Previously, BPF struct_ops were registered/unregistered when
>> updating/deleting their values. Only the maps of struct_ops having
>> the BPF_F_LINK flag are allowed to back a bpf_link.
>
>> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
>> ---
>> include/uapi/linux/bpf.h | 3 ++
>> kernel/bpf/bpf_struct_ops.c | 59 +++++++++++++++++++++++++++++++---
>> tools/include/uapi/linux/bpf.h | 3 ++
>> 3 files changed, 61 insertions(+), 4 deletions(-)
>
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 1e6cdd0f355d..48d8b3058aa1 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -1267,6 +1267,9 @@ enum {
>
>> /* Create a map that is suitable to be an inner map with dynamic
>> max entries */
>> BPF_F_INNER_MAP = (1U << 12),
>> +
>> +/* Create a map that will be registered/unregistered by the backed
>> bpf_link */
>> + BPF_F_LINK = (1U << 13),
>> };
>
>> /* Flags for BPF_PROG_QUERY. */
>> diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
>> index 621c8e24481a..d16ca06cf09a 100644
>> --- a/kernel/bpf/bpf_struct_ops.c
>> +++ b/kernel/bpf/bpf_struct_ops.c
>> @@ -390,7 +390,7 @@ static int bpf_struct_ops_map_update_elem(struct
>> bpf_map *map, void *key,
>
>> mutex_lock(&st_map->lock);
>
>> - if (kvalue->state != BPF_STRUCT_OPS_STATE_INIT) {
>> + if (kvalue->state != BPF_STRUCT_OPS_STATE_INIT ||
>> refcount_read(&kvalue->refcnt)) {
>> err = -EBUSY;
>> goto unlock;
>> }
>> @@ -491,6 +491,12 @@ static int bpf_struct_ops_map_update_elem(struct
>> bpf_map *map, void *key,
>> *(unsigned long *)(udata + moff) = prog->aux->id;
>> }
>
>> + if (st_map->map.map_flags & BPF_F_LINK) {
>> + /* Let bpf_link handle registration & unregistration. */
>> + smp_store_release(&kvalue->state, BPF_STRUCT_OPS_STATE_INUSE);
>> + goto unlock;
>> + }
>> +
>> refcount_set(&kvalue->refcnt, 1);
>> bpf_map_inc(map);
>
>
> [..]
>
>> @@ -522,6 +528,7 @@ static int bpf_struct_ops_map_update_elem(struct
>> bpf_map *map, void *key,
>> kfree(tlinks);
>> mutex_unlock(&st_map->lock);
>> return err;
>> +
>> }
>
>> static int bpf_struct_ops_map_delete_elem(struct bpf_map *map, void
>> *key)
>
> Seems like a left over hunk?
You are right. I will remove it.
>
>> @@ -535,6 +542,8 @@ static int bpf_struct_ops_map_delete_elem(struct
>> bpf_map *map, void *key)
>> BPF_STRUCT_OPS_STATE_TOBEFREE);
>> switch (prev_state) {
>> case BPF_STRUCT_OPS_STATE_INUSE:
>> + if (st_map->map.map_flags & BPF_F_LINK)
>> + return 0;
>> st_map->st_ops->unreg(&st_map->kvalue.data);
>> if (refcount_dec_and_test(&st_map->kvalue.refcnt))
>> bpf_map_put(map);
>> @@ -585,7 +594,7 @@ static void bpf_struct_ops_map_free(struct
>> bpf_map *map)
>> static int bpf_struct_ops_map_alloc_check(union bpf_attr *attr)
>> {
>> if (attr->key_size != sizeof(unsigned int) || attr->max_entries
>> != 1 ||
>> - attr->map_flags || !attr->btf_vmlinux_value_type_id)
>> + (attr->map_flags & ~BPF_F_LINK) ||
>> !attr->btf_vmlinux_value_type_id)
>> return -EINVAL;
>> return 0;
>> }
>> @@ -638,6 +647,8 @@ static struct bpf_map
>> *bpf_struct_ops_map_alloc(union bpf_attr *attr)
>> set_vm_flush_reset_perms(st_map->image);
>> bpf_map_init_from_attr(map, attr);
>
>
> [..]
>
>> + map->map_flags |= attr->map_flags & BPF_F_LINK;
>
> You seem to have the following check above:
>
> if (.... (attr->map_flags & ~BPF_F_LINK) ...) return -EINVAL;
>
> And here you do:
>
> map->map_flags |= attr->map_flags & BPF_F_LINK;
>
> So maybe we can simplify to:
> map->map_flags |= attr->map_flags;
>
> ?
Great catch!
>
>> +
>> return map;
>> }
>
>> @@ -699,10 +710,25 @@ void bpf_struct_ops_put(const void *kdata)
>> }
>> }
>
>> +static void bpf_struct_ops_kvalue_put(struct bpf_struct_ops_value
>> *kvalue)
>> +{
>> + struct bpf_struct_ops_map *st_map;
>> +
>> + if (refcount_dec_and_test(&kvalue->refcnt)) {
>> + st_map = container_of(kvalue, struct bpf_struct_ops_map,
>> + kvalue);
>> + bpf_map_put(&st_map->map);
>> + }
>> +}
>> +
>> static void bpf_struct_ops_map_link_release(struct bpf_link *link)
>> {
>> + struct bpf_struct_ops_map *st_map;
>> +
>> if (link->map) {
>> - bpf_map_put(link->map);
>> + st_map = (struct bpf_struct_ops_map *)link->map;
>> + st_map->st_ops->unreg(&st_map->kvalue.data);
>> + bpf_struct_ops_kvalue_put(&st_map->kvalue);
>> link->map = NULL;
>> }
>> }
>> @@ -735,13 +761,15 @@ static const struct bpf_link_ops
>> bpf_struct_ops_map_lops = {
>
>> int link_create_struct_ops_map(union bpf_attr *attr, bpfptr_t uattr)
>> {
>> + struct bpf_struct_ops_map *st_map;
>> struct bpf_link_primer link_primer;
>> + struct bpf_struct_ops_value *kvalue;
>> struct bpf_map *map;
>> struct bpf_link *link = NULL;
>> int err;
>
>> map = bpf_map_get(attr->link_create.prog_fd);
>> - if (map->map_type != BPF_MAP_TYPE_STRUCT_OPS)
>> + if (map->map_type != BPF_MAP_TYPE_STRUCT_OPS || !(map->map_flags
>> & BPF_F_LINK))
>> return -EINVAL;
>
>> link = kzalloc(sizeof(*link), GFP_USER);
>> @@ -752,6 +780,29 @@ int link_create_struct_ops_map(union bpf_attr
>> *attr, bpfptr_t uattr)
>> bpf_link_init(link, BPF_LINK_TYPE_STRUCT_OPS,
>> &bpf_struct_ops_map_lops, NULL);
>> link->map = map;
>
>
> [..]
>
>> + if (map->map_flags & BPF_F_LINK) {
>
> We seem to bail out above when we don't have BPF_F_LINK flags above?
>
> if (map->map_type != BPF_MAP_TYPE_STRUCT_OPS || !(map->map_flags &
> BPF_F_LINK))
> return -EINVAL;
>
> So why check this 'if (map->map_flags & BPF_F_LINK)' condition here?
You are right! This check is not necessary anymore.
>
>
>> + st_map = (struct bpf_struct_ops_map *)map;
>> + kvalue = (struct bpf_struct_ops_value *)&st_map->kvalue;
>> +
>> + if (kvalue->state != BPF_STRUCT_OPS_STATE_INUSE ||
>> + refcount_read(&kvalue->refcnt) != 0) {
>> + err = -EINVAL;
>> + goto err_out;
>> + }
>> +
>> + refcount_set(&kvalue->refcnt, 1);
>> +
>> + set_memory_rox((long)st_map->image, 1);
>> + err = st_map->st_ops->reg(kvalue->data);
>> + if (err) {
>> + refcount_set(&kvalue->refcnt, 0);
>> +
>> + set_memory_nx((long)st_map->image, 1);
>> + set_memory_rw((long)st_map->image, 1);
>> + goto err_out;
>> + }
>> + }
>> +
>> err = bpf_link_prime(link, &link_primer);
>> if (err)
>> goto err_out;
>> diff --git a/tools/include/uapi/linux/bpf.h
>> b/tools/include/uapi/linux/bpf.h
>> index 1e6cdd0f355d..48d8b3058aa1 100644
>> --- a/tools/include/uapi/linux/bpf.h
>> +++ b/tools/include/uapi/linux/bpf.h
>> @@ -1267,6 +1267,9 @@ enum {
>
>> /* Create a map that is suitable to be an inner map with dynamic
>> max entries */
>> BPF_F_INNER_MAP = (1U << 12),
>> +
>> +/* Create a map that will be registered/unregistered by the backed
>> bpf_link */
>> + BPF_F_LINK = (1U << 13),
>> };
>
>> /* Flags for BPF_PROG_QUERY. */
>> --
>> 2.30.2
>
* Re: [PATCH bpf-next 4/7] libbpf: Create a bpf_link in bpf_map__attach_struct_ops().
2023-02-15 2:58 ` Stanislav Fomichev
@ 2023-02-15 18:44 ` Kui-Feng Lee
2023-02-15 18:48 ` Stanislav Fomichev
0 siblings, 1 reply; 43+ messages in thread
From: Kui-Feng Lee @ 2023-02-15 18:44 UTC (permalink / raw)
To: Stanislav Fomichev, Kui-Feng Lee
Cc: bpf, ast, martin.lau, song, kernel-team, andrii
On 2/14/23 18:58, Stanislav Fomichev wrote:
> On 02/14, Kui-Feng Lee wrote:
>> bpf_map__attach_struct_ops() was creating a dummy bpf_link as a
>> placeholder, but now it is constructing an authentic one by calling
>> bpf_link_create() if the map has the BPF_F_LINK flag.
>
>> You can flag a struct_ops map with BPF_F_LINK by calling
>> bpf_map__set_map_flags().
>
>> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
>> ---
>> tools/lib/bpf/libbpf.c | 73 +++++++++++++++++++++++++++++++++---------
>> 1 file changed, 58 insertions(+), 15 deletions(-)
>
>> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
>> index 75ed95b7e455..1eff6a03ddd9 100644
>> --- a/tools/lib/bpf/libbpf.c
>> +++ b/tools/lib/bpf/libbpf.c
>> @@ -11430,29 +11430,41 @@ struct bpf_link *bpf_program__attach(const
>> struct bpf_program *prog)
>> return link;
>> }
>
>> +struct bpf_link_struct_ops_map {
>> + struct bpf_link link;
>> + int map_fd;
>> +};
>
> Ah, ok, now you're adding bpf_link_struct_ops_map. I guess I'm now
> confused why you haven't done it in the first patch :-/
I just didn't want to mix the libbpf part and the kernel part in one patch.
>
> And what are these fake bpf_links? Can you share more about it means?
For the next version, I will detail it in the commit log. In a nutshell,
before this point there was no real bpf_link for struct_ops. Libbpf
provided an interface equivalent to other BPF program types by handing
out a simulated bpf_link instead of a true one from the kernel; that
fake bpf_link stores the FD of the struct_ops map rather than the FD of
a real bpf_link.
>
>> +
>> static int bpf_link__detach_struct_ops(struct bpf_link *link)
>> {
>> + struct bpf_link_struct_ops_map *st_link;
>> __u32 zero = 0;
>
>> - if (bpf_map_delete_elem(link->fd, &zero))
>> + st_link = container_of(link, struct bpf_link_struct_ops_map, link);
>> +
>> + if (st_link->map_fd < 0) {
>> + /* Fake bpf_link */
>> + if (bpf_map_delete_elem(link->fd, &zero))
>> + return -errno;
>> + return 0;
>> + }
>> +
>> + if (bpf_map_delete_elem(st_link->map_fd, &zero))
>> + return -errno;
>> +
>> + if (close(link->fd))
>> return -errno;
>
>> return 0;
>> }
>
>> -struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
>> +/*
>> + * Update the map with the prepared vdata.
>> + */
>> +static int bpf_map__update_vdata(const struct bpf_map *map)
>> {
>> struct bpf_struct_ops *st_ops;
>> - struct bpf_link *link;
>> __u32 i, zero = 0;
>> - int err;
>> -
>> - if (!bpf_map__is_struct_ops(map) || map->fd == -1)
>> - return libbpf_err_ptr(-EINVAL);
>> -
>> - link = calloc(1, sizeof(*link));
>> - if (!link)
>> - return libbpf_err_ptr(-EINVAL);
>
>> st_ops = map->st_ops;
>> for (i = 0; i < btf_vlen(st_ops->type); i++) {
>> @@ -11468,17 +11480,48 @@ struct bpf_link
>> *bpf_map__attach_struct_ops(const struct bpf_map *map)
>> *(unsigned long *)kern_data = prog_fd;
>> }
>
>> - err = bpf_map_update_elem(map->fd, &zero, st_ops->kern_vdata, 0);
>> + return bpf_map_update_elem(map->fd, &zero, st_ops->kern_vdata, 0);
>> +}
>> +
>> +struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
>> +{
>> + struct bpf_link_struct_ops_map *link;
>> + int err, fd;
>> +
>> + if (!bpf_map__is_struct_ops(map) || map->fd == -1)
>> + return libbpf_err_ptr(-EINVAL);
>> +
>> + link = calloc(1, sizeof(*link));
>> + if (!link)
>> + return libbpf_err_ptr(-EINVAL);
>> +
>> + err = bpf_map__update_vdata(map);
>> if (err) {
>> err = -errno;
>> free(link);
>> return libbpf_err_ptr(err);
>> }
>
>> - link->detach = bpf_link__detach_struct_ops;
>> - link->fd = map->fd;
>> + link->link.detach = bpf_link__detach_struct_ops;
>
>> - return link;
>> + if (!(map->def.map_flags & BPF_F_LINK)) {
>> + /* Fake bpf_link */
>> + link->link.fd = map->fd;
>> + link->map_fd = -1;
>> + return &link->link;
>> + }
>> +
>> + fd = bpf_link_create(map->fd, -1, BPF_STRUCT_OPS_MAP, NULL);
>> + if (fd < 0) {
>> + err = -errno;
>> + free(link);
>> + return libbpf_err_ptr(err);
>> + }
>> +
>> + link->link.fd = fd;
>> + link->map_fd = map->fd;
>> +
>> + return &link->link;
>> }
>
>> typedef enum bpf_perf_event_ret (*bpf_perf_event_print_t)(struct
>> perf_event_header *hdr,
>> --
>> 2.30.2
>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH bpf-next 1/7] bpf: Create links for BPF struct_ops maps.
2023-02-15 18:04 ` Kui-Feng Lee
@ 2023-02-15 18:44 ` Stanislav Fomichev
2023-02-15 20:24 ` Kui-Feng Lee
2023-02-15 20:30 ` Martin KaFai Lau
1 sibling, 1 reply; 43+ messages in thread
From: Stanislav Fomichev @ 2023-02-15 18:44 UTC (permalink / raw)
To: Kui-Feng Lee
Cc: Kui-Feng Lee, bpf, ast, martin.lau, song, kernel-team, andrii
On 02/15, Kui-Feng Lee wrote:
> Thank you for the feedback.
> On 2/14/23 18:39, Stanislav Fomichev wrote:
> > On 02/14, Kui-Feng Lee wrote:
> > > BPF struct_ops maps are employed directly to register TCP Congestion
> > > Control algorithms. Unlike other BPF programs that terminate when
> > > their links are gone, the struct_ops program reduces its refcount solely
> > > upon the death of its FD. The link of a BPF struct_ops map provides a
> > > uniform experience akin to other types of BPF programs.  A TCP
> > > Congestion Control algorithm will be unregistered upon destroying the
> > > FD in the following patches.
> >
> > > Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
> > > ---
> > >  include/linux/bpf.h            |  6 +++-
> > >  include/uapi/linux/bpf.h       |  4 +++
> > >  kernel/bpf/bpf_struct_ops.c    | 66 ++++++++++++++++++++++++++++++++++
> > >  kernel/bpf/syscall.c           | 29 ++++++++++-----
> > >  tools/include/uapi/linux/bpf.h |  4 +++
> > >  tools/lib/bpf/bpf.c            |  2 ++
> > >  tools/lib/bpf/libbpf.c         |  1 +
> > >  7 files changed, 102 insertions(+), 10 deletions(-)
> >
> > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > index 8b5d0b4c4ada..13683584b071 100644
> > > --- a/include/linux/bpf.h
> > > +++ b/include/linux/bpf.h
> > > @@ -1391,7 +1391,11 @@ struct bpf_link {
> > >  	u32 id;
> > >  	enum bpf_link_type type;
> > >  	const struct bpf_link_ops *ops;
> > > -	struct bpf_prog *prog;
> >
> > [..]
> >
> > > +	union {
> > > +		struct bpf_prog *prog;
> > > +		/* Backed by a struct_ops (type == BPF_LINK_UPDATE_STRUCT_OPS) */
> > > +		struct bpf_map *map;
> > > +	};
> >
> > Any reason you're not using the approach that has been used for other
> > links where we create a wrapping structure + container_of?
> >
struct bpf_struct_ops_link {
> > 	struct bpf_link link;
> > 	struct bpf_map *map;
> > };
> >
> `map` and `prog` are meant to be used separately, while `union` is
> designed
> for this purpose.
> The `container_of` approach also works. While both `container_of` and
> `union` are often used, is there any factor that makes the former a better
> choice than the latter?
I guess I'm missing something here. Because you seem to add that
container_of approach later on with 'fake' links. Maybe you can clarify
on the patch where I made that comment?
> > >  	struct work_struct work;
> > >  };
> >
> > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > index 17afd2b35ee5..1e6cdd0f355d 100644
> > > --- a/include/uapi/linux/bpf.h
> > > +++ b/include/uapi/linux/bpf.h
> > > @@ -1033,6 +1033,7 @@ enum bpf_attach_type {
> > >  	BPF_PERF_EVENT,
> > >  	BPF_TRACE_KPROBE_MULTI,
> > >  	BPF_LSM_CGROUP,
> > > +	BPF_STRUCT_OPS_MAP,
> > >  	__MAX_BPF_ATTACH_TYPE
> > >  };
> >
> > > @@ -6354,6 +6355,9 @@ struct bpf_link_info {
> > >  		struct {
> > >  			__u32 ifindex;
> > >  		} xdp;
> > > +		struct {
> > > +			__u32 map_id;
> > > +		} struct_ops_map;
> > >  	};
> > >  } __attribute__((aligned(8)));
> >
> > > diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
> > > index ece9870cab68..621c8e24481a 100644
> > > --- a/kernel/bpf/bpf_struct_ops.c
> > > +++ b/kernel/bpf/bpf_struct_ops.c
> > > @@ -698,3 +698,69 @@ void bpf_struct_ops_put(const void *kdata)
> > >  		call_rcu(&st_map->rcu, bpf_struct_ops_put_rcu);
> > >  	}
> > >  }
> > > +
> > > +static void bpf_struct_ops_map_link_release(struct bpf_link *link)
> > > +{
> > > +	if (link->map) {
> > > +		bpf_map_put(link->map);
> > > +		link->map = NULL;
> > > +	}
> > > +}
> > > +
> > > +static void bpf_struct_ops_map_link_dealloc(struct bpf_link *link)
> > > +{
> > > +	kfree(link);
> > > +}
> > > +
> > > +static void bpf_struct_ops_map_link_show_fdinfo(const struct bpf_link *link,
> > > +						struct seq_file *seq)
> > > +{
> > > +	seq_printf(seq, "map_id:\t%d\n",
> > > +		   link->map->id);
> > > +}
> > > +
> > > +static int bpf_struct_ops_map_link_fill_link_info(const struct bpf_link *link,
> > > +						  struct bpf_link_info *info)
> > > +{
> > > +	info->struct_ops_map.map_id = link->map->id;
> > > +	return 0;
> > > +}
> > > +
> > > +static const struct bpf_link_ops bpf_struct_ops_map_lops = {
> > > +	.release = bpf_struct_ops_map_link_release,
> > > +	.dealloc = bpf_struct_ops_map_link_dealloc,
> > > +	.show_fdinfo = bpf_struct_ops_map_link_show_fdinfo,
> > > +	.fill_link_info = bpf_struct_ops_map_link_fill_link_info,
> > > +};
> > > +
> > > +int link_create_struct_ops_map(union bpf_attr *attr, bpfptr_t uattr)
> > > +{
> >
> > [..]
> >
> > > +	struct bpf_link_primer link_primer;
> > > +	struct bpf_map *map;
> > > +	struct bpf_link *link = NULL;
> >
> > Are we still trying to keep reverse christmas trees?
> Yes, I will reorder them.
> >
> > > +	int err;
> > > +
> > > +	map = bpf_map_get(attr->link_create.prog_fd);
> >
> > bpf_map_get can fail here?
> We have already verified the `attach_type` of the link before calling this
> function, so an error should not occur. If it does happen, however,
> something truly unusual must be happening. To ensure maximum protection
> and
> avoid this issue in the future, I will add a check here as well.
If we've already checked, it's fine not to check here. I haven't looked
at the real path, thanks for clarifying.
> >
> > > +	if (map->map_type != BPF_MAP_TYPE_STRUCT_OPS)
> > > +		return -EINVAL;
> > > +
> > > +	link = kzalloc(sizeof(*link), GFP_USER);
> > > +	if (!link) {
> > > +		err = -ENOMEM;
> > > +		goto err_out;
> > > +	}
> > > +	bpf_link_init(link, BPF_LINK_TYPE_STRUCT_OPS, &bpf_struct_ops_map_lops, NULL);
> > > +	link->map = map;
> > > +
> > > +	err = bpf_link_prime(link, &link_primer);
> > > +	if (err)
> > > +		goto err_out;
> > > +
> > > +	return bpf_link_settle(&link_primer);
> > > +
> > > +err_out:
> > > +	bpf_map_put(map);
> > > +	kfree(link);
> > > +	return err;
> > > +}
> > > +
> > > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > > index cda8d00f3762..54e172d8f5d1 100644
> > > --- a/kernel/bpf/syscall.c
> > > +++ b/kernel/bpf/syscall.c
> > > @@ -2738,7 +2738,9 @@ static void bpf_link_free(struct bpf_link *link)
> > >  	if (link->prog) {
> > >  		/* detach BPF program, clean up used resources */
> > >  		link->ops->release(link);
> > > -		bpf_prog_put(link->prog);
> > > +		if (link->type != BPF_LINK_TYPE_STRUCT_OPS)
> > > +			bpf_prog_put(link->prog);
> > > +		/* The struct_ops links clean up the map by themselves. */
> >
> > Why not more generic:
> >
> > if (link->prog)
> > 	bpf_prog_put(link->prog);
> >
> > ?
> The `prog` and `map` pointers now occupy the same space, so I'm afraid
> this check won't work.
Hmm, good point. In this case: why not have separate prog/map pointers
instead of a union? Are we 100% sure struct_ops is unique enough
and there won't ever be another map-based links?
> >
> >
> > >  	}
> > >  	/* free bpf_link and its containing memory */
> > >  	link->ops->dealloc(link);
> > > @@ -2794,16 +2796,19 @@ static void bpf_link_show_fdinfo(struct seq_file *m, struct file *filp)
> > >  	const struct bpf_prog *prog = link->prog;
> > >  	char prog_tag[sizeof(prog->tag) * 2 + 1] = { };
> >
> > > -	bin2hex(prog_tag, prog->tag, sizeof(prog->tag));
> > >  	seq_printf(m,
> > >  		   "link_type:\t%s\n"
> > > -		   "link_id:\t%u\n"
> > > -		   "prog_tag:\t%s\n"
> > > -		   "prog_id:\t%u\n",
> > > +		   "link_id:\t%u\n",
> > >  		   bpf_link_type_strs[link->type],
> > > -		   link->id,
> > > -		   prog_tag,
> > > -		   prog->aux->id);
> > > +		   link->id);
> > > +	if (link->type != BPF_LINK_TYPE_STRUCT_OPS) {
> > > +		bin2hex(prog_tag, prog->tag, sizeof(prog->tag));
> > > +		seq_printf(m,
> > > +			   "prog_tag:\t%s\n"
> > > +			   "prog_id:\t%u\n",
> > > +			   prog_tag,
> > > +			   prog->aux->id);
> > > +	}
> > >  	if (link->ops->show_fdinfo)
> > >  		link->ops->show_fdinfo(link, m);
> > >  }
> > > @@ -4278,7 +4283,8 @@ static int bpf_link_get_info_by_fd(struct file *file,
> >
> > >  	info.type = link->type;
> > >  	info.id = link->id;
> > > -	info.prog_id = link->prog->aux->id;
> > > +	if (link->type != BPF_LINK_TYPE_STRUCT_OPS)
> > > +		info.prog_id = link->prog->aux->id;
> >
> > Here as well: should we have "link->type != BPF_LINK_TYPE_STRUCT_OPS" vs
> > "link->prog != NULL" ?
> Same as above.  `map` and `prog` share the same memory space.
> >
> >
> > >  	if (link->ops->fill_link_info) {
> > >  		err = link->ops->fill_link_info(link, &info);
> > > @@ -4531,6 +4537,8 @@ static int bpf_map_do_batch(const union bpf_attr *attr,
> > >  	return err;
> > >  }
> >
> > > +extern int link_create_struct_ops_map(union bpf_attr *attr, bpfptr_t uattr);
> > > +
> > >  #define BPF_LINK_CREATE_LAST_FIELD link_create.kprobe_multi.cookies
> > >  static int link_create(union bpf_attr *attr, bpfptr_t uattr)
> > >  {
> > > @@ -4541,6 +4549,9 @@ static int link_create(union bpf_attr *attr, bpfptr_t uattr)
> > >  	if (CHECK_ATTR(BPF_LINK_CREATE))
> > >  		return -EINVAL;
> >
> > > +	if (attr->link_create.attach_type == BPF_STRUCT_OPS_MAP)
> > > +		return link_create_struct_ops_map(attr, uattr);
> > > +
> > >  	prog = bpf_prog_get(attr->link_create.prog_fd);
> > >  	if (IS_ERR(prog))
> > >  		return PTR_ERR(prog);
> > > diff --git a/tools/include/uapi/linux/bpf.h
> > > b/tools/include/uapi/linux/bpf.h
> > > index 17afd2b35ee5..1e6cdd0f355d 100644
> > > --- a/tools/include/uapi/linux/bpf.h
> > > +++ b/tools/include/uapi/linux/bpf.h
> > > @@ -1033,6 +1033,7 @@ enum bpf_attach_type {
> > >  	BPF_PERF_EVENT,
> > >  	BPF_TRACE_KPROBE_MULTI,
> > >  	BPF_LSM_CGROUP,
> > > +	BPF_STRUCT_OPS_MAP,
> > >  	__MAX_BPF_ATTACH_TYPE
> > >  };
> >
> > > @@ -6354,6 +6355,9 @@ struct bpf_link_info {
> > >  		struct {
> > >  			__u32 ifindex;
> > >  		} xdp;
> > > +		struct {
> > > +			__u32 map_id;
> > > +		} struct_ops_map;
> > >  	};
> > >  } __attribute__((aligned(8)));
> >
> > > diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
> > > index 9aff98f42a3d..e44d49f17c86 100644
> > > --- a/tools/lib/bpf/bpf.c
> > > +++ b/tools/lib/bpf/bpf.c
> > > @@ -731,6 +731,8 @@ int bpf_link_create(int prog_fd, int target_fd,
> > >  		if (!OPTS_ZEROED(opts, tracing))
> > >  			return libbpf_err(-EINVAL);
> > >  		break;
> > > +	case BPF_STRUCT_OPS_MAP:
> > > +		break;
> > >  	default:
> > >  		if (!OPTS_ZEROED(opts, flags))
> > >  			return libbpf_err(-EINVAL);
> > > diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> > > index 35a698eb825d..75ed95b7e455 100644
> > > --- a/tools/lib/bpf/libbpf.c
> > > +++ b/tools/lib/bpf/libbpf.c
> > > @@ -115,6 +115,7 @@ static const char * const attach_type_name[] = {
> > >  	[BPF_SK_REUSEPORT_SELECT_OR_MIGRATE]	= "sk_reuseport_select_or_migrate",
> > >  	[BPF_PERF_EVENT]		= "perf_event",
> > >  	[BPF_TRACE_KPROBE_MULTI]	= "trace_kprobe_multi",
> > > +	[BPF_STRUCT_OPS_MAP]		= "struct_ops_map",
> > >  };
> >
> > >  static const char * const link_type_name[] = {
> > > --
> > > 2.30.2
> >
* Re: [PATCH bpf-next 4/7] libbpf: Create a bpf_link in bpf_map__attach_struct_ops().
2023-02-15 18:44 ` Kui-Feng Lee
@ 2023-02-15 18:48 ` Stanislav Fomichev
2023-02-15 22:20 ` Kui-Feng Lee
0 siblings, 1 reply; 43+ messages in thread
From: Stanislav Fomichev @ 2023-02-15 18:48 UTC (permalink / raw)
To: Kui-Feng Lee
Cc: Kui-Feng Lee, bpf, ast, martin.lau, song, kernel-team, andrii
On Wed, Feb 15, 2023 at 10:44 AM Kui-Feng Lee <sinquersw@gmail.com> wrote:
>
>
> On 2/14/23 18:58, Stanislav Fomichev wrote:
> > On 02/14, Kui-Feng Lee wrote:
> >> bpf_map__attach_struct_ops() was creating a dummy bpf_link as a
> >> placeholder, but now it is constructing an authentic one by calling
> >> bpf_link_create() if the map has the BPF_F_LINK flag.
> >
> >> You can flag a struct_ops map with BPF_F_LINK by calling
> >> bpf_map__set_map_flags().
> >
> >> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
> >> ---
> >> tools/lib/bpf/libbpf.c | 73 +++++++++++++++++++++++++++++++++---------
> >> 1 file changed, 58 insertions(+), 15 deletions(-)
> >
> >> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> >> index 75ed95b7e455..1eff6a03ddd9 100644
> >> --- a/tools/lib/bpf/libbpf.c
> >> +++ b/tools/lib/bpf/libbpf.c
> >> @@ -11430,29 +11430,41 @@ struct bpf_link *bpf_program__attach(const
> >> struct bpf_program *prog)
> >> return link;
> >> }
> >
> >> +struct bpf_link_struct_ops_map {
> >> + struct bpf_link link;
> >> + int map_fd;
> >> +};
> >
> > Ah, ok, now you're adding bpf_link_struct_ops_map. I guess I'm now
> > confused why you haven't done it in the first patch :-/
>
> Just didn't want to mix the libbpf part and the kernel part in one patch.
Ah, shoot, I completely missed that it's libbpf. So in this case, can
we use the same strategy for the kernel links? Instead of a union,
have an outer struct + container_of? If not, why not?
>
> >
> > And what are these fake bpf_links? Can you share more about it means?
>
> For the next version, I will detail it in the commit log. In a nutshell,
> before this point, there was no bpf_link for struct_ops. Libbpf
> attempted to create an equivalent interface to other BPF programs by
> providing a simulated bpf_link instead of a true one from the kernel;
> that fake bpf_link stores FDs associated with struct_ops maps rather
> than real bpf_links.
>
>
> >
> >> +
> >> static int bpf_link__detach_struct_ops(struct bpf_link *link)
> >> {
> >> + struct bpf_link_struct_ops_map *st_link;
> >> __u32 zero = 0;
> >
> >> - if (bpf_map_delete_elem(link->fd, &zero))
> >> + st_link = container_of(link, struct bpf_link_struct_ops_map, link);
> >> +
> >> + if (st_link->map_fd < 0) {
> >> + /* Fake bpf_link */
> >> + if (bpf_map_delete_elem(link->fd, &zero))
> >> + return -errno;
> >> + return 0;
> >> + }
> >> +
> >> + if (bpf_map_delete_elem(st_link->map_fd, &zero))
> >> + return -errno;
> >> +
> >> + if (close(link->fd))
> >> return -errno;
> >
> >> return 0;
> >> }
> >
> >> -struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
> >> +/*
> >> + * Update the map with the prepared vdata.
> >> + */
> >> +static int bpf_map__update_vdata(const struct bpf_map *map)
> >> {
> >> struct bpf_struct_ops *st_ops;
> >> - struct bpf_link *link;
> >> __u32 i, zero = 0;
> >> - int err;
> >> -
> >> - if (!bpf_map__is_struct_ops(map) || map->fd == -1)
> >> - return libbpf_err_ptr(-EINVAL);
> >> -
> >> - link = calloc(1, sizeof(*link));
> >> - if (!link)
> >> - return libbpf_err_ptr(-EINVAL);
> >
> >> st_ops = map->st_ops;
> >> for (i = 0; i < btf_vlen(st_ops->type); i++) {
> >> @@ -11468,17 +11480,48 @@ struct bpf_link
> >> *bpf_map__attach_struct_ops(const struct bpf_map *map)
> >> *(unsigned long *)kern_data = prog_fd;
> >> }
> >
> >> - err = bpf_map_update_elem(map->fd, &zero, st_ops->kern_vdata, 0);
> >> + return bpf_map_update_elem(map->fd, &zero, st_ops->kern_vdata, 0);
> >> +}
> >> +
> >> +struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
> >> +{
> >> + struct bpf_link_struct_ops_map *link;
> >> + int err, fd;
> >> +
> >> + if (!bpf_map__is_struct_ops(map) || map->fd == -1)
> >> + return libbpf_err_ptr(-EINVAL);
> >> +
> >> + link = calloc(1, sizeof(*link));
> >> + if (!link)
> >> + return libbpf_err_ptr(-EINVAL);
> >> +
> >> + err = bpf_map__update_vdata(map);
> >> if (err) {
> >> err = -errno;
> >> free(link);
> >> return libbpf_err_ptr(err);
> >> }
> >
> >> - link->detach = bpf_link__detach_struct_ops;
> >> - link->fd = map->fd;
> >> + link->link.detach = bpf_link__detach_struct_ops;
> >
> >> - return link;
> >> + if (!(map->def.map_flags & BPF_F_LINK)) {
> >> + /* Fake bpf_link */
> >> + link->link.fd = map->fd;
> >> + link->map_fd = -1;
> >> + return &link->link;
> >> + }
> >> +
> >> + fd = bpf_link_create(map->fd, -1, BPF_STRUCT_OPS_MAP, NULL);
> >> + if (fd < 0) {
> >> + err = -errno;
> >> + free(link);
> >> + return libbpf_err_ptr(err);
> >> + }
> >> +
> >> + link->link.fd = fd;
> >> + link->map_fd = map->fd;
> >> +
> >> + return &link->link;
> >> }
> >
> >> typedef enum bpf_perf_event_ret (*bpf_perf_event_print_t)(struct
> >> perf_event_header *hdr,
> >> --
> >> 2.30.2
> >
* Re: [PATCH bpf-next 1/7] bpf: Create links for BPF struct_ops maps.
2023-02-15 18:44 ` Stanislav Fomichev
@ 2023-02-15 20:24 ` Kui-Feng Lee
2023-02-15 21:28 ` Martin KaFai Lau
0 siblings, 1 reply; 43+ messages in thread
From: Kui-Feng Lee @ 2023-02-15 20:24 UTC (permalink / raw)
To: Stanislav Fomichev
Cc: Kui-Feng Lee, bpf, ast, martin.lau, song, kernel-team, andrii
On 2/15/23 10:44, Stanislav Fomichev wrote:
> On 02/15, Kui-Feng Lee wrote:
>> Thank you for the feedback.
>
>
>> On 2/14/23 18:39, Stanislav Fomichev wrote:
>> > On 02/14, Kui-Feng Lee wrote:
[..]
>> >
>> > > +	if (map->map_type != BPF_MAP_TYPE_STRUCT_OPS)
>> > > +		return -EINVAL;
>> > > +
>> > > +	link = kzalloc(sizeof(*link), GFP_USER);
>> > > +	if (!link) {
>> > > +		err = -ENOMEM;
>> > > +		goto err_out;
>> > > +	}
>> > > +	bpf_link_init(link, BPF_LINK_TYPE_STRUCT_OPS, &bpf_struct_ops_map_lops, NULL);
>> > > +	link->map = map;
>> > > +
>> > > +	err = bpf_link_prime(link, &link_primer);
>> > > +	if (err)
>> > > +		goto err_out;
>> > > +
>> > > +	return bpf_link_settle(&link_primer);
>> > > +
>> > > +err_out:
>> > > +	bpf_map_put(map);
>> > > +	kfree(link);
>> > > +	return err;
>> > > +}
>> > > +}
>> > > +
>> > > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> > > index cda8d00f3762..54e172d8f5d1 100644
>> > > --- a/kernel/bpf/syscall.c
>> > > +++ b/kernel/bpf/syscall.c
>> > > @@ -2738,7 +2738,9 @@ static void bpf_link_free(struct bpf_link
>> *link)
>> > > ����� if (link->prog) {
>> > > ��������� /* detach BPF program, clean up used resources */
>> > > ��������� link->ops->release(link);
>> > > -������� bpf_prog_put(link->prog);
>> > > +������� if (link->type != BPF_LINK_TYPE_STRUCT_OPS)
>> > > +����������� bpf_prog_put(link->prog);
>> > > +������� /* The struct_ops links clean up map by them-selves. */
>> >
>> > Why not more generic:
>> >
>> > if (link->prog)
>> > 	bpf_prog_put(link->prog);
>> >
>> > ?
>> The `prog` and `map` pointers now occupy the same space, so I'm afraid
>> this check won't work.
>
> Hmm, good point. In this case: why not have separate prog/map pointers
> instead of a union? Are we 100% sure struct_ops is unique enough
> and there won't ever be another map-based links?
>
Good question!
bpf_link is used to attach a single BPF program with a hook now. This
patch takes things one step further, allowing the attachment of
struct_ops. We can think of it as attaching a set of BPF programs to a
pre-existing set of hooks. I would say that bpf_link now represents the
attachments of a set of BPF programs to a pre-defined set of hooks. The
size of the set is one in the case of attaching a single BPF program.
Is creating a new map-based link indicative of introducing an entirely
distinct type of map that reflects a set of BPF programs? If so, why
doesn't struct_ops work? If it happens, we may need to create a function
to validate the attach_type associated with any given map type.
* Re: [PATCH bpf-next 1/7] bpf: Create links for BPF struct_ops maps.
2023-02-15 18:04 ` Kui-Feng Lee
2023-02-15 18:44 ` Stanislav Fomichev
@ 2023-02-15 20:30 ` Martin KaFai Lau
2023-02-15 20:55 ` Kui-Feng Lee
1 sibling, 1 reply; 43+ messages in thread
From: Martin KaFai Lau @ 2023-02-15 20:30 UTC (permalink / raw)
To: Kui-Feng Lee, Kui-Feng Lee
Cc: bpf, ast, song, kernel-team, andrii, Stanislav Fomichev
On 2/15/23 10:04 AM, Kui-Feng Lee wrote:
>>> + int err;
>>> +
>>> + map = bpf_map_get(attr->link_create.prog_fd);
>>
>> bpf_map_get can fail here?
>
>
> We have already verified the `attach_type` of the link before calling this
> function, so an error should not occur. If it does happen, however, something
> truly unusual must be happening. To ensure maximum protection and avoid this
> issue in the future, I will add a check here as well.
bpf_map_get() could fail. A valid attach_type does not mean prog_fd (actually
map_fd here) is also valid.
* Re: [PATCH bpf-next 1/7] bpf: Create links for BPF struct_ops maps.
2023-02-15 20:30 ` Martin KaFai Lau
@ 2023-02-15 20:55 ` Kui-Feng Lee
0 siblings, 0 replies; 43+ messages in thread
From: Kui-Feng Lee @ 2023-02-15 20:55 UTC (permalink / raw)
To: Martin KaFai Lau, Kui-Feng Lee
Cc: bpf, ast, song, kernel-team, andrii, Stanislav Fomichev
On 2/15/23 12:30, Martin KaFai Lau wrote:
> On 2/15/23 10:04 AM, Kui-Feng Lee wrote:
>>>> + int err;
>>>> +
>>>> + map = bpf_map_get(attr->link_create.prog_fd);
>>>
>>> bpf_map_get can fail here?
>>
>>
>> We have already verified the `attach_type` of the link before calling
>> this function, so an error should not occur. If it does happen,
>> however, something truly unusual must be happening. To ensure maximum
>> protection and avoid this issue in the future, I will add a check
>> here as well.
>
> bpf_map_get() could fail. A valid attach_type does not mean prog_fd
> (actually map_fd here) is also valid.
You are right! I must have messed it up with the update one.
I will add a check.
* Re: [PATCH bpf-next 1/7] bpf: Create links for BPF struct_ops maps.
2023-02-15 20:24 ` Kui-Feng Lee
@ 2023-02-15 21:28 ` Martin KaFai Lau
0 siblings, 0 replies; 43+ messages in thread
From: Martin KaFai Lau @ 2023-02-15 21:28 UTC (permalink / raw)
To: Kui-Feng Lee, Kui-Feng Lee
Cc: bpf, ast, song, kernel-team, andrii, Stanislav Fomichev
On 2/15/23 12:24 PM, Kui-Feng Lee wrote:
>
> On 2/15/23 10:44, Stanislav Fomichev wrote:
>> On 02/15, Kui-Feng Lee wrote:
>>> Thank you for the feedback.
>>
>>
>>> On 2/14/23 18:39, Stanislav Fomichev wrote:
>>> > On 02/14, Kui-Feng Lee wrote:
> [..]
>>> >
>>> > > +	if (map->map_type != BPF_MAP_TYPE_STRUCT_OPS)
>>> > > +		return -EINVAL;
>>> > > +
>>> > > +	link = kzalloc(sizeof(*link), GFP_USER);
>>> > > +	if (!link) {
>>> > > +		err = -ENOMEM;
>>> > > +		goto err_out;
>>> > > +	}
>>> > > +	bpf_link_init(link, BPF_LINK_TYPE_STRUCT_OPS, &bpf_struct_ops_map_lops, NULL);
>>> > > +	link->map = map;
>>> > > +
>>> > > +	err = bpf_link_prime(link, &link_primer);
>>> > > +	if (err)
>>> > > +		goto err_out;
>>> > > +
>>> > > +	return bpf_link_settle(&link_primer);
>>> > > +
>>> > > +err_out:
>>> > > +	bpf_map_put(map);
>>> > > +	kfree(link);
>>> > > +	return err;
>>> > > +}
>>> > > +
>>> > > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>>> > > index cda8d00f3762..54e172d8f5d1 100644
>>> > > --- a/kernel/bpf/syscall.c
>>> > > +++ b/kernel/bpf/syscall.c
>>> > > @@ -2738,7 +2738,9 @@ static void bpf_link_free(struct bpf_link *link)
>>> > >  	if (link->prog) {
>>> > >  		/* detach BPF program, clean up used resources */
>>> > >  		link->ops->release(link);
>>> > > -		bpf_prog_put(link->prog);
>>> > > +		if (link->type != BPF_LINK_TYPE_STRUCT_OPS)
>>> > > +			bpf_prog_put(link->prog);
>>> > > +		/* The struct_ops links clean up the map by themselves. */
>>> >
>>> > Why not more generic:
>>> >
>>> > if (link->prog)
>>> > 	bpf_prog_put(link->prog);
>>> >
>>> > ?
>>> The `prog` and `map` pointers now occupy the same space. I'm afraid
>>> this check won't work.
>>
>> Hmm, good point. In this case: why not have separate prog/map pointers
>> instead of a union? Are we 100% sure struct_ops is unique enough
>> and there won't ever be another map-based links?
>>
> Good question!
>
> bpf_link is used to attach a single BPF program with a hook now. This patch
> takes things one step further, allowing the attachment of struct_ops. We can
> think of it as attaching a set of BPF programs to a pre-existing set of hooks. I
> would say that bpf_link now represents the attachments of a set of BPF programs
> to a pre-defined set of hooks. The size of a set is one for the case of
> attaching single BPF program.
>
> Is creating a new map-based link indicative of introducing an entirely distinct
> type of map that reflects a set of BPF programs? If so, why doesn't struct_ops
> work? If it happens, we may need to create a function to validate the
> attach_type associated with any given map type.
>
I also prefer separating 'prog' and 'map' in the link. The struct_ops link may be
the only one without a prog now, but future link types may also lack one, so
testing link->prog is less tied to a specific link type. Once separated, it
makes sense to push the link's specific field to a new struct, so the following
(from Stan) makes more sense:
struct bpf_struct_ops_link {
	struct bpf_link link;
	struct bpf_map *map;
};
In bpf_link_free(), there is no need to do an extra check for 'link->type !=
BPF_LINK_TYPE_STRUCT_OPS'. bpf_struct_ops_map_link_release() can be done
together in bpf_struct_ops_map_link_dealloc().
* Re: [PATCH bpf-next 4/7] libbpf: Create a bpf_link in bpf_map__attach_struct_ops().
2023-02-15 18:48 ` Stanislav Fomichev
@ 2023-02-15 22:20 ` Kui-Feng Lee
0 siblings, 0 replies; 43+ messages in thread
From: Kui-Feng Lee @ 2023-02-15 22:20 UTC (permalink / raw)
To: Stanislav Fomichev
Cc: Kui-Feng Lee, bpf, ast, martin.lau, song, kernel-team, andrii
On 2/15/23 10:48, Stanislav Fomichev wrote:
> On Wed, Feb 15, 2023 at 10:44 AM Kui-Feng Lee <sinquersw@gmail.com> wrote:
>>
>>
>> On 2/14/23 18:58, Stanislav Fomichev wrote:
>>> On 02/14, Kui-Feng Lee wrote:
>>>> bpf_map__attach_struct_ops() was creating a dummy bpf_link as a
>>>> placeholder, but now it is constructing an authentic one by calling
>>>> bpf_link_create() if the map has the BPF_F_LINK flag.
>>>
>>>> You can flag a struct_ops map with BPF_F_LINK by calling
>>>> bpf_map__set_map_flags().
>>>
>>>> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
>>>> ---
>>>> tools/lib/bpf/libbpf.c | 73 +++++++++++++++++++++++++++++++++---------
>>>> 1 file changed, 58 insertions(+), 15 deletions(-)
>>>
>>>> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
>>>> index 75ed95b7e455..1eff6a03ddd9 100644
>>>> --- a/tools/lib/bpf/libbpf.c
>>>> +++ b/tools/lib/bpf/libbpf.c
>>>> @@ -11430,29 +11430,41 @@ struct bpf_link *bpf_program__attach(const
>>>> struct bpf_program *prog)
>>>> return link;
>>>> }
>>>
>>>> +struct bpf_link_struct_ops_map {
>>>> + struct bpf_link link;
>>>> + int map_fd;
>>>> +};
>>>
>>> Ah, ok, now you're adding bpf_link_struct_ops_map. I guess I'm now
>>> confused why you haven't done it in the first patch :-/
>>
>> Just didn't want to mix the libbpf part and the kernel part in one patch.
>
> Ah, shoot, I completely missed that it's libbpf. So in this case, can
> we use the same strategy for the kernel links? Instead of a union,
> have an outer struct + container_of? If not, why not?
The reason I use `container_of` here is that we need both FDs in libbpf
to keep the behavior as consistent with the existing one as possible. The
value of the struct_ops map should be deleted when a bpf_link is detached.
Back to your question: we can take the `container_of` approach. The only
concern I have is the few additional bytes, although that is not a big
issue. I will move to this approach in the next version.
>
>
>>
>>>
>>> And what are these fake bpf_links? Can you share more about it means?
>>
>> For the next version, I will detail it in the commit log. In a nutshell,
>> before this point, there was no bpf_link for struct_ops. Libbpf
>> attempted to create an equivalent interface to other BPF programs by
>> providing a simulated bpf_link instead of a true one from the kernel;
>> that fake bpf_link stores FDs associated with struct_ops maps rather
>> than real bpf_links.
>>
>>
>>>
>>>> +
>>>> static int bpf_link__detach_struct_ops(struct bpf_link *link)
>>>> {
>>>> + struct bpf_link_struct_ops_map *st_link;
>>>> __u32 zero = 0;
>>>
>>>> - if (bpf_map_delete_elem(link->fd, &zero))
>>>> + st_link = container_of(link, struct bpf_link_struct_ops_map, link);
>>>> +
>>>> + if (st_link->map_fd < 0) {
>>>> + /* Fake bpf_link */
>>>> + if (bpf_map_delete_elem(link->fd, &zero))
>>>> + return -errno;
>>>> + return 0;
>>>> + }
>>>> +
>>>> + if (bpf_map_delete_elem(st_link->map_fd, &zero))
>>>> + return -errno;
>>>> +
>>>> + if (close(link->fd))
>>>> return -errno;
>>>
>>>> return 0;
>>>> }
>>>
>>>> -struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
>>>> +/*
>>>> + * Update the map with the prepared vdata.
>>>> + */
>>>> +static int bpf_map__update_vdata(const struct bpf_map *map)
>>>> {
>>>> struct bpf_struct_ops *st_ops;
>>>> - struct bpf_link *link;
>>>> __u32 i, zero = 0;
>>>> - int err;
>>>> -
>>>> - if (!bpf_map__is_struct_ops(map) || map->fd == -1)
>>>> - return libbpf_err_ptr(-EINVAL);
>>>> -
>>>> - link = calloc(1, sizeof(*link));
>>>> - if (!link)
>>>> - return libbpf_err_ptr(-EINVAL);
>>>
>>>> st_ops = map->st_ops;
>>>> for (i = 0; i < btf_vlen(st_ops->type); i++) {
>>>> @@ -11468,17 +11480,48 @@ struct bpf_link
>>>> *bpf_map__attach_struct_ops(const struct bpf_map *map)
>>>> *(unsigned long *)kern_data = prog_fd;
>>>> }
>>>
>>>> - err = bpf_map_update_elem(map->fd, &zero, st_ops->kern_vdata, 0);
>>>> + return bpf_map_update_elem(map->fd, &zero, st_ops->kern_vdata, 0);
>>>> +}
>>>> +
>>>> +struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
>>>> +{
>>>> + struct bpf_link_struct_ops_map *link;
>>>> + int err, fd;
>>>> +
>>>> + if (!bpf_map__is_struct_ops(map) || map->fd == -1)
>>>> + return libbpf_err_ptr(-EINVAL);
>>>> +
>>>> + link = calloc(1, sizeof(*link));
>>>> + if (!link)
>>>> + return libbpf_err_ptr(-EINVAL);
>>>> +
>>>> + err = bpf_map__update_vdata(map);
>>>> if (err) {
>>>> err = -errno;
>>>> free(link);
>>>> return libbpf_err_ptr(err);
>>>> }
>>>
>>>> - link->detach = bpf_link__detach_struct_ops;
>>>> - link->fd = map->fd;
>>>> + link->link.detach = bpf_link__detach_struct_ops;
>>>
>>>> - return link;
>>>> + if (!(map->def.map_flags & BPF_F_LINK)) {
>>>> + /* Fake bpf_link */
>>>> + link->link.fd = map->fd;
>>>> + link->map_fd = -1;
>>>> + return &link->link;
>>>> + }
>>>> +
>>>> + fd = bpf_link_create(map->fd, -1, BPF_STRUCT_OPS_MAP, NULL);
>>>> + if (fd < 0) {
>>>> + err = -errno;
>>>> + free(link);
>>>> + return libbpf_err_ptr(err);
>>>> + }
>>>> +
>>>> + link->link.fd = fd;
>>>> + link->map_fd = map->fd;
>>>> +
>>>> + return &link->link;
>>>> }
>>>
>>>> typedef enum bpf_perf_event_ret (*bpf_perf_event_print_t)(struct
>>>> perf_event_header *hdr,
>>>> --
>>>> 2.30.2
>>>
* Re: [PATCH bpf-next 1/7] bpf: Create links for BPF struct_ops maps.
2023-02-14 22:17 ` [PATCH bpf-next 1/7] bpf: Create links for BPF struct_ops maps Kui-Feng Lee
2023-02-15 0:26 ` kernel test robot
2023-02-15 2:39 ` Stanislav Fomichev
@ 2023-02-15 22:58 ` Martin KaFai Lau
2023-02-16 17:59 ` Kui-Feng Lee
2 siblings, 1 reply; 43+ messages in thread
From: Martin KaFai Lau @ 2023-02-15 22:58 UTC (permalink / raw)
To: Kui-Feng Lee; +Cc: bpf, ast, song, kernel-team, andrii
On 2/14/23 2:17 PM, Kui-Feng Lee wrote:
> BPF struct_ops maps are employed directly to register TCP Congestion
> Control algorithms. Unlike other BPF programs that terminate when
> their links are gone, the struct_ops program reduces its refcount solely
> upon the death of its FD.
I think the refcount comment is probably not needed for this patch.
> The link of a BPF struct_ops map provides a
> uniform experience akin to other types of BPF programs. A TCP
> Congestion Control algorithm will be unregistered upon destroying the
> FD in the following patches.
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 17afd2b35ee5..1e6cdd0f355d 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
The existing BPF_LINK_TYPE_STRUCT_OPS enum is reused. Please explain why it can
be reused in the commit message and also add comments around the existing
"bpf_link_init(&link->link, BPF_LINK_TYPE_STRUCT_OPS...)" in
bpf_struct_ops_map_update_elem().
> @@ -1033,6 +1033,7 @@ enum bpf_attach_type {
> BPF_PERF_EVENT,
> BPF_TRACE_KPROBE_MULTI,
> BPF_LSM_CGROUP,
> + BPF_STRUCT_OPS_MAP,
nit. Only BPF_STRUCT_OPS. No need for "_MAP".
> __MAX_BPF_ATTACH_TYPE
> };
>
> @@ -6354,6 +6355,9 @@ struct bpf_link_info {
> struct {
> __u32 ifindex;
> } xdp;
> + struct {
> + __u32 map_id;
> + } struct_ops_map;
nit. Same here, skip the "_map";
This looks good instead of a union, in case the user space tool directly uses
the prog_id without checking the type.
> };
> } __attribute__((aligned(8)));
>
> diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
> index ece9870cab68..621c8e24481a 100644
> --- a/kernel/bpf/bpf_struct_ops.c
> +++ b/kernel/bpf/bpf_struct_ops.c
> @@ -698,3 +698,69 @@ void bpf_struct_ops_put(const void *kdata)
> call_rcu(&st_map->rcu, bpf_struct_ops_put_rcu);
> }
> }
> +
> +static void bpf_struct_ops_map_link_release(struct bpf_link *link)
> +{
> + if (link->map) {
> + bpf_map_put(link->map);
> + link->map = NULL;
> + }
> +}
> +
> +static void bpf_struct_ops_map_link_dealloc(struct bpf_link *link)
> +{
> + kfree(link);
> +}
> +
> +static void bpf_struct_ops_map_link_show_fdinfo(const struct bpf_link *link,
> + struct seq_file *seq)
> +{
> + seq_printf(seq, "map_id:\t%d\n",
> + link->map->id);
> +}
> +
> +static int bpf_struct_ops_map_link_fill_link_info(const struct bpf_link *link,
> + struct bpf_link_info *info)
> +{
> + info->struct_ops_map.map_id = link->map->id;
> + return 0;
> +}
> +
> +static const struct bpf_link_ops bpf_struct_ops_map_lops = {
> + .release = bpf_struct_ops_map_link_release,
> + .dealloc = bpf_struct_ops_map_link_dealloc,
> + .show_fdinfo = bpf_struct_ops_map_link_show_fdinfo,
> + .fill_link_info = bpf_struct_ops_map_link_fill_link_info,
> +};
Can .detach be supported also?
> +
> +int link_create_struct_ops_map(union bpf_attr *attr, bpfptr_t uattr)
Does it need uattr?
nit. Rename to bpf_struct_ops_link_attach(), like how other link type's "attach"
functions are named. or may be even just bpf_struct_ops_attach().
> +{
> + struct bpf_link_primer link_primer;
> + struct bpf_map *map;
> + struct bpf_link *link = NULL;
> + int err;
> +
> + map = bpf_map_get(attr->link_create.prog_fd);
This one looks weird: passing prog_fd to bpf_map_get(). I think in this case it
makes sense to union map_fd with prog_fd in attr->link_create?
> + if (map->map_type != BPF_MAP_TYPE_STRUCT_OPS)
map is leaked.
> + return -EINVAL;
> +
> + link = kzalloc(sizeof(*link), GFP_USER);
> + if (!link) {
> + err = -ENOMEM;
> + goto err_out;
> + }
> + bpf_link_init(link, BPF_LINK_TYPE_STRUCT_OPS, &bpf_struct_ops_map_lops, NULL);
> + link->map = map;
> +
> + err = bpf_link_prime(link, &link_primer);
> + if (err)
> + goto err_out;
> +
> + return bpf_link_settle(&link_primer);
> +
> +err_out:
> + bpf_map_put(map);
> + kfree(link);
> + return err;
> +}
> +
[ ... ]
> +extern int link_create_struct_ops_map(union bpf_attr *attr, bpfptr_t uattr);
Move it to bpf.h.
> +
> #define BPF_LINK_CREATE_LAST_FIELD link_create.kprobe_multi.cookies
> static int link_create(union bpf_attr *attr, bpfptr_t uattr)
> {
> @@ -4541,6 +4549,9 @@ static int link_create(union bpf_attr *attr, bpfptr_t uattr)
> if (CHECK_ATTR(BPF_LINK_CREATE))
> return -EINVAL;
>
> + if (attr->link_create.attach_type == BPF_STRUCT_OPS_MAP)
> + return link_create_struct_ops_map(attr, uattr);
> +
> prog = bpf_prog_get(attr->link_create.prog_fd);
> if (IS_ERR(prog))
> return PTR_ERR(prog);
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 17afd2b35ee5..1e6cdd0f355d 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -1033,6 +1033,7 @@ enum bpf_attach_type {
> BPF_PERF_EVENT,
> BPF_TRACE_KPROBE_MULTI,
> BPF_LSM_CGROUP,
> + BPF_STRUCT_OPS_MAP,
> __MAX_BPF_ATTACH_TYPE
> };
>
> @@ -6354,6 +6355,9 @@ struct bpf_link_info {
> struct {
> __u32 ifindex;
> } xdp;
> + struct {
> + __u32 map_id;
> + } struct_ops_map;
> };
> } __attribute__((aligned(8)));
>
> diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
> index 9aff98f42a3d..e44d49f17c86 100644
> --- a/tools/lib/bpf/bpf.c
> +++ b/tools/lib/bpf/bpf.c
Not necessary in this set and could be a follow up. Have you thought about how
to generate a skel including the struct_ops link?
> @@ -731,6 +731,8 @@ int bpf_link_create(int prog_fd, int target_fd,
> if (!OPTS_ZEROED(opts, tracing))
> return libbpf_err(-EINVAL);
> break;
> + case BPF_STRUCT_OPS_MAP:
> + break;
> default:
> if (!OPTS_ZEROED(opts, flags))
> return libbpf_err(-EINVAL);
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 35a698eb825d..75ed95b7e455 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -115,6 +115,7 @@ static const char * const attach_type_name[] = {
> [BPF_SK_REUSEPORT_SELECT_OR_MIGRATE] = "sk_reuseport_select_or_migrate",
> [BPF_PERF_EVENT] = "perf_event",
> [BPF_TRACE_KPROBE_MULTI] = "trace_kprobe_multi",
> + [BPF_STRUCT_OPS_MAP] = "struct_ops_map",
> };
>
> static const char * const link_type_name[] = {
* Re: [PATCH bpf-next 3/7] bpf: Register and unregister a struct_ops by their bpf_links.
2023-02-14 22:17 ` [PATCH bpf-next 3/7] bpf: Register and unregister a struct_ops by their bpf_links Kui-Feng Lee
2023-02-15 2:53 ` Stanislav Fomichev
@ 2023-02-16 0:37 ` Martin KaFai Lau
2023-02-16 16:42 ` Kui-Feng Lee
1 sibling, 1 reply; 43+ messages in thread
From: Martin KaFai Lau @ 2023-02-16 0:37 UTC (permalink / raw)
To: Kui-Feng Lee; +Cc: bpf, ast, song, kernel-team, andrii
On 2/14/23 2:17 PM, Kui-Feng Lee wrote:
> Registration via bpf_links ensures a uniform behavior, just like other
> BPF programs. BPF struct_ops were registered/unregistered when
> updating/deleting their values. Only the maps of struct_ops having
> the BPF_F_LINK flag are allowed to back a bpf_link.
>
> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
> ---
> include/uapi/linux/bpf.h | 3 ++
> kernel/bpf/bpf_struct_ops.c | 59 +++++++++++++++++++++++++++++++---
> tools/include/uapi/linux/bpf.h | 3 ++
> 3 files changed, 61 insertions(+), 4 deletions(-)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 1e6cdd0f355d..48d8b3058aa1 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1267,6 +1267,9 @@ enum {
>
> /* Create a map that is suitable to be an inner map with dynamic max entries */
> BPF_F_INNER_MAP = (1U << 12),
> +
> +/* Create a map that will be registered/unregistered by the backing bpf_link */
> + BPF_F_LINK = (1U << 13),
> };
>
> /* Flags for BPF_PROG_QUERY. */
> diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
> index 621c8e24481a..d16ca06cf09a 100644
> --- a/kernel/bpf/bpf_struct_ops.c
> +++ b/kernel/bpf/bpf_struct_ops.c
> @@ -390,7 +390,7 @@ static int bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
>
> mutex_lock(&st_map->lock);
>
> - if (kvalue->state != BPF_STRUCT_OPS_STATE_INIT) {
> + if (kvalue->state != BPF_STRUCT_OPS_STATE_INIT || refcount_read(&kvalue->refcnt)) {
Why does it need a new refcount_read(&kvalue->refcnt) check?
> err = -EBUSY;
> goto unlock;
> }
> @@ -491,6 +491,12 @@ static int bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
> *(unsigned long *)(udata + moff) = prog->aux->id;
> }
>
> + if (st_map->map.map_flags & BPF_F_LINK) {
> + /* Let bpf_link handle registration & unregistration. */
> + smp_store_release(&kvalue->state, BPF_STRUCT_OPS_STATE_INUSE);
INUSE is for registered struct_ops. It needs a new UNREG state to mean
initialized but not registered. The kvalue->state is not in uapi but the user
space can still introspect it (thanks to BTF), so having a correct semantic
state is useful. Try 'bpftool struct_ops dump ...':
"bpf_struct_ops_tcp_congestion_ops": {
"refcnt": {
"refs": {
"counter": 1
}
},
"state": "BPF_STRUCT_OPS_STATE_INUSE",
> + goto unlock;
> + }
> +
> refcount_set(&kvalue->refcnt, 1);
> bpf_map_inc(map);
>
> @@ -522,6 +528,7 @@ static int bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
> kfree(tlinks);
> mutex_unlock(&st_map->lock);
> return err;
> +
Unnecessary new line.
> }
>
> static int bpf_struct_ops_map_delete_elem(struct bpf_map *map, void *key)
> @@ -535,6 +542,8 @@ static int bpf_struct_ops_map_delete_elem(struct bpf_map *map, void *key)
> BPF_STRUCT_OPS_STATE_TOBEFREE);
> switch (prev_state) {
> case BPF_STRUCT_OPS_STATE_INUSE:
> + if (st_map->map.map_flags & BPF_F_LINK)
> + return 0;
This should be a -ENOTSUPP.
> st_map->st_ops->unreg(&st_map->kvalue.data);
> if (refcount_dec_and_test(&st_map->kvalue.refcnt))
> bpf_map_put(map);
> @@ -585,7 +594,7 @@ static void bpf_struct_ops_map_free(struct bpf_map *map)
> static int bpf_struct_ops_map_alloc_check(union bpf_attr *attr)
> {
> if (attr->key_size != sizeof(unsigned int) || attr->max_entries != 1 ||
> - attr->map_flags || !attr->btf_vmlinux_value_type_id)
> + (attr->map_flags & ~BPF_F_LINK) || !attr->btf_vmlinux_value_type_id)
> return -EINVAL;
> return 0;
> }
> @@ -638,6 +647,8 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr)
> set_vm_flush_reset_perms(st_map->image);
> bpf_map_init_from_attr(map, attr);
>
> + map->map_flags |= attr->map_flags & BPF_F_LINK;
This should have already been done in bpf_map_init_from_attr().
> +
> return map;
> }
>
> @@ -699,10 +710,25 @@ void bpf_struct_ops_put(const void *kdata)
> }
> }
>
> +static void bpf_struct_ops_kvalue_put(struct bpf_struct_ops_value *kvalue)
> +{
> + struct bpf_struct_ops_map *st_map;
> +
> + if (refcount_dec_and_test(&kvalue->refcnt)) {
> + st_map = container_of(kvalue, struct bpf_struct_ops_map,
> + kvalue);
> + bpf_map_put(&st_map->map);
> + }
> +}
> +
> static void bpf_struct_ops_map_link_release(struct bpf_link *link)
> {
> + struct bpf_struct_ops_map *st_map;
> +
> if (link->map) {
> - bpf_map_put(link->map);
> + st_map = (struct bpf_struct_ops_map *)link->map;
> + st_map->st_ops->unreg(&st_map->kvalue.data);
> + bpf_struct_ops_kvalue_put(&st_map->kvalue);
> link->map = NULL;
Does it need a lock or something to protect the link_release? Or am I missing
something and a lock is not needed?
The kvalue->state should become UNREG.
After UNREG, can the struct_ops map be used in creating a new link again?
> }
> }
> @@ -735,13 +761,15 @@ static const struct bpf_link_ops bpf_struct_ops_map_lops = {
>
> int link_create_struct_ops_map(union bpf_attr *attr, bpfptr_t uattr)
> {
> + struct bpf_struct_ops_map *st_map;
> struct bpf_link_primer link_primer;
> + struct bpf_struct_ops_value *kvalue;
> struct bpf_map *map;
> struct bpf_link *link = NULL;
> int err;
>
> map = bpf_map_get(attr->link_create.prog_fd);
> - if (map->map_type != BPF_MAP_TYPE_STRUCT_OPS)
> + if (map->map_type != BPF_MAP_TYPE_STRUCT_OPS || !(map->map_flags & BPF_F_LINK))
> return -EINVAL;
>
> link = kzalloc(sizeof(*link), GFP_USER);
> @@ -752,6 +780,29 @@ int link_create_struct_ops_map(union bpf_attr *attr, bpfptr_t uattr)
> bpf_link_init(link, BPF_LINK_TYPE_STRUCT_OPS, &bpf_struct_ops_map_lops, NULL);
> link->map = map;
>
> + if (map->map_flags & BPF_F_LINK) {
> + st_map = (struct bpf_struct_ops_map *)map;
> + kvalue = (struct bpf_struct_ops_value *)&st_map->kvalue;
> +
> + if (kvalue->state != BPF_STRUCT_OPS_STATE_INUSE ||
> + refcount_read(&kvalue->refcnt) != 0) {
The refcount_read(&kvalue->refcnt) is to ensure it is not registered?
It seems the UNREG state is useful here.
> + err = -EINVAL;
> + goto err_out;
> + }
> +
> + refcount_set(&kvalue->refcnt, 1);
If a struct_ops map is used to create multiple links in parallel, is it safe?
> +
> + set_memory_rox((long)st_map->image, 1);
> + err = st_map->st_ops->reg(kvalue->data);
After successful reg, the state can be changed from UNREG to INUSE.
> + if (err) {
> + refcount_set(&kvalue->refcnt, 0);
> +
> + set_memory_nx((long)st_map->image, 1);
> + set_memory_rw((long)st_map->image, 1);
> + goto err_out;
> + }
> + }
This patch should be combined with patch 1. Otherwise, patch 1 is quite hard to
understand without link_create_struct_ops_map() doing the actual "attach".
> +
> err = bpf_link_prime(link, &link_primer);
> if (err)
> goto err_out;
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 1e6cdd0f355d..48d8b3058aa1 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -1267,6 +1267,9 @@ enum {
>
> /* Create a map that is suitable to be an inner map with dynamic max entries */
> BPF_F_INNER_MAP = (1U << 12),
> +
> +/* Create a map that will be registered/unregistered by the backing bpf_link */
> + BPF_F_LINK = (1U << 13),
> };
>
> /* Flags for BPF_PROG_QUERY. */
* Re: [PATCH bpf-next 5/7] bpf: Update the struct_ops of a bpf_link.
2023-02-14 22:17 ` [PATCH bpf-next 5/7] bpf: Update the struct_ops of a bpf_link Kui-Feng Lee
@ 2023-02-16 1:02 ` Martin KaFai Lau
2023-02-16 19:17 ` Kui-Feng Lee
0 siblings, 1 reply; 43+ messages in thread
From: Martin KaFai Lau @ 2023-02-16 1:02 UTC (permalink / raw)
To: Kui-Feng Lee; +Cc: bpf, ast, song, kernel-team, andrii
On 2/14/23 2:17 PM, Kui-Feng Lee wrote:
> diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
> index d16ca06cf09a..d329621fc721 100644
> --- a/kernel/bpf/bpf_struct_ops.c
> +++ b/kernel/bpf/bpf_struct_ops.c
> @@ -752,11 +752,66 @@ static int bpf_struct_ops_map_link_fill_link_info(const struct bpf_link *link,
> return 0;
> }
>
> +static int bpf_struct_ops_map_link_update(struct bpf_link *link, struct bpf_map *new_map)
> +{
> + struct bpf_struct_ops_value *kvalue;
> + struct bpf_struct_ops_map *st_map, *old_st_map;
> + struct bpf_map *old_map;
> + int err;
> +
> + if (new_map->map_type != BPF_MAP_TYPE_STRUCT_OPS || !(new_map->map_flags & BPF_F_LINK))
> + return -EINVAL;
> +
> + old_map = link->map;
> +
> + /* It does nothing if the new map is the same as the old one.
> + * A struct_ops that backs a bpf_link can not be updated or
> + * its kvalue would be updated and causes inconsistencies.
> + */
> + if (old_map == new_map)
> + return 0;
> +
> + /* The new and old struct_ops must be the same type. */
> + st_map = (struct bpf_struct_ops_map *)new_map;
> + old_st_map = (struct bpf_struct_ops_map *)old_map;
> + if (st_map->st_ops != old_st_map->st_ops)
> + return -EINVAL;
> +
> + /* Assure the struct_ops is updated (has value) and not
> + * backing any other link.
> + */
> + kvalue = &st_map->kvalue;
> + if (kvalue->state != BPF_STRUCT_OPS_STATE_INUSE ||
> + refcount_read(&kvalue->refcnt) != 0)
> + return -EINVAL;
> +
> + bpf_map_inc(new_map);
> + refcount_set(&kvalue->refcnt, 1);
> +
> + set_memory_rox((long)st_map->image, 1);
> + err = st_map->st_ops->update(kvalue->data, old_st_map->kvalue.data);
> + if (err) {
> + refcount_set(&kvalue->refcnt, 0);
> +
> + set_memory_nx((long)st_map->image, 1);
> + set_memory_rw((long)st_map->image, 1);
> + bpf_map_put(new_map);
> + return err;
> + }
> +
> + link->map = new_map;
Similarly here, does this link_update operation need a lock?
> +
> + bpf_struct_ops_kvalue_put(&old_st_map->kvalue);
> +
> + return 0;
> +}
> +
> static const struct bpf_link_ops bpf_struct_ops_map_lops = {
> .release = bpf_struct_ops_map_link_release,
> .dealloc = bpf_struct_ops_map_link_dealloc,
> .show_fdinfo = bpf_struct_ops_map_link_show_fdinfo,
> .fill_link_info = bpf_struct_ops_map_link_fill_link_info,
> + .update_struct_ops = bpf_struct_ops_map_link_update,
It seems a little non-intuitive to add a struct_ops-specific thing to the
generic bpf_link_ops. Maybe avoid adding ".update_struct_ops" and directly call
bpf_struct_ops_map_link_update() from link_update()?
> };
>
> int link_create_struct_ops_map(union bpf_attr *attr, bpfptr_t uattr)
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 54e172d8f5d1..1341634863b5 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -4650,6 +4650,32 @@ static int link_create(union bpf_attr *attr, bpfptr_t uattr)
> return ret;
> }
>
> +#define BPF_LINK_UPDATE_STRUCT_OPS_LAST_FIELD link_update_struct_ops.new_map_fd
Why is it needed? Does it hit an error without it?
> +
> +static int link_update_struct_ops(struct bpf_link *link, union bpf_attr *attr)
> +{
> + struct bpf_map *new_map;
> + int ret = 0;
> +
> + new_map = bpf_map_get(attr->link_update.new_map_fd);
> + if (IS_ERR(new_map))
> + return -EINVAL;
> +
> + if (new_map->map_type != BPF_MAP_TYPE_STRUCT_OPS) {
> + ret = -EINVAL;
> + goto out_put_map;
> + }
How about BPF_F_REPLACE?
> +
> + if (link->ops->update_struct_ops)
> + ret = link->ops->update_struct_ops(link, new_map);
> + else
> + ret = -EINVAL;
> +
> +out_put_map:
> + bpf_map_put(new_map);
> + return ret;
> +}
> +
> #define BPF_LINK_UPDATE_LAST_FIELD link_update.old_prog_fd
>
> static int link_update(union bpf_attr *attr)
> @@ -4670,6 +4696,11 @@ static int link_update(union bpf_attr *attr)
> if (IS_ERR(link))
> return PTR_ERR(link);
>
> + if (link->type == BPF_LINK_TYPE_STRUCT_OPS) {
> + ret = link_update_struct_ops(link, attr);
> + goto out_put_link;
> + }
> +
> new_prog = bpf_prog_get(attr->link_update.new_prog_fd);
> if (IS_ERR(new_prog)) {
> ret = PTR_ERR(new_prog);
> diff --git a/net/ipv4/bpf_tcp_ca.c b/net/ipv4/bpf_tcp_ca.c
> index 66ce5fadfe42..558b01d5250f 100644
> --- a/net/ipv4/bpf_tcp_ca.c
> +++ b/net/ipv4/bpf_tcp_ca.c
> @@ -239,8 +239,6 @@ static int bpf_tcp_ca_init_member(const struct btf_type *t,
> if (bpf_obj_name_cpy(tcp_ca->name, utcp_ca->name,
> sizeof(tcp_ca->name)) <= 0)
> return -EINVAL;
> - if (tcp_ca_find(utcp_ca->name))
> - return -EEXIST;
This change is not obvious. Please put some comment in the commit message about
this change.
* Re: [PATCH bpf-next 3/7] bpf: Register and unregister a struct_ops by their bpf_links.
2023-02-16 0:37 ` Martin KaFai Lau
@ 2023-02-16 16:42 ` Kui-Feng Lee
2023-02-16 22:38 ` Martin KaFai Lau
0 siblings, 1 reply; 43+ messages in thread
From: Kui-Feng Lee @ 2023-02-16 16:42 UTC (permalink / raw)
To: Martin KaFai Lau, Kui-Feng Lee; +Cc: bpf, ast, song, kernel-team, andrii
On 2/15/23 16:37, Martin KaFai Lau wrote:
> On 2/14/23 2:17 PM, Kui-Feng Lee wrote:
>> Registration via bpf_links ensures a uniform behavior, just like other
>> BPF programs. BPF struct_ops were registered/unregistered when
>> updating/deleting their values. Only the maps of struct_ops having
>> the BPF_F_LINK flag are allowed to back a bpf_link.
>>
>> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
>> ---
>> include/uapi/linux/bpf.h | 3 ++
>> kernel/bpf/bpf_struct_ops.c | 59 +++++++++++++++++++++++++++++++---
>> tools/include/uapi/linux/bpf.h | 3 ++
>> 3 files changed, 61 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 1e6cdd0f355d..48d8b3058aa1 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -1267,6 +1267,9 @@ enum {
>> /* Create a map that is suitable to be an inner map with dynamic max
>> entries */
>> BPF_F_INNER_MAP = (1U << 12),
>> +
>> +/* Create a map that will be registered/unregistered by the backing
>> bpf_link */
>> + BPF_F_LINK = (1U << 13),
>> };
>> /* Flags for BPF_PROG_QUERY. */
>> diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
>> index 621c8e24481a..d16ca06cf09a 100644
>> --- a/kernel/bpf/bpf_struct_ops.c
>> +++ b/kernel/bpf/bpf_struct_ops.c
>> @@ -390,7 +390,7 @@ static int bpf_struct_ops_map_update_elem(struct
>> bpf_map *map, void *key,
>> mutex_lock(&st_map->lock);
>> - if (kvalue->state != BPF_STRUCT_OPS_STATE_INIT) {
>> + if (kvalue->state != BPF_STRUCT_OPS_STATE_INIT ||
>> refcount_read(&kvalue->refcnt)) {
>
> Why it needs a new refcount_read(&kvalue->refcnt) check?
It prohibits updating the value once it is registered.
This refcnt is set to 1 when registering it.
But, yes, it is confusing since we never reset it back to *_INIT.
The purpose of this refcount_read() will be clear once we add *_UNREG and
reset the state back to *_INIT properly.
>
>> err = -EBUSY;
>> goto unlock;
>> }
>> @@ -491,6 +491,12 @@ static int bpf_struct_ops_map_update_elem(struct
>> bpf_map *map, void *key,
>> *(unsigned long *)(udata + moff) = prog->aux->id;
>> }
>> + if (st_map->map.map_flags & BPF_F_LINK) {
>> + /* Let bpf_link handle registration & unregistration. */
>> + smp_store_release(&kvalue->state, BPF_STRUCT_OPS_STATE_INUSE);
>
> INUSE is for registered struct_ops. It needs a new UNREG state to mean
> initialized but not registered. The kvalue->state is not in uapi but the
> user space can still introspect it (thanks to BTF), so having a correct
> semantic state is useful. Try 'bpftool struct_ops dump ...':
>
> "bpf_struct_ops_tcp_congestion_ops": {
> "refcnt": {
> "refs": {
> "counter": 1
> }
> },
> "state": "BPF_STRUCT_OPS_STATE_INUSE",
Ok! That makes sense.
>
>> + goto unlock;
>> + }
>> +
>> refcount_set(&kvalue->refcnt, 1);
>> bpf_map_inc(map);
>> @@ -522,6 +528,7 @@ static int bpf_struct_ops_map_update_elem(struct
>> bpf_map *map, void *key,
>> kfree(tlinks);
>> mutex_unlock(&st_map->lock);
>> return err;
>> +
>
> Unnecessary new line.
>
>> }
>> static int bpf_struct_ops_map_delete_elem(struct bpf_map *map, void
>> *key)
>> @@ -535,6 +542,8 @@ static int bpf_struct_ops_map_delete_elem(struct
>> bpf_map *map, void *key)
>> BPF_STRUCT_OPS_STATE_TOBEFREE);
>> switch (prev_state) {
>> case BPF_STRUCT_OPS_STATE_INUSE:
>> + if (st_map->map.map_flags & BPF_F_LINK)
>> + return 0;
>
> This should be a -ENOTSUPP.
Sure!
>
>> st_map->st_ops->unreg(&st_map->kvalue.data);
>> if (refcount_dec_and_test(&st_map->kvalue.refcnt))
>> bpf_map_put(map);
>> @@ -585,7 +594,7 @@ static void bpf_struct_ops_map_free(struct bpf_map
>> *map)
>> static int bpf_struct_ops_map_alloc_check(union bpf_attr *attr)
>> {
>> if (attr->key_size != sizeof(unsigned int) || attr->max_entries
>> != 1 ||
>> - attr->map_flags || !attr->btf_vmlinux_value_type_id)
>> + (attr->map_flags & ~BPF_F_LINK) ||
>> !attr->btf_vmlinux_value_type_id)
>> return -EINVAL;
>> return 0;
>> }
>> @@ -638,6 +647,8 @@ static struct bpf_map
>> *bpf_struct_ops_map_alloc(union bpf_attr *attr)
>> set_vm_flush_reset_perms(st_map->image);
>> bpf_map_init_from_attr(map, attr);
>> + map->map_flags |= attr->map_flags & BPF_F_LINK;
>
> This should have already been done in bpf_map_init_from_attr().
bpf_map_init_from_attr() will filter out all flags except BPF_F_RDONLY &
BPF_F_WRONLY. But I can move it to bpf_map_init_from_attr() by not
filtering it out.
>
>> +
>> return map;
>> }
>> @@ -699,10 +710,25 @@ void bpf_struct_ops_put(const void *kdata)
>> }
>> }
>> +static void bpf_struct_ops_kvalue_put(struct bpf_struct_ops_value
>> *kvalue)
>> +{
>> + struct bpf_struct_ops_map *st_map;
>> +
>> + if (refcount_dec_and_test(&kvalue->refcnt)) {
>> + st_map = container_of(kvalue, struct bpf_struct_ops_map,
>> + kvalue);
>> + bpf_map_put(&st_map->map);
>> + }
>> +}
>> +
>> static void bpf_struct_ops_map_link_release(struct bpf_link *link)
>> {
>> + struct bpf_struct_ops_map *st_map;
>> +
>> if (link->map) {
>> - bpf_map_put(link->map);
>> + st_map = (struct bpf_struct_ops_map *)link->map;
>> + st_map->st_ops->unreg(&st_map->kvalue.data);
>> + bpf_struct_ops_kvalue_put(&st_map->kvalue);
>> link->map = NULL;
>
> Does it need a lock or something to protect the link_release? or I am
> missing something and lock is not needed?
This function will be called by bpf_link_free() through the pointer in
bpf_link_ops, and bpf_link_free() is called by bpf_link_put(). The
refcnt of the bpf_link is maintained by bpf_link_put(), and the function
here is only reached indirectly once the refcnt drops to 0. If I am not
missing anything, it should be safe to release a link without a lock.
>
> The kvalue->value state should become UNREG.
>
> After UNREG, can the struct_ops map be used in creating a new link again?
>
It should be.
>> }
>> }
>> @@ -735,13 +761,15 @@ static const struct bpf_link_ops
>> bpf_struct_ops_map_lops = {
>> int link_create_struct_ops_map(union bpf_attr *attr, bpfptr_t uattr)
>> {
>> + struct bpf_struct_ops_map *st_map;
>> struct bpf_link_primer link_primer;
>> + struct bpf_struct_ops_value *kvalue;
>> struct bpf_map *map;
>> struct bpf_link *link = NULL;
>> int err;
>> map = bpf_map_get(attr->link_create.prog_fd);
>> - if (map->map_type != BPF_MAP_TYPE_STRUCT_OPS)
>> + if (map->map_type != BPF_MAP_TYPE_STRUCT_OPS || !(map->map_flags
>> & BPF_F_LINK))
>> return -EINVAL;
>> link = kzalloc(sizeof(*link), GFP_USER);
>> @@ -752,6 +780,29 @@ int link_create_struct_ops_map(union bpf_attr
>> *attr, bpfptr_t uattr)
>> bpf_link_init(link, BPF_LINK_TYPE_STRUCT_OPS,
>> &bpf_struct_ops_map_lops, NULL);
>> link->map = map;
>> + if (map->map_flags & BPF_F_LINK) {
>> + st_map = (struct bpf_struct_ops_map *)map;
>> + kvalue = (struct bpf_struct_ops_value *)&st_map->kvalue;
>> +
>> + if (kvalue->state != BPF_STRUCT_OPS_STATE_INUSE ||
>> + refcount_read(&kvalue->refcnt) != 0) {
>
> The refcount_read(&kvalue->refcnt) is to ensure it is not registered?
> It seems the UNREG state is useful here.
Yes!
>
>> + err = -EINVAL;
>> + goto err_out;
>> + }
>> +
>> + refcount_set(&kvalue->refcnt, 1);
>
> If a struct_ops map is used to create multiple links in parallel, is it
> safe?
>
>> +
>> + set_memory_rox((long)st_map->image, 1);
>> + err = st_map->st_ops->reg(kvalue->data);
>
> After successful reg, the state can be changed from UNREG to INUSE.
>
>> + if (err) {
>> + refcount_set(&kvalue->refcnt, 0);
>> +
>> + set_memory_nx((long)st_map->image, 1);
>> + set_memory_rw((long)st_map->image, 1);
>> + goto err_out;
>> + }
>> + }
>
> This patch should be combined with patch 1. Otherwise, patch 1 is quite
> hard to understand without link_create_struct_ops_map() doing the actual
> "attach".
Ok!
>
>> +
>> err = bpf_link_prime(link, &link_primer);
>> if (err)
>> goto err_out;
>> diff --git a/tools/include/uapi/linux/bpf.h
>> b/tools/include/uapi/linux/bpf.h
>> index 1e6cdd0f355d..48d8b3058aa1 100644
>> --- a/tools/include/uapi/linux/bpf.h
>> +++ b/tools/include/uapi/linux/bpf.h
>> @@ -1267,6 +1267,9 @@ enum {
>> /* Create a map that is suitable to be an inner map with dynamic max
>> entries */
>> BPF_F_INNER_MAP = (1U << 12),
>> +
>> +/* Create a map that will be registered/unregistered by the backing
>> bpf_link */
>> + BPF_F_LINK = (1U << 13),
>> };
>> /* Flags for BPF_PROG_QUERY. */
>
* Re: [PATCH bpf-next 1/7] bpf: Create links for BPF struct_ops maps.
2023-02-15 22:58 ` Martin KaFai Lau
@ 2023-02-16 17:59 ` Kui-Feng Lee
0 siblings, 0 replies; 43+ messages in thread
From: Kui-Feng Lee @ 2023-02-16 17:59 UTC (permalink / raw)
To: Martin KaFai Lau, Kui-Feng Lee; +Cc: bpf, ast, song, kernel-team, andrii
On 2/15/23 14:58, Martin KaFai Lau wrote:
> On 2/14/23 2:17 PM, Kui-Feng Lee wrote:
>> BPF struct_ops maps are employed directly to register TCP Congestion
>> Control algorithms. Unlike other BPF programs that terminate when
>> their links gone, the struct_ops program reduces its refcount solely
>> upon death of its FD.
>
> I think the refcount comment probably not needed for this patch.
Got it!
>
>> The link of a BPF struct_ops map provides a
>> uniform experience akin to other types of BPF programs. A TCP
>> Congestion Control algorithm will be unregistered upon destroying the
>> FD in the following patches.
>
>
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 17afd2b35ee5..1e6cdd0f355d 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>
> The existing BPF_LINK_TYPE_STRUCT_OPS enum is reused. Please explain why
> it can be reused in the commit message and also add comments around the
> existing "bpf_link_init(&link->link, BPF_LINK_TYPE_STRUCT_OPS...)" in
> bpf_struct_ops_map_update_elem().
Sure!
>
>> @@ -1033,6 +1033,7 @@ enum bpf_attach_type {
>> BPF_PERF_EVENT,
>> BPF_TRACE_KPROBE_MULTI,
>> BPF_LSM_CGROUP,
>> + BPF_STRUCT_OPS_MAP,
>
> nit. Only BPF_STRUCT_OPS. No need for "_MAP".
>
>> __MAX_BPF_ATTACH_TYPE
>> };
>> @@ -6354,6 +6355,9 @@ struct bpf_link_info {
>> struct {
>> __u32 ifindex;
>> } xdp;
>> + struct {
>> + __u32 map_id;
>> + } struct_ops_map;
>
> nit. Same here, skip the "_map";
Got!
>
> This looks good instead of union. In case the user space tool directly
> uses the prog_id without checking the type.
>
>> };
>> } __attribute__((aligned(8)));
>> diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
>> index ece9870cab68..621c8e24481a 100644
>> --- a/kernel/bpf/bpf_struct_ops.c
>> +++ b/kernel/bpf/bpf_struct_ops.c
>> @@ -698,3 +698,69 @@ void bpf_struct_ops_put(const void *kdata)
>> call_rcu(&st_map->rcu, bpf_struct_ops_put_rcu);
>> }
>> }
>> +
>> +static void bpf_struct_ops_map_link_release(struct bpf_link *link)
>> +{
>> + if (link->map) {
>> + bpf_map_put(link->map);
>> + link->map = NULL;
>> + }
>> +}
>> +
>> +static void bpf_struct_ops_map_link_dealloc(struct bpf_link *link)
>> +{
>> + kfree(link);
>> +}
>> +
>> +static void bpf_struct_ops_map_link_show_fdinfo(const struct bpf_link
>> *link,
>> + struct seq_file *seq)
>> +{
>> + seq_printf(seq, "map_id:\t%d\n",
>> + link->map->id);
>> +}
>> +
>> +static int bpf_struct_ops_map_link_fill_link_info(const struct
>> bpf_link *link,
>> + struct bpf_link_info *info)
>> +{
>> + info->struct_ops_map.map_id = link->map->id;
>> + return 0;
>> +}
>> +
>> +static const struct bpf_link_ops bpf_struct_ops_map_lops = {
>> + .release = bpf_struct_ops_map_link_release,
>> + .dealloc = bpf_struct_ops_map_link_dealloc,
>> + .show_fdinfo = bpf_struct_ops_map_link_show_fdinfo,
>> + .fill_link_info = bpf_struct_ops_map_link_fill_link_info,
>> +};
>
> Can .detach be supported also?
Sure!
>
>> +
>> +int link_create_struct_ops_map(union bpf_attr *attr, bpfptr_t uattr)
>
> Does it need uattr?
>
> nit. Rename to bpf_struct_ops_link_attach(), like how other link type's
> "attach" functions are named. or may be even just bpf_struct_ops_attach().
Got it!
>
>> +{
>> + struct bpf_link_primer link_primer;
>> + struct bpf_map *map;
>> + struct bpf_link *link = NULL;
>> + int err;
>> +
>> + map = bpf_map_get(attr->link_create.prog_fd);
>
> This one looks weird. passing prog_fd to bpf_map_get(). I think in this
> case it makes sense to union map_fd with prog_fd in attr->link_create ?
>
>> + if (map->map_type != BPF_MAP_TYPE_STRUCT_OPS)
>
> map is leaked.
Yes, I will fix it.
>
>> + return -EINVAL;
>> +
>> + link = kzalloc(sizeof(*link), GFP_USER);
>> + if (!link) {
>> + err = -ENOMEM;
>> + goto err_out;
>> + }
>> + bpf_link_init(link, BPF_LINK_TYPE_STRUCT_OPS, &bpf_struct_ops_map_lops, NULL);
>> + link->map = map;
>> +
>> + err = bpf_link_prime(link, &link_primer);
>> + if (err)
>> + goto err_out;
>> +
>> + return bpf_link_settle(&link_primer);
>> +
>> +err_out:
>> + bpf_map_put(map);
>> + kfree(link);
>> + return err;
>> +}
>> +
>
> [ ... ]
>
>> +extern int link_create_struct_ops_map(union bpf_attr *attr, bpfptr_t uattr);
>
> Move it to bpf.h.
>
>> +
>> #define BPF_LINK_CREATE_LAST_FIELD link_create.kprobe_multi.cookies
>> static int link_create(union bpf_attr *attr, bpfptr_t uattr)
>> {
>> @@ -4541,6 +4549,9 @@ static int link_create(union bpf_attr *attr, bpfptr_t uattr)
>> if (CHECK_ATTR(BPF_LINK_CREATE))
>> return -EINVAL;
>> + if (attr->link_create.attach_type == BPF_STRUCT_OPS_MAP)
>> + return link_create_struct_ops_map(attr, uattr);
>> +
>> prog = bpf_prog_get(attr->link_create.prog_fd);
>> if (IS_ERR(prog))
>> return PTR_ERR(prog);
>> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
>> index 17afd2b35ee5..1e6cdd0f355d 100644
>> --- a/tools/include/uapi/linux/bpf.h
>> +++ b/tools/include/uapi/linux/bpf.h
>> @@ -1033,6 +1033,7 @@ enum bpf_attach_type {
>> BPF_PERF_EVENT,
>> BPF_TRACE_KPROBE_MULTI,
>> BPF_LSM_CGROUP,
>> + BPF_STRUCT_OPS_MAP,
>> __MAX_BPF_ATTACH_TYPE
>> };
>> @@ -6354,6 +6355,9 @@ struct bpf_link_info {
>> struct {
>> __u32 ifindex;
>> } xdp;
>> + struct {
>> + __u32 map_id;
>> + } struct_ops_map;
>> };
>> } __attribute__((aligned(8)));
>> diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
>> index 9aff98f42a3d..e44d49f17c86 100644
>> --- a/tools/lib/bpf/bpf.c
>> +++ b/tools/lib/bpf/bpf.c
>
> Not necessary in this set and could be a follow up. Have you thought
> about how to generate a skel including the struct_ops link?
The user currently has to call bpf_map__set_map_flags() between
XXX__open() and XXX__load(). To simplify the process, the skeleton should
invoke bpf_map__set_map_flags() itself as part of XXX__open_and_load().
That requires a way to indicate which struct_ops maps need a link. For
instance,
SEC(".struct_ops")
struct tcp_congestion_ops xxx_map = {
	...
	.flags = BPF_F_LINK,
	...
};
We can probably do this without any change to the code generator.
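For reference, the manual per-map flow described above would look roughly
like this (a non-compilable sketch: the xxx names are skeleton
placeholders, and it assumes libbpf's bpf_map__map_flags() /
bpf_map__set_map_flags() accessors):

```c
struct xxx *skel = xxx__open();

/* Opt this struct_ops map into link-backed registration before load. */
bpf_map__set_map_flags(skel->maps.xxx_map,
		       bpf_map__map_flags(skel->maps.xxx_map) | BPF_F_LINK);

xxx__load(skel);
link = bpf_map__attach_struct_ops(skel->maps.xxx_map);
```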
>
>> @@ -731,6 +731,8 @@ int bpf_link_create(int prog_fd, int target_fd,
>> if (!OPTS_ZEROED(opts, tracing))
>> return libbpf_err(-EINVAL);
>> break;
>> + case BPF_STRUCT_OPS_MAP:
>> + break;
>> default:
>> if (!OPTS_ZEROED(opts, flags))
>> return libbpf_err(-EINVAL);
>> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
>> index 35a698eb825d..75ed95b7e455 100644
>> --- a/tools/lib/bpf/libbpf.c
>> +++ b/tools/lib/bpf/libbpf.c
>> @@ -115,6 +115,7 @@ static const char * const attach_type_name[] = {
>> [BPF_SK_REUSEPORT_SELECT_OR_MIGRATE] = "sk_reuseport_select_or_migrate",
>> [BPF_PERF_EVENT] = "perf_event",
>> [BPF_TRACE_KPROBE_MULTI] = "trace_kprobe_multi",
>> + [BPF_STRUCT_OPS_MAP] = "struct_ops_map",
>> };
>> static const char * const link_type_name[] = {
>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH bpf-next 5/7] bpf: Update the struct_ops of a bpf_link.
2023-02-16 1:02 ` Martin KaFai Lau
@ 2023-02-16 19:17 ` Kui-Feng Lee
2023-02-16 19:40 ` Martin KaFai Lau
0 siblings, 1 reply; 43+ messages in thread
From: Kui-Feng Lee @ 2023-02-16 19:17 UTC (permalink / raw)
To: Martin KaFai Lau, Kui-Feng Lee; +Cc: bpf, ast, song, kernel-team, andrii
On 2/15/23 17:02, Martin KaFai Lau wrote:
> On 2/14/23 2:17 PM, Kui-Feng Lee wrote:
>> diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
>> index d16ca06cf09a..d329621fc721 100644
>> --- a/kernel/bpf/bpf_struct_ops.c
>> +++ b/kernel/bpf/bpf_struct_ops.c
>> @@ -752,11 +752,66 @@ static int bpf_struct_ops_map_link_fill_link_info(const struct bpf_link *link,
>> return 0;
>> }
>> +static int bpf_struct_ops_map_link_update(struct bpf_link *link, struct bpf_map *new_map)
>> +{
>> + struct bpf_struct_ops_value *kvalue;
>> + struct bpf_struct_ops_map *st_map, *old_st_map;
>> + struct bpf_map *old_map;
>> + int err;
>> +
>> + if (new_map->map_type != BPF_MAP_TYPE_STRUCT_OPS || !(new_map->map_flags & BPF_F_LINK))
>> + return -EINVAL;
>> +
>> + old_map = link->map;
>> +
>> + /* It does nothing if the new map is the same as the old one.
>> + * A struct_ops that backs a bpf_link can not be updated or
>> + * its kvalue would be updated and causes inconsistencies.
>> + */
>> + if (old_map == new_map)
>> + return 0;
>> +
>> + /* The new and old struct_ops must be the same type. */
>> + st_map = (struct bpf_struct_ops_map *)new_map;
>> + old_st_map = (struct bpf_struct_ops_map *)old_map;
>> + if (st_map->st_ops != old_st_map->st_ops)
>> + return -EINVAL;
>> +
>> + /* Assure the struct_ops is updated (has value) and not
>> + * backing any other link.
>> + */
>> + kvalue = &st_map->kvalue;
>> + if (kvalue->state != BPF_STRUCT_OPS_STATE_INUSE ||
>> + refcount_read(&kvalue->refcnt) != 0)
>> + return -EINVAL;
>> +
>> + bpf_map_inc(new_map);
>> + refcount_set(&kvalue->refcnt, 1);
>> +
>> + set_memory_rox((long)st_map->image, 1);
>> + err = st_map->st_ops->update(kvalue->data, old_st_map->kvalue.data);
>> + if (err) {
>> + refcount_set(&kvalue->refcnt, 0);
>> +
>> + set_memory_nx((long)st_map->image, 1);
>> + set_memory_rw((long)st_map->image, 1);
>> + bpf_map_put(new_map);
>> + return err;
>> + }
>> +
>> + link->map = new_map;
>
> Similar here, does this link_update operation needs a lock?
The update function of tcp_ca checks that the name is unique under the
protection of a lock. bpf_struct_ops_map_update_elem() also checks and
updates the state of the kvalue to prevent the kvalue from changing. Only
one thread can succeed in registering or updating at any moment.
>
>> +
>> + bpf_struct_ops_kvalue_put(&old_st_map->kvalue);
>> +
>> + return 0;
>> +}
>> +
>> static const struct bpf_link_ops bpf_struct_ops_map_lops = {
>> .release = bpf_struct_ops_map_link_release,
>> .dealloc = bpf_struct_ops_map_link_dealloc,
>> .show_fdinfo = bpf_struct_ops_map_link_show_fdinfo,
>> .fill_link_info = bpf_struct_ops_map_link_fill_link_info,
>> + .update_struct_ops = bpf_struct_ops_map_link_update,
>
> This seems a little non-intuitive to add a struct_ops specific thing to
> the generic bpf_link_ops. May be avoid adding ".update_struct_ops" and
> directly call the bpf_struct_ops_map_link_update() from link_update()?
It has `.update_prog` for BPF programs so `.update_struct_ops` or
`.update_map` is not that weird for me. It would be better to have a
`.update_link` to receive either a bpf_prog or bpf_map, and remove
`.update_prog`.
>
>
>> };
>> int link_create_struct_ops_map(union bpf_attr *attr, bpfptr_t uattr)
>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> index 54e172d8f5d1..1341634863b5 100644
>> --- a/kernel/bpf/syscall.c
>> +++ b/kernel/bpf/syscall.c
>> @@ -4650,6 +4650,32 @@ static int link_create(union bpf_attr *attr, bpfptr_t uattr)
>> return ret;
>> }
>> +#define BPF_LINK_UPDATE_STRUCT_OPS_LAST_FIELD link_update_struct_ops.new_map_fd
>
> Why it is needed? Does it hit error without it?
It can be removed now.
>
>> +
>> +static int link_update_struct_ops(struct bpf_link *link, union bpf_attr *attr)
>> +{
>> + struct bpf_map *new_map;
>> + int ret = 0;
>> +
>> + new_map = bpf_map_get(attr->link_update.new_map_fd);
>> + if (IS_ERR(new_map))
>> + return -EINVAL;
>> +
>> + if (new_map->map_type != BPF_MAP_TYPE_STRUCT_OPS) {
>> + ret = -EINVAL;
>> + goto out_put_map;
>> + }
>
> How about BPF_F_REPLACE?
Do you mean the new_map should be labeled with BPF_F_REPLACE to replace
the old one?
>
>> +
>> + if (link->ops->update_struct_ops)
>> + ret = link->ops->update_struct_ops(link, new_map);
>> + else
>> + ret = -EINVAL;
>> +
>> +out_put_map:
>> + bpf_map_put(new_map);
>> + return ret;
>> +}
>> +
>> #define BPF_LINK_UPDATE_LAST_FIELD link_update.old_prog_fd
>> static int link_update(union bpf_attr *attr)
>> @@ -4670,6 +4696,11 @@ static int link_update(union bpf_attr *attr)
>> if (IS_ERR(link))
>> return PTR_ERR(link);
>> + if (link->type == BPF_LINK_TYPE_STRUCT_OPS) {
>> + ret = link_update_struct_ops(link, attr);
>> + goto out_put_link;
>> + }
>> +
>> new_prog = bpf_prog_get(attr->link_update.new_prog_fd);
>> if (IS_ERR(new_prog)) {
>> ret = PTR_ERR(new_prog);
>> diff --git a/net/ipv4/bpf_tcp_ca.c b/net/ipv4/bpf_tcp_ca.c
>> index 66ce5fadfe42..558b01d5250f 100644
>> --- a/net/ipv4/bpf_tcp_ca.c
>> +++ b/net/ipv4/bpf_tcp_ca.c
>> @@ -239,8 +239,6 @@ static int bpf_tcp_ca_init_member(const struct btf_type *t,
>> if (bpf_obj_name_cpy(tcp_ca->name, utcp_ca->name,
>> sizeof(tcp_ca->name)) <= 0)
>> return -EINVAL;
>> - if (tcp_ca_find(utcp_ca->name))
>> - return -EEXIST;
>
> This change is not obvious. Please put some comment in the commit
> message about this change.
>
sure!
* Re: [PATCH bpf-next 5/7] bpf: Update the struct_ops of a bpf_link.
2023-02-16 19:17 ` Kui-Feng Lee
@ 2023-02-16 19:40 ` Martin KaFai Lau
0 siblings, 0 replies; 43+ messages in thread
From: Martin KaFai Lau @ 2023-02-16 19:40 UTC (permalink / raw)
To: Kui-Feng Lee, Kui-Feng Lee; +Cc: bpf, ast, song, kernel-team, andrii
On 2/16/23 11:17 AM, Kui-Feng Lee wrote:
>>> +static int bpf_struct_ops_map_link_update(struct bpf_link *link, struct bpf_map *new_map)
>>> +{
>>> + struct bpf_struct_ops_value *kvalue;
>>> + struct bpf_struct_ops_map *st_map, *old_st_map;
>>> + struct bpf_map *old_map;
>>> + int err;
>>> +
>>> + if (new_map->map_type != BPF_MAP_TYPE_STRUCT_OPS || !(new_map->map_flags & BPF_F_LINK))
>>> + return -EINVAL;
>>> +
>>> + old_map = link->map;
>>> +
>>> + /* It does nothing if the new map is the same as the old one.
>>> + * A struct_ops that backs a bpf_link can not be updated or
>>> + * its kvalue would be updated and causes inconsistencies.
>>> + */
>>> + if (old_map == new_map)
>>> + return 0;
>>> +
>>> + /* The new and old struct_ops must be the same type. */
>>> + st_map = (struct bpf_struct_ops_map *)new_map;
>>> + old_st_map = (struct bpf_struct_ops_map *)old_map;
>>> + if (st_map->st_ops != old_st_map->st_ops)
>>> + return -EINVAL;
>>> +
>>> + /* Assure the struct_ops is updated (has value) and not
>>> + * backing any other link.
>>> + */
>>> + kvalue = &st_map->kvalue;
>>> + if (kvalue->state != BPF_STRUCT_OPS_STATE_INUSE ||
>>> + refcount_read(&kvalue->refcnt) != 0)
>>> + return -EINVAL;
>>> +
>>> + bpf_map_inc(new_map);
>>> + refcount_set(&kvalue->refcnt, 1);
>>> +
>>> + set_memory_rox((long)st_map->image, 1);
>>> + err = st_map->st_ops->update(kvalue->data, old_st_map->kvalue.data);
>>> + if (err) {
>>> + refcount_set(&kvalue->refcnt, 0);
>>> +
>>> + set_memory_nx((long)st_map->image, 1);
>>> + set_memory_rw((long)st_map->image, 1);
>>> + bpf_map_put(new_map);
>>> + return err;
>>> + }
>>> +
>>> + link->map = new_map;
>>
>> Similar here, does this link_update operation needs a lock?
>
> The update function of tcp_ca checks that the name is unique under the
> protection of a lock. bpf_struct_ops_map_update_elem() also checks and
> updates the state of the kvalue to prevent the kvalue from changing. Only
> one thread can succeed in registering or updating at any moment.
hmm... meaning the lock inside the "->update()" function? There are many
variables outside of update() that this function
(bpf_struct_ops_map_link_update) sets and tests without a lock, e.g. the
thread that succeeds will set refcnt to 1 while a failed thread will set
it back to 0...
>>> +
>>> +static int link_update_struct_ops(struct bpf_link *link, union bpf_attr *attr)
>>> +{
>>> + struct bpf_map *new_map;
>>> + int ret = 0;
>>> +
>>> + new_map = bpf_map_get(attr->link_update.new_map_fd);
>>> + if (IS_ERR(new_map))
>>> + return -EINVAL;
>>> +
>>> + if (new_map->map_type != BPF_MAP_TYPE_STRUCT_OPS) {
>>> + ret = -EINVAL;
>>> + goto out_put_map;
>>> + }
>>
>> How about BPF_F_REPLACE?
>
> Do you mean the new_map should be labeled with BPF_F_REPLACE to replace the old
> one?
I was asking whether BPF_F_REPLACE is supported.
* Re: [PATCH bpf-next 3/7] bpf: Register and unregister a struct_ops by their bpf_links.
2023-02-16 16:42 ` Kui-Feng Lee
@ 2023-02-16 22:38 ` Martin KaFai Lau
2023-02-17 22:17 ` Kui-Feng Lee
0 siblings, 1 reply; 43+ messages in thread
From: Martin KaFai Lau @ 2023-02-16 22:38 UTC (permalink / raw)
To: Kui-Feng Lee, Kui-Feng Lee; +Cc: bpf, ast, song, kernel-team, andrii
On 2/16/23 8:42 AM, Kui-Feng Lee wrote:
>>> @@ -638,6 +647,8 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr)
>>> set_vm_flush_reset_perms(st_map->image);
>>> bpf_map_init_from_attr(map, attr);
>>> + map->map_flags |= attr->map_flags & BPF_F_LINK;
>>
>> This should have already been done in bpf_map_init_from_attr().
>
> bpf_map_init_from_attr() will filter out all flags except BPF_F_RDONLY &
> BPF_F_WRONLY.
should be the opposite:
static u32 bpf_map_flags_retain_permanent(u32 flags)
{
/* Some map creation flags are not tied to the map object but
* rather to the map fd instead, so they have no meaning upon
* map object inspection since multiple file descriptors with
* different (access) properties can exist here. Thus, given
* this has zero meaning for the map itself, lets clear these
* from here.
*/
return flags & ~(BPF_F_RDONLY | BPF_F_WRONLY);
}
* Re: [PATCH bpf-next 4/7] libbpf: Create a bpf_link in bpf_map__attach_struct_ops().
2023-02-14 22:17 ` [PATCH bpf-next 4/7] libbpf: Create a bpf_link in bpf_map__attach_struct_ops() Kui-Feng Lee
2023-02-15 2:58 ` Stanislav Fomichev
@ 2023-02-16 22:40 ` Andrii Nakryiko
2023-02-16 22:59 ` Martin KaFai Lau
2023-02-18 0:05 ` Kui-Feng Lee
1 sibling, 2 replies; 43+ messages in thread
From: Andrii Nakryiko @ 2023-02-16 22:40 UTC (permalink / raw)
To: Kui-Feng Lee; +Cc: bpf, ast, martin.lau, song, kernel-team, andrii
On Tue, Feb 14, 2023 at 2:17 PM Kui-Feng Lee <kuifeng@meta.com> wrote:
>
> bpf_map__attach_struct_ops() was creating a dummy bpf_link as a
> placeholder, but now it is constructing an authentic one by calling
> bpf_link_create() if the map has the BPF_F_LINK flag.
>
> You can flag a struct_ops map with BPF_F_LINK by calling
> bpf_map__set_map_flags().
>
> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
> ---
> tools/lib/bpf/libbpf.c | 73 +++++++++++++++++++++++++++++++++---------
> 1 file changed, 58 insertions(+), 15 deletions(-)
>
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 75ed95b7e455..1eff6a03ddd9 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -11430,29 +11430,41 @@ struct bpf_link *bpf_program__attach(const struct bpf_program *prog)
> return link;
> }
>
> +struct bpf_link_struct_ops_map {
let's drop the "_map" suffix? struct_ops is always a map, so no need
to point this out
> + struct bpf_link link;
> + int map_fd;
> +};
> +
> static int bpf_link__detach_struct_ops(struct bpf_link *link)
> {
> + struct bpf_link_struct_ops_map *st_link;
> __u32 zero = 0;
>
> - if (bpf_map_delete_elem(link->fd, &zero))
> + st_link = container_of(link, struct bpf_link_struct_ops_map, link);
> +
> + if (st_link->map_fd < 0) {
> + /* Fake bpf_link */
> + if (bpf_map_delete_elem(link->fd, &zero))
> + return -errno;
> + return 0;
> + }
> +
> + if (bpf_map_delete_elem(st_link->map_fd, &zero))
> + return -errno;
> +
> + if (close(link->fd))
> return -errno;
>
> return 0;
> }
>
> -struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
> +/*
> + * Update the map with the prepared vdata.
> + */
> +static int bpf_map__update_vdata(const struct bpf_map *map)
this is internal helper, so let's not use double underscores, just
bpf_map_update_vdata()
> {
> struct bpf_struct_ops *st_ops;
> - struct bpf_link *link;
> __u32 i, zero = 0;
> - int err;
> -
> - if (!bpf_map__is_struct_ops(map) || map->fd == -1)
> - return libbpf_err_ptr(-EINVAL);
> -
> - link = calloc(1, sizeof(*link));
> - if (!link)
> - return libbpf_err_ptr(-EINVAL);
>
> st_ops = map->st_ops;
> for (i = 0; i < btf_vlen(st_ops->type); i++) {
> @@ -11468,17 +11480,48 @@ struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
> *(unsigned long *)kern_data = prog_fd;
> }
>
> - err = bpf_map_update_elem(map->fd, &zero, st_ops->kern_vdata, 0);
> + return bpf_map_update_elem(map->fd, &zero, st_ops->kern_vdata, 0);
> +}
> +
> +struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
> +{
> + struct bpf_link_struct_ops_map *link;
> + int err, fd;
> +
> + if (!bpf_map__is_struct_ops(map) || map->fd == -1)
> + return libbpf_err_ptr(-EINVAL);
> +
> + link = calloc(1, sizeof(*link));
> + if (!link)
> + return libbpf_err_ptr(-EINVAL);
> +
> + err = bpf_map__update_vdata(map);
> if (err) {
> err = -errno;
> free(link);
> return libbpf_err_ptr(err);
> }
>
> - link->detach = bpf_link__detach_struct_ops;
> - link->fd = map->fd;
> + link->link.detach = bpf_link__detach_struct_ops;
>
> - return link;
> + if (!(map->def.map_flags & BPF_F_LINK)) {
So this will always require a programmatic bpf_map__set_map_flags()
call, there is currently no declarative way to do this, right?
Is there any way to avoid this BPF_F_LINK flag approach? How bad would
it be if kernel just always created bpf_link-backed struct_ops?
Alternatively, should we think about SEC(".struct_ops.link") or
something like that to instruct libbpf to add this BPF_F_LINK flag
automatically?
> + /* Fake bpf_link */
> + link->link.fd = map->fd;
> + link->map_fd = -1;
> + return &link->link;
> + }
> +
* Re: [PATCH bpf-next 6/7] libbpf: Update a bpf_link with another struct_ops.
2023-02-14 22:17 ` [PATCH bpf-next 6/7] libbpf: Update a bpf_link with another struct_ops Kui-Feng Lee
@ 2023-02-16 22:48 ` Andrii Nakryiko
2023-02-18 0:22 ` Kui-Feng Lee
0 siblings, 1 reply; 43+ messages in thread
From: Andrii Nakryiko @ 2023-02-16 22:48 UTC (permalink / raw)
To: Kui-Feng Lee; +Cc: bpf, ast, martin.lau, song, kernel-team, andrii
On Tue, Feb 14, 2023 at 2:17 PM Kui-Feng Lee <kuifeng@meta.com> wrote:
>
> Introduce bpf_link__update_struct_ops(), which will allow you to
> effortlessly transition the struct_ops map of any given bpf_link into
> an alternative.
>
> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
> ---
> tools/lib/bpf/libbpf.c | 35 +++++++++++++++++++++++++++++++++++
> tools/lib/bpf/libbpf.h | 1 +
> tools/lib/bpf/libbpf.map | 1 +
> 3 files changed, 37 insertions(+)
>
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 1eff6a03ddd9..6f7c72e312d4 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -11524,6 +11524,41 @@ struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
> return &link->link;
> }
>
> +/*
> + * Swap the back struct_ops of a link with a new struct_ops map.
> + */
> +int bpf_link__update_struct_ops(struct bpf_link *link, const struct bpf_map *map)
we have bpf_link__update_program(), and so the generic counterpart for
map-based links would be bpf_link__update_map(). Let's call it that.
And it shouldn't probably assume so much struct_ops specific things.
> +{
> + struct bpf_link_struct_ops_map *st_ops_link;
> + int err, fd;
> +
> + if (!bpf_map__is_struct_ops(map) || map->fd == -1)
> + return -EINVAL;
> +
> + /* Ensure the type of a link is correct */
> + if (link->detach != bpf_link__detach_struct_ops)
> + return -EINVAL;
> +
> + err = bpf_map__update_vdata(map);
it's a bit weird we do this at attach time, not when bpf_map is
actually instantiated. Should we move this map contents initialization
to bpf_object__load() phase? Same for bpf_map__attach_struct_ops().
What do we lose by doing it after all the BPF programs are loaded in
load phase?
> + if (err) {
> + err = -errno;
> + free(link);
> + return err;
> + }
> +
> + fd = bpf_link_update(link->fd, map->fd, NULL);
> + if (fd < 0) {
> + err = -errno;
> + free(link);
> + return err;
> + }
> +
> + st_ops_link = container_of(link, struct bpf_link_struct_ops_map, link);
> + st_ops_link->map_fd = map->fd;
> +
> + return 0;
> +}
> +
> typedef enum bpf_perf_event_ret (*bpf_perf_event_print_t)(struct perf_event_header *hdr,
> void *private_data);
>
> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
> index 2efd80f6f7b9..dd25cd6759d4 100644
> --- a/tools/lib/bpf/libbpf.h
> +++ b/tools/lib/bpf/libbpf.h
> @@ -695,6 +695,7 @@ bpf_program__attach_freplace(const struct bpf_program *prog,
> struct bpf_map;
>
> LIBBPF_API struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map);
> +LIBBPF_API int bpf_link__update_struct_ops(struct bpf_link *link, const struct bpf_map *map);
let's rename to bpf_link__update_map() and put it next to
bpf_link__update_program() in libbpf.h
>
> struct bpf_iter_attach_opts {
> size_t sz; /* size of this struct for forward/backward compatibility */
> diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
> index 11c36a3c1a9f..ca6993c744b6 100644
> --- a/tools/lib/bpf/libbpf.map
> +++ b/tools/lib/bpf/libbpf.map
> @@ -373,6 +373,7 @@ LIBBPF_1.1.0 {
> global:
> bpf_btf_get_fd_by_id_opts;
> bpf_link_get_fd_by_id_opts;
> + bpf_link__update_struct_ops;
> bpf_map_get_fd_by_id_opts;
> bpf_prog_get_fd_by_id_opts;
> user_ring_buffer__discard;
we are in LIBBPF_1.2.0 already, please move
> --
> 2.30.2
>
* Re: [PATCH bpf-next 7/7] selftests/bpf: Test switching TCP Congestion Control algorithms.
2023-02-14 22:17 ` [PATCH bpf-next 7/7] selftests/bpf: Test switching TCP Congestion Control algorithms Kui-Feng Lee
@ 2023-02-16 22:50 ` Andrii Nakryiko
2023-02-18 0:23 ` Kui-Feng Lee
0 siblings, 1 reply; 43+ messages in thread
From: Andrii Nakryiko @ 2023-02-16 22:50 UTC (permalink / raw)
To: Kui-Feng Lee; +Cc: bpf, ast, martin.lau, song, kernel-team, andrii
On Tue, Feb 14, 2023 at 2:17 PM Kui-Feng Lee <kuifeng@meta.com> wrote:
>
> Create a pair of sockets that utilize the congestion control algorithm
> under a particular name. Then switch up this congestion control
> algorithm to another implementation and check whether newly created
> connections using the same cc name now run the new implementation.
>
> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
> ---
> .../selftests/bpf/prog_tests/bpf_tcp_ca.c | 48 ++++++++++++
> .../selftests/bpf/progs/tcp_ca_update.c | 75 +++++++++++++++++++
> 2 files changed, 123 insertions(+)
> create mode 100644 tools/testing/selftests/bpf/progs/tcp_ca_update.c
>
[...]
> diff --git a/tools/testing/selftests/bpf/progs/tcp_ca_update.c b/tools/testing/selftests/bpf/progs/tcp_ca_update.c
> new file mode 100644
> index 000000000000..cf51fe54ac01
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/tcp_ca_update.c
> @@ -0,0 +1,75 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include "vmlinux.h"
> +
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_tracing.h>
> +
> +char _license[] SEC("license") = "GPL";
> +
> +int ca1_cnt = 0;
> +int ca2_cnt = 0;
> +
> +#define USEC_PER_SEC 1000000UL
> +
> +#define min(a, b) ((a) < (b) ? (a) : (b))
> +
> +static inline struct tcp_sock *tcp_sk(const struct sock *sk)
> +{
> + return (struct tcp_sock *)sk;
> +}
> +
> +SEC("struct_ops/ca_update_init")
> +void BPF_PROG(ca_update_init, struct sock *sk)
> +{
> +#ifdef ENABLE_ATOMICS_TESTS
it's been 2 years since atomics were added to Clang, I think it's fine
to just assume atomic operations are supported and not do the
ENABLE_ATOMICS_TEST (and I'd clean up ENABLE_ATOMICS_TESTS now as
well)
> + __sync_bool_compare_and_swap(&sk->sk_pacing_status, SK_PACING_NONE,
> + SK_PACING_NEEDED);
> +#else
> + sk->sk_pacing_status = SK_PACING_NEEDED;
> +#endif
> +}
> +
> +SEC("struct_ops/ca_update_1_cong_control")
> +void BPF_PROG(ca_update_1_cong_control, struct sock *sk,
> + const struct rate_sample *rs)
> +{
> + ca1_cnt++;
> +}
> +
> +SEC("struct_ops/ca_update_2_cong_control")
> +void BPF_PROG(ca_update_2_cong_control, struct sock *sk,
> + const struct rate_sample *rs)
> +{
> + ca2_cnt++;
> +}
> +
> +SEC("struct_ops/ca_update_ssthresh")
> +__u32 BPF_PROG(ca_update_ssthresh, struct sock *sk)
> +{
> + return tcp_sk(sk)->snd_ssthresh;
> +}
> +
> +SEC("struct_ops/ca_update_undo_cwnd")
> +__u32 BPF_PROG(ca_update_undo_cwnd, struct sock *sk)
> +{
> + return tcp_sk(sk)->snd_cwnd;
> +}
> +
> +SEC(".struct_ops")
> +struct tcp_congestion_ops ca_update_1 = {
> + .init = (void *)ca_update_init,
> + .cong_control = (void *)ca_update_1_cong_control,
> + .ssthresh = (void *)ca_update_ssthresh,
> + .undo_cwnd = (void *)ca_update_undo_cwnd,
> + .name = "tcp_ca_update",
> +};
> +
> +SEC(".struct_ops")
> +struct tcp_congestion_ops ca_update_2 = {
> + .init = (void *)ca_update_init,
> + .cong_control = (void *)ca_update_2_cong_control,
> + .ssthresh = (void *)ca_update_ssthresh,
> + .undo_cwnd = (void *)ca_update_undo_cwnd,
> + .name = "tcp_ca_update",
> +};
> --
> 2.30.2
>
* Re: [PATCH bpf-next 4/7] libbpf: Create a bpf_link in bpf_map__attach_struct_ops().
2023-02-16 22:40 ` Andrii Nakryiko
@ 2023-02-16 22:59 ` Martin KaFai Lau
2023-02-18 0:05 ` Kui-Feng Lee
1 sibling, 0 replies; 43+ messages in thread
From: Martin KaFai Lau @ 2023-02-16 22:59 UTC (permalink / raw)
To: Andrii Nakryiko, Kui-Feng Lee; +Cc: bpf, ast, song, kernel-team, andrii
On 2/16/23 2:40 PM, Andrii Nakryiko wrote:
> So this will always require a programmatic bpf_map__set_map_flags()
> call, there is currently no declarative way to do this, right?
>
> Is there any way to avoid this BPF_F_LINK flag approach? How bad would
> it be if kernel just always created bpf_link-backed struct_ops?
It still needs to support the per-link behavior.
> Alternatively, should we think about SEC(".struct_ops.link") or
> something like that to instruct libbpf to add this BPF_F_LINK flag
> automatically?
I like this idea. It is easier for users whose struct_ops is always
link-only. Users can also stay with SEC(".struct_ops") and decide later
whether BPF_F_LINK should be set, depending on the runtime configuration
and environment, such as the kernel version, etc.
* Re: [PATCH bpf-next 3/7] bpf: Register and unregister a struct_ops by their bpf_links.
2023-02-16 22:38 ` Martin KaFai Lau
@ 2023-02-17 22:17 ` Kui-Feng Lee
0 siblings, 0 replies; 43+ messages in thread
From: Kui-Feng Lee @ 2023-02-17 22:17 UTC (permalink / raw)
To: Martin KaFai Lau, Kui-Feng Lee; +Cc: bpf, ast, song, kernel-team, andrii
On 2/16/23 14:38, Martin KaFai Lau wrote:
> On 2/16/23 8:42 AM, Kui-Feng Lee wrote:
>>>> @@ -638,6 +647,8 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr)
>>>> set_vm_flush_reset_perms(st_map->image);
>>>> bpf_map_init_from_attr(map, attr);
>>>> + map->map_flags |= attr->map_flags & BPF_F_LINK;
>>>
>>> This should have already been done in bpf_map_init_from_attr().
>>
>> bpf_map_init_from_attr() will filter out all flags except BPF_F_RDONLY
>> & BPF_F_WRONLY.
>
> should be the opposite:
>
> static u32 bpf_map_flags_retain_permanent(u32 flags)
> {
> /* Some map creation flags are not tied to the map object but
> * rather to the map fd instead, so they have no meaning upon
> * map object inspection since multiple file descriptors with
> * different (access) properties can exist here. Thus, given
> * this has zero meaning for the map itself, lets clear these
> * from here.
> */
> return flags & ~(BPF_F_RDONLY | BPF_F_WRONLY);
> }
Got it! Thank you!
* Re: [PATCH bpf-next 4/7] libbpf: Create a bpf_link in bpf_map__attach_struct_ops().
2023-02-16 22:40 ` Andrii Nakryiko
2023-02-16 22:59 ` Martin KaFai Lau
@ 2023-02-18 0:05 ` Kui-Feng Lee
2023-02-18 1:08 ` Andrii Nakryiko
1 sibling, 1 reply; 43+ messages in thread
From: Kui-Feng Lee @ 2023-02-18 0:05 UTC (permalink / raw)
To: Andrii Nakryiko, Kui-Feng Lee
Cc: bpf, ast, martin.lau, song, kernel-team, andrii
On 2/16/23 14:40, Andrii Nakryiko wrote:
> On Tue, Feb 14, 2023 at 2:17 PM Kui-Feng Lee <kuifeng@meta.com> wrote:
>>
>> bpf_map__attach_struct_ops() was creating a dummy bpf_link as a
>> placeholder, but now it is constructing an authentic one by calling
>> bpf_link_create() if the map has the BPF_F_LINK flag.
>>
>> You can flag a struct_ops map with BPF_F_LINK by calling
>> bpf_map__set_map_flags().
>>
>> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
>> ---
>> tools/lib/bpf/libbpf.c | 73 +++++++++++++++++++++++++++++++++---------
>> 1 file changed, 58 insertions(+), 15 deletions(-)
>>
>> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
>> index 75ed95b7e455..1eff6a03ddd9 100644
>> --- a/tools/lib/bpf/libbpf.c
>> +++ b/tools/lib/bpf/libbpf.c
>> @@ -11430,29 +11430,41 @@ struct bpf_link *bpf_program__attach(const struct bpf_program *prog)
>> return link;
>> }
>>
>> +struct bpf_link_struct_ops_map {
>
> let's drop the "_map" suffix? struct_ops is always a map, so no need
> to point this out
Sure!
>
>> + struct bpf_link link;
>> + int map_fd;
>> +};
>> +
>> static int bpf_link__detach_struct_ops(struct bpf_link *link)
>> {
>> + struct bpf_link_struct_ops_map *st_link;
>> __u32 zero = 0;
>>
>> - if (bpf_map_delete_elem(link->fd, &zero))
>> + st_link = container_of(link, struct bpf_link_struct_ops_map, link);
>> +
>> + if (st_link->map_fd < 0) {
>> + /* Fake bpf_link */
>> + if (bpf_map_delete_elem(link->fd, &zero))
>> + return -errno;
>> + return 0;
>> + }
>> +
>> + if (bpf_map_delete_elem(st_link->map_fd, &zero))
>> + return -errno;
>> +
>> + if (close(link->fd))
>> return -errno;
>>
>> return 0;
>> }
>>
>> -struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
>> +/*
>> + * Update the map with the prepared vdata.
>> + */
>> +static int bpf_map__update_vdata(const struct bpf_map *map)
>
> this is internal helper, so let's not use double underscores, just
> bpf_map_update_vdata()
Ok!
>
>> {
>> struct bpf_struct_ops *st_ops;
>> - struct bpf_link *link;
>> __u32 i, zero = 0;
>> - int err;
>> -
>> - if (!bpf_map__is_struct_ops(map) || map->fd == -1)
>> - return libbpf_err_ptr(-EINVAL);
>> -
>> - link = calloc(1, sizeof(*link));
>> - if (!link)
>> - return libbpf_err_ptr(-EINVAL);
>>
>> st_ops = map->st_ops;
>> for (i = 0; i < btf_vlen(st_ops->type); i++) {
>> @@ -11468,17 +11480,48 @@ struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
>> *(unsigned long *)kern_data = prog_fd;
>> }
>>
>> - err = bpf_map_update_elem(map->fd, &zero, st_ops->kern_vdata, 0);
>> + return bpf_map_update_elem(map->fd, &zero, st_ops->kern_vdata, 0);
>> +}
>> +
>> +struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
>> +{
>> + struct bpf_link_struct_ops_map *link;
>> + int err, fd;
>> +
>> + if (!bpf_map__is_struct_ops(map) || map->fd == -1)
>> + return libbpf_err_ptr(-EINVAL);
>> +
>> + link = calloc(1, sizeof(*link));
>> + if (!link)
>> + return libbpf_err_ptr(-EINVAL);
>> +
>> + err = bpf_map__update_vdata(map);
>> if (err) {
>> err = -errno;
>> free(link);
>> return libbpf_err_ptr(err);
>> }
>>
>> - link->detach = bpf_link__detach_struct_ops;
>> - link->fd = map->fd;
>> + link->link.detach = bpf_link__detach_struct_ops;
>>
>> - return link;
>> + if (!(map->def.map_flags & BPF_F_LINK)) {
>
> So this will always require a programmatic bpf_map__set_map_flags()
> call, there is currently no declarative way to do this, right?
>
> Is there any way to avoid this BPF_F_LINK flag approach? How bad would
> it be if kernel just always created bpf_link-backed struct_ops?
>
> Alternatively, should we think about SEC(".struct_ops.link") or
> something like that to instruct libbpf to add this BPF_F_LINK flag
> automatically?
Agree!
The other solution is to add a flag when declaring a struct_ops:
SEC(".struct_ops")
tcp_congestion_ops ops = {
...
.flags = WITH_LINK,
}
>
>> + /* Fake bpf_link */
>> + link->link.fd = map->fd;
>> + link->map_fd = -1;
>> + return &link->link;
>> + }
>> +
* Re: [PATCH bpf-next 6/7] libbpf: Update a bpf_link with another struct_ops.
2023-02-16 22:48 ` Andrii Nakryiko
@ 2023-02-18 0:22 ` Kui-Feng Lee
2023-02-18 1:10 ` Andrii Nakryiko
0 siblings, 1 reply; 43+ messages in thread
From: Kui-Feng Lee @ 2023-02-18 0:22 UTC (permalink / raw)
To: Andrii Nakryiko, Kui-Feng Lee
Cc: bpf, ast, martin.lau, song, kernel-team, andrii
On 2/16/23 14:48, Andrii Nakryiko wrote:
> On Tue, Feb 14, 2023 at 2:17 PM Kui-Feng Lee <kuifeng@meta.com> wrote:
>>
>> Introduce bpf_link__update_struct_ops(), which will allow you to
>> effortlessly transition the struct_ops map of any given bpf_link into
>> an alternative.
>>
>> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
>> ---
>> tools/lib/bpf/libbpf.c | 35 +++++++++++++++++++++++++++++++++++
>> tools/lib/bpf/libbpf.h | 1 +
>> tools/lib/bpf/libbpf.map | 1 +
>> 3 files changed, 37 insertions(+)
>>
>> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
>> index 1eff6a03ddd9..6f7c72e312d4 100644
>> --- a/tools/lib/bpf/libbpf.c
>> +++ b/tools/lib/bpf/libbpf.c
>> @@ -11524,6 +11524,41 @@ struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
>> return &link->link;
>> }
>>
>> +/*
>> + * Swap the back struct_ops of a link with a new struct_ops map.
>> + */
>> +int bpf_link__update_struct_ops(struct bpf_link *link, const struct bpf_map *map)
>
> we have bpf_link__update_program(), and so the generic counterpart for
> map-based links would be bpf_link__update_map(). Let's call it that.
> And it shouldn't probably assume so much struct_ops specific things.
Sure
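
For context, the loader-side flow with the renamed API could look like this. This is a sketch only: it assumes the bpf_link__update_map() name agreed on here, the BPF_F_LINK flag from patch 4, and the tcp_ca_update skeleton from patch 7.

```c
/* Userspace sketch: attach one struct_ops map, then swap in another
 * implementation behind the same link.  All names outside libbpf's
 * existing API are assumptions from this series.
 */
struct tcp_ca_update *skel;
struct bpf_link *link;
int err;

skel = tcp_ca_update__open();
/* Flags must be set before load so the map is created with BPF_F_LINK. */
bpf_map__set_map_flags(skel->maps.ca_update_1, BPF_F_LINK);
bpf_map__set_map_flags(skel->maps.ca_update_2, BPF_F_LINK);
err = tcp_ca_update__load(skel);

link = bpf_map__attach_struct_ops(skel->maps.ca_update_1);
/* connections using "tcp_ca_update" now run ca_update_1 */

err = bpf_link__update_map(link, skel->maps.ca_update_2);
/* new connections under the same name run ca_update_2 */

bpf_link__destroy(link);
tcp_ca_update__destroy(skel);
```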
>
>> +{
>> + struct bpf_link_struct_ops_map *st_ops_link;
>> + int err, fd;
>> +
>> + if (!bpf_map__is_struct_ops(map) || map->fd == -1)
>> + return -EINVAL;
>> +
>> + /* Ensure the type of a link is correct */
>> + if (link->detach != bpf_link__detach_struct_ops)
>> + return -EINVAL;
>> +
>> + err = bpf_map__update_vdata(map);
>
> it's a bit weird we do this at attach time, not when bpf_map is
> actually instantiated. Should we move this map contents initialization
> to bpf_object__load() phase? Same for bpf_map__attach_struct_ops().
> What do we lose by doing it after all the BPF programs are loaded in
> load phase?
With the current behavior (w/o links), a struct_ops is registered when
its value is updated. If we move bpf_map__update_vdata() to
bpf_object__load(), a congestion control algorithm would be activated at
load time, before it is attached. However, an algorithm should only be
activated at attach time.
>
>> + if (err) {
>> + err = -errno;
>> + free(link);
>> + return err;
>> + }
>> +
>> + fd = bpf_link_update(link->fd, map->fd, NULL);
>> + if (fd < 0) {
>> + err = -errno;
>> + free(link);
>> + return err;
>> + }
>> +
>> + st_ops_link = container_of(link, struct bpf_link_struct_ops_map, link);
>> + st_ops_link->map_fd = map->fd;
>> +
>> + return 0;
>> +}
>> +
>> typedef enum bpf_perf_event_ret (*bpf_perf_event_print_t)(struct perf_event_header *hdr,
>> void *private_data);
>>
>> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
>> index 2efd80f6f7b9..dd25cd6759d4 100644
>> --- a/tools/lib/bpf/libbpf.h
>> +++ b/tools/lib/bpf/libbpf.h
>> @@ -695,6 +695,7 @@ bpf_program__attach_freplace(const struct bpf_program *prog,
>> struct bpf_map;
>>
>> LIBBPF_API struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map);
>> +LIBBPF_API int bpf_link__update_struct_ops(struct bpf_link *link, const struct bpf_map *map);
>
> let's rename to bpf_link__update_map() and put it next to
> bpf_link__update_program() in libbpf.h
>
>>
>> struct bpf_iter_attach_opts {
>> size_t sz; /* size of this struct for forward/backward compatibility */
>> diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
>> index 11c36a3c1a9f..ca6993c744b6 100644
>> --- a/tools/lib/bpf/libbpf.map
>> +++ b/tools/lib/bpf/libbpf.map
>> @@ -373,6 +373,7 @@ LIBBPF_1.1.0 {
>> global:
>> bpf_btf_get_fd_by_id_opts;
>> bpf_link_get_fd_by_id_opts;
>> + bpf_link__update_struct_ops;
>> bpf_map_get_fd_by_id_opts;
>> bpf_prog_get_fd_by_id_opts;
>> user_ring_buffer__discard;
>
> we are in LIBBPF_1.2.0 already, please move
>
>
>> --
>> 2.30.2
>>
* Re: [PATCH bpf-next 7/7] selftests/bpf: Test switching TCP Congestion Control algorithms.
2023-02-16 22:50 ` Andrii Nakryiko
@ 2023-02-18 0:23 ` Kui-Feng Lee
0 siblings, 0 replies; 43+ messages in thread
From: Kui-Feng Lee @ 2023-02-18 0:23 UTC (permalink / raw)
To: Andrii Nakryiko, Kui-Feng Lee
Cc: bpf, ast, martin.lau, song, kernel-team, andrii
On 2/16/23 14:50, Andrii Nakryiko wrote:
> On Tue, Feb 14, 2023 at 2:17 PM Kui-Feng Lee <kuifeng@meta.com> wrote:
>>
>> Create a pair of sockets that utilize the congestion control algorithm
>> under a particular name. Then switch up this congestion control
>> algorithm to another implementation and check whether newly created
>> connections using the same cc name now run the new implementation.
>>
>> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
>> ---
>> .../selftests/bpf/prog_tests/bpf_tcp_ca.c | 48 ++++++++++++
>> .../selftests/bpf/progs/tcp_ca_update.c | 75 +++++++++++++++++++
>> 2 files changed, 123 insertions(+)
>> create mode 100644 tools/testing/selftests/bpf/progs/tcp_ca_update.c
>>
>
> [...]
>
>> diff --git a/tools/testing/selftests/bpf/progs/tcp_ca_update.c b/tools/testing/selftests/bpf/progs/tcp_ca_update.c
>> new file mode 100644
>> index 000000000000..cf51fe54ac01
>> --- /dev/null
>> +++ b/tools/testing/selftests/bpf/progs/tcp_ca_update.c
>> @@ -0,0 +1,75 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +
>> +#include "vmlinux.h"
>> +
>> +#include <bpf/bpf_helpers.h>
>> +#include <bpf/bpf_tracing.h>
>> +
>> +char _license[] SEC("license") = "GPL";
>> +
>> +int ca1_cnt = 0;
>> +int ca2_cnt = 0;
>> +
>> +#define USEC_PER_SEC 1000000UL
>> +
>> +#define min(a, b) ((a) < (b) ? (a) : (b))
>> +
>> +static inline struct tcp_sock *tcp_sk(const struct sock *sk)
>> +{
>> + return (struct tcp_sock *)sk;
>> +}
>> +
>> +SEC("struct_ops/ca_update_init")
>> +void BPF_PROG(ca_update_init, struct sock *sk)
>> +{
>> +#ifdef ENABLE_ATOMICS_TESTS
>
> it's been 2 years since atomics were added to Clang, I think it's fine
> to just assume atomic operations are supported and not do the
> ENABLE_ATOMICS_TEST (and I'd clean up ENABLE_ATOMICS_TESTS now as
> well)
Sure!
>
>> + __sync_bool_compare_and_swap(&sk->sk_pacing_status, SK_PACING_NONE,
>> + SK_PACING_NEEDED);
>> +#else
>> + sk->sk_pacing_status = SK_PACING_NEEDED;
>> +#endif
>> +}
>> +
>> +SEC("struct_ops/ca_update_1_cong_control")
>> +void BPF_PROG(ca_update_1_cong_control, struct sock *sk,
>> + const struct rate_sample *rs)
>> +{
>> + ca1_cnt++;
>> +}
>> +
>> +SEC("struct_ops/ca_update_2_cong_control")
>> +void BPF_PROG(ca_update_2_cong_control, struct sock *sk,
>> + const struct rate_sample *rs)
>> +{
>> + ca2_cnt++;
>> +}
>> +
>> +SEC("struct_ops/ca_update_ssthresh")
>> +__u32 BPF_PROG(ca_update_ssthresh, struct sock *sk)
>> +{
>> + return tcp_sk(sk)->snd_ssthresh;
>> +}
>> +
>> +SEC("struct_ops/ca_update_undo_cwnd")
>> +__u32 BPF_PROG(ca_update_undo_cwnd, struct sock *sk)
>> +{
>> + return tcp_sk(sk)->snd_cwnd;
>> +}
>> +
>> +SEC(".struct_ops")
>> +struct tcp_congestion_ops ca_update_1 = {
>> + .init = (void *)ca_update_init,
>> + .cong_control = (void *)ca_update_1_cong_control,
>> + .ssthresh = (void *)ca_update_ssthresh,
>> + .undo_cwnd = (void *)ca_update_undo_cwnd,
>> + .name = "tcp_ca_update",
>> +};
>> +
>> +SEC(".struct_ops")
>> +struct tcp_congestion_ops ca_update_2 = {
>> + .init = (void *)ca_update_init,
>> + .cong_control = (void *)ca_update_2_cong_control,
>> + .ssthresh = (void *)ca_update_ssthresh,
>> + .undo_cwnd = (void *)ca_update_undo_cwnd,
>> + .name = "tcp_ca_update",
>> +};
>> --
>> 2.30.2
>>
* Re: [PATCH bpf-next 4/7] libbpf: Create a bpf_link in bpf_map__attach_struct_ops().
2023-02-18 0:05 ` Kui-Feng Lee
@ 2023-02-18 1:08 ` Andrii Nakryiko
0 siblings, 0 replies; 43+ messages in thread
From: Andrii Nakryiko @ 2023-02-18 1:08 UTC (permalink / raw)
To: Kui-Feng Lee
Cc: Kui-Feng Lee, bpf, ast, martin.lau, song, kernel-team, andrii
On Fri, Feb 17, 2023 at 4:05 PM Kui-Feng Lee <sinquersw@gmail.com> wrote:
>
>
>
> On 2/16/23 14:40, Andrii Nakryiko wrote:
> > On Tue, Feb 14, 2023 at 2:17 PM Kui-Feng Lee <kuifeng@meta.com> wrote:
> >>
> >> bpf_map__attach_struct_ops() was creating a dummy bpf_link as a
> >> placeholder, but now it is constructing an authentic one by calling
> >> bpf_link_create() if the map has the BPF_F_LINK flag.
> >>
> >> You can flag a struct_ops map with BPF_F_LINK by calling
> >> bpf_map__set_map_flags().
> >>
> >> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
> >> ---
> >> tools/lib/bpf/libbpf.c | 73 +++++++++++++++++++++++++++++++++---------
> >> 1 file changed, 58 insertions(+), 15 deletions(-)
> >>
> >> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> >> index 75ed95b7e455..1eff6a03ddd9 100644
> >> --- a/tools/lib/bpf/libbpf.c
> >> +++ b/tools/lib/bpf/libbpf.c
> >> @@ -11430,29 +11430,41 @@ struct bpf_link *bpf_program__attach(const struct bpf_program *prog)
> >> return link;
> >> }
> >>
> >> +struct bpf_link_struct_ops_map {
> >
> > let's drop the "_map" suffix? struct_ops is always a map, so no need
> > to point this out
>
> Sure!
>
> >
> >> + struct bpf_link link;
> >> + int map_fd;
> >> +};
> >> +
> >> static int bpf_link__detach_struct_ops(struct bpf_link *link)
> >> {
> >> + struct bpf_link_struct_ops_map *st_link;
> >> __u32 zero = 0;
> >>
> >> - if (bpf_map_delete_elem(link->fd, &zero))
> >> + st_link = container_of(link, struct bpf_link_struct_ops_map, link);
> >> +
> >> + if (st_link->map_fd < 0) {
> >> + /* Fake bpf_link */
> >> + if (bpf_map_delete_elem(link->fd, &zero))
> >> + return -errno;
> >> + return 0;
> >> + }
> >> +
> >> + if (bpf_map_delete_elem(st_link->map_fd, &zero))
> >> + return -errno;
> >> +
> >> + if (close(link->fd))
> >> return -errno;
> >>
> >> return 0;
> >> }
> >>
> >> -struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
> >> +/*
> >> + * Update the map with the prepared vdata.
> >> + */
> >> +static int bpf_map__update_vdata(const struct bpf_map *map)
> >
> > this is internal helper, so let's not use double underscores, just
> > bpf_map_update_vdata()
>
> Ok!
>
> >
> >> {
> >> struct bpf_struct_ops *st_ops;
> >> - struct bpf_link *link;
> >> __u32 i, zero = 0;
> >> - int err;
> >> -
> >> - if (!bpf_map__is_struct_ops(map) || map->fd == -1)
> >> - return libbpf_err_ptr(-EINVAL);
> >> -
> >> - link = calloc(1, sizeof(*link));
> >> - if (!link)
> >> - return libbpf_err_ptr(-EINVAL);
> >>
> >> st_ops = map->st_ops;
> >> for (i = 0; i < btf_vlen(st_ops->type); i++) {
> >> @@ -11468,17 +11480,48 @@ struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
> >> *(unsigned long *)kern_data = prog_fd;
> >> }
> >>
> >> - err = bpf_map_update_elem(map->fd, &zero, st_ops->kern_vdata, 0);
> >> + return bpf_map_update_elem(map->fd, &zero, st_ops->kern_vdata, 0);
> >> +}
> >> +
> >> +struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
> >> +{
> >> + struct bpf_link_struct_ops_map *link;
> >> + int err, fd;
> >> +
> >> + if (!bpf_map__is_struct_ops(map) || map->fd == -1)
> >> + return libbpf_err_ptr(-EINVAL);
> >> +
> >> + link = calloc(1, sizeof(*link));
> >> + if (!link)
> >> + return libbpf_err_ptr(-EINVAL);
> >> +
> >> + err = bpf_map__update_vdata(map);
> >> if (err) {
> >> err = -errno;
> >> free(link);
> >> return libbpf_err_ptr(err);
> >> }
> >>
> >> - link->detach = bpf_link__detach_struct_ops;
> >> - link->fd = map->fd;
> >> + link->link.detach = bpf_link__detach_struct_ops;
> >>
> >> - return link;
> >> + if (!(map->def.map_flags & BPF_F_LINK)) {
> >
> > So this will always require a programmatic bpf_map__set_map_flags()
> > call, there is currently no declarative way to do this, right?
> >
> > Is there any way to avoid this BPF_F_LINK flag approach? How bad would
> > it be if kernel just always created bpf_link-backed struct_ops?
> >
> > Alternatively, should we think about SEC(".struct_ops.link") or
> > something like that to instruct libbpf to add this BPF_F_LINK flag
> > automatically?
>
> Agree!
>
> The other solution is to add a flag when declaring a struct_ops:
>
> SEC(".struct_ops")
> tcp_congestion_ops ops = {
> ...
> .flags = WITH_LINK,
> }
>
tcp_congestion_ops is defined in the kernel and used by kernel-internal
code. I don't think setting and passing an extra flag is a generic
solution that will work for all struct_ops kinds.
>
> >
> >> + /* Fake bpf_link */
> >> + link->link.fd = map->fd;
> >> + link->map_fd = -1;
> >> + return &link->link;
> >> + }
> >> +
* Re: [PATCH bpf-next 6/7] libbpf: Update a bpf_link with another struct_ops.
2023-02-18 0:22 ` Kui-Feng Lee
@ 2023-02-18 1:10 ` Andrii Nakryiko
2023-02-21 22:20 ` Kui-Feng Lee
0 siblings, 1 reply; 43+ messages in thread
From: Andrii Nakryiko @ 2023-02-18 1:10 UTC (permalink / raw)
To: Kui-Feng Lee
Cc: Kui-Feng Lee, bpf, ast, martin.lau, song, kernel-team, andrii
On Fri, Feb 17, 2023 at 4:22 PM Kui-Feng Lee <sinquersw@gmail.com> wrote:
>
>
>
> On 2/16/23 14:48, Andrii Nakryiko wrote:
> > On Tue, Feb 14, 2023 at 2:17 PM Kui-Feng Lee <kuifeng@meta.com> wrote:
> >>
> >> Introduce bpf_link__update_struct_ops(), which will allow you to
> >> effortlessly transition the struct_ops map of any given bpf_link into
> >> an alternative.
> >>
> >> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
> >> ---
> >> tools/lib/bpf/libbpf.c | 35 +++++++++++++++++++++++++++++++++++
> >> tools/lib/bpf/libbpf.h | 1 +
> >> tools/lib/bpf/libbpf.map | 1 +
> >> 3 files changed, 37 insertions(+)
> >>
> >> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> >> index 1eff6a03ddd9..6f7c72e312d4 100644
> >> --- a/tools/lib/bpf/libbpf.c
> >> +++ b/tools/lib/bpf/libbpf.c
> >> @@ -11524,6 +11524,41 @@ struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
> >> return &link->link;
> >> }
> >>
> >> +/*
> >> + * Swap the back struct_ops of a link with a new struct_ops map.
> >> + */
> >> +int bpf_link__update_struct_ops(struct bpf_link *link, const struct bpf_map *map)
> >
> > we have bpf_link__update_program(), and so the generic counterpart for
> > map-based links would be bpf_link__update_map(). Let's call it that.
> > And it shouldn't probably assume so much struct_ops specific things.
>
> Sure
>
> >
> >> +{
> >> + struct bpf_link_struct_ops_map *st_ops_link;
> >> + int err, fd;
> >> +
> >> + if (!bpf_map__is_struct_ops(map) || map->fd == -1)
> >> + return -EINVAL;
> >> +
> >> + /* Ensure the type of a link is correct */
> >> + if (link->detach != bpf_link__detach_struct_ops)
> >> + return -EINVAL;
> >> +
> >> + err = bpf_map__update_vdata(map);
> >
> > it's a bit weird we do this at attach time, not when bpf_map is
> > actually instantiated. Should we move this map contents initialization
> > to bpf_object__load() phase? Same for bpf_map__attach_struct_ops().
> > What do we lose by doing it after all the BPF programs are loaded in
> > load phase?
>
> With the current behavior (w/o links), a struct_ops is registered when
> its value is updated. If we move bpf_map__update_vdata() to
> bpf_object__load(), a congestion control algorithm would be activated at
> load time, before it is attached. However, an algorithm should only be
> activated at attach time.
>
Of course. But I was thinking of moving the `bpf_map_update_elem(map->fd,
&zero, st_ops->kern_vdata, 0);` call out of bpf_map__update_vdata() and
making update_vdata() just prepare st_ops->kern_vdata.
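
Concretely, the split being discussed could look roughly like the sketch below. The helper name is taken from the follow-up in this thread, and the loop body mirrors the one already in bpf_map__update_vdata(), so treat both as assumptions about where the series ends up:

```c
/* Sketch: separate the pure in-memory preparation of
 * st_ops->kern_vdata from the map update that actually registers
 * the struct_ops with the kernel.
 */
static void bpf_map_prepare_vdata(const struct bpf_map *map)
{
	struct bpf_struct_ops *st_ops = map->st_ops;
	__u32 i;

	for (i = 0; i < btf_vlen(st_ops->type); i++) {
		struct bpf_program *prog = st_ops->progs[i];
		void *kern_data;
		int prog_fd;

		if (!prog)
			continue;

		prog_fd = bpf_program__fd(prog);
		kern_data = st_ops->kern_vdata + st_ops->kern_func_off[i];
		*(unsigned long *)kern_data = prog_fd;
	}
}
```

Attach and update paths would then call bpf_map_update_elem(map->fd, &zero, st_ops->kern_vdata, 0) explicitly, keeping registration at attach time.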
>
> >
> >> + if (err) {
> >> + err = -errno;
> >> + free(link);
> >> + return err;
> >> + }
> >> +
> >> + fd = bpf_link_update(link->fd, map->fd, NULL);
> >> + if (fd < 0) {
> >> + err = -errno;
> >> + free(link);
> >> + return err;
> >> + }
> >> +
> >> + st_ops_link = container_of(link, struct bpf_link_struct_ops_map, link);
> >> + st_ops_link->map_fd = map->fd;
> >> +
> >> + return 0;
> >> +}
> >> +
> >> typedef enum bpf_perf_event_ret (*bpf_perf_event_print_t)(struct perf_event_header *hdr,
> >> void *private_data);
> >>
> >> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
> >> index 2efd80f6f7b9..dd25cd6759d4 100644
> >> --- a/tools/lib/bpf/libbpf.h
> >> +++ b/tools/lib/bpf/libbpf.h
> >> @@ -695,6 +695,7 @@ bpf_program__attach_freplace(const struct bpf_program *prog,
> >> struct bpf_map;
> >>
> >> LIBBPF_API struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map);
> >> +LIBBPF_API int bpf_link__update_struct_ops(struct bpf_link *link, const struct bpf_map *map);
> >
> > let's rename to bpf_link__update_map() and put it next to
> > bpf_link__update_program() in libbpf.h
> >
> >>
> >> struct bpf_iter_attach_opts {
> >> size_t sz; /* size of this struct for forward/backward compatibility */
> >> diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
> >> index 11c36a3c1a9f..ca6993c744b6 100644
> >> --- a/tools/lib/bpf/libbpf.map
> >> +++ b/tools/lib/bpf/libbpf.map
> >> @@ -373,6 +373,7 @@ LIBBPF_1.1.0 {
> >> global:
> >> bpf_btf_get_fd_by_id_opts;
> >> bpf_link_get_fd_by_id_opts;
> >> + bpf_link__update_struct_ops;
> >> bpf_map_get_fd_by_id_opts;
> >> bpf_prog_get_fd_by_id_opts;
> >> user_ring_buffer__discard;
> >
> > we are in LIBBPF_1.2.0 already, please move
> >
> >
> >> --
> >> 2.30.2
> >>
* Re: [PATCH bpf-next 6/7] libbpf: Update a bpf_link with another struct_ops.
2023-02-18 1:10 ` Andrii Nakryiko
@ 2023-02-21 22:20 ` Kui-Feng Lee
0 siblings, 0 replies; 43+ messages in thread
From: Kui-Feng Lee @ 2023-02-21 22:20 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Kui-Feng Lee, bpf, ast, martin.lau, song, kernel-team, andrii
On 2/17/23 17:10, Andrii Nakryiko wrote:
> On Fri, Feb 17, 2023 at 4:22 PM Kui-Feng Lee <sinquersw@gmail.com> wrote:
>>
>>
>>
>> On 2/16/23 14:48, Andrii Nakryiko wrote:
>>> On Tue, Feb 14, 2023 at 2:17 PM Kui-Feng Lee <kuifeng@meta.com> wrote:
>>>>
>>>> Introduce bpf_link__update_struct_ops(), which will allow you to
>>>> effortlessly transition the struct_ops map of any given bpf_link into
>>>> an alternative.
>>>>
>>>> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
>>>> ---
>>>> tools/lib/bpf/libbpf.c | 35 +++++++++++++++++++++++++++++++++++
>>>> tools/lib/bpf/libbpf.h | 1 +
>>>> tools/lib/bpf/libbpf.map | 1 +
>>>> 3 files changed, 37 insertions(+)
>>>>
>>>> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
>>>> index 1eff6a03ddd9..6f7c72e312d4 100644
>>>> --- a/tools/lib/bpf/libbpf.c
>>>> +++ b/tools/lib/bpf/libbpf.c
>>>> @@ -11524,6 +11524,41 @@ struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
>>>> return &link->link;
>>>> }
>>>>
>>>> +/*
>>>> + * Swap the back struct_ops of a link with a new struct_ops map.
>>>> + */
>>>> +int bpf_link__update_struct_ops(struct bpf_link *link, const struct bpf_map *map)
>>>
>>> we have bpf_link__update_program(), and so the generic counterpart for
>>> map-based links would be bpf_link__update_map(). Let's call it that.
>>> And it shouldn't probably assume so much struct_ops specific things.
>>
>> Sure
>>
>>>
>>>> +{
>>>> + struct bpf_link_struct_ops_map *st_ops_link;
>>>> + int err, fd;
>>>> +
>>>> + if (!bpf_map__is_struct_ops(map) || map->fd == -1)
>>>> + return -EINVAL;
>>>> +
>>>> + /* Ensure the type of a link is correct */
>>>> + if (link->detach != bpf_link__detach_struct_ops)
>>>> + return -EINVAL;
>>>> +
>>>> + err = bpf_map__update_vdata(map);
>>>
>>> it's a bit weird we do this at attach time, not when bpf_map is
>>> actually instantiated. Should we move this map contents initialization
>>> to bpf_object__load() phase? Same for bpf_map__attach_struct_ops().
>>> What do we lose by doing it after all the BPF programs are loaded in
>>> load phase?
>>
>> With the current behavior (w/o links), a struct_ops is registered when
>> its value is updated. If we move bpf_map__update_vdata() to
>> bpf_object__load(), a congestion control algorithm would be activated at
>> load time, before it is attached. However, an algorithm should only be
>> activated at attach time.
>>
>
> Of course. But I was thinking of moving the `bpf_map_update_elem(map->fd,
> &zero, st_ops->kern_vdata, 0);` call out of bpf_map__update_vdata() and
> making update_vdata() just prepare st_ops->kern_vdata.
Ok! I will rename it to bpf_map_prepare_vdata() and call
bpf_map_update_elem() separately.
end of thread, other threads:[~2023-02-21 22:20 UTC | newest]
Thread overview: 43+ messages
2023-02-14 22:17 [PATCH bpf-next 0/7] Transit between BPF TCP congestion controls Kui-Feng Lee
2023-02-14 22:17 ` [PATCH bpf-next 1/7] bpf: Create links for BPF struct_ops maps Kui-Feng Lee
2023-02-15 0:26 ` kernel test robot
2023-02-15 2:39 ` Stanislav Fomichev
2023-02-15 18:04 ` Kui-Feng Lee
2023-02-15 18:44 ` Stanislav Fomichev
2023-02-15 20:24 ` Kui-Feng Lee
2023-02-15 21:28 ` Martin KaFai Lau
2023-02-15 20:30 ` Martin KaFai Lau
2023-02-15 20:55 ` Kui-Feng Lee
2023-02-15 22:58 ` Martin KaFai Lau
2023-02-16 17:59 ` Kui-Feng Lee
2023-02-14 22:17 ` [PATCH bpf-next 2/7] net: Update an existing TCP congestion control algorithm Kui-Feng Lee
2023-02-15 2:43 ` Stanislav Fomichev
2023-02-15 18:15 ` Kui-Feng Lee
2023-02-14 22:17 ` [PATCH bpf-next 3/7] bpf: Register and unregister a struct_ops by their bpf_links Kui-Feng Lee
2023-02-15 2:53 ` Stanislav Fomichev
2023-02-15 18:29 ` Kui-Feng Lee
2023-02-16 0:37 ` Martin KaFai Lau
2023-02-16 16:42 ` Kui-Feng Lee
2023-02-16 22:38 ` Martin KaFai Lau
2023-02-17 22:17 ` Kui-Feng Lee
2023-02-14 22:17 ` [PATCH bpf-next 4/7] libbpf: Create a bpf_link in bpf_map__attach_struct_ops() Kui-Feng Lee
2023-02-15 2:58 ` Stanislav Fomichev
2023-02-15 18:44 ` Kui-Feng Lee
2023-02-15 18:48 ` Stanislav Fomichev
2023-02-15 22:20 ` Kui-Feng Lee
2023-02-16 22:40 ` Andrii Nakryiko
2023-02-16 22:59 ` Martin KaFai Lau
2023-02-18 0:05 ` Kui-Feng Lee
2023-02-18 1:08 ` Andrii Nakryiko
2023-02-14 22:17 ` [PATCH bpf-next 5/7] bpf: Update the struct_ops of a bpf_link Kui-Feng Lee
2023-02-16 1:02 ` Martin KaFai Lau
2023-02-16 19:17 ` Kui-Feng Lee
2023-02-16 19:40 ` Martin KaFai Lau
2023-02-14 22:17 ` [PATCH bpf-next 6/7] libbpf: Update a bpf_link with another struct_ops Kui-Feng Lee
2023-02-16 22:48 ` Andrii Nakryiko
2023-02-18 0:22 ` Kui-Feng Lee
2023-02-18 1:10 ` Andrii Nakryiko
2023-02-21 22:20 ` Kui-Feng Lee
2023-02-14 22:17 ` [PATCH bpf-next 7/7] selftests/bpf: Test switching TCP Congestion Control algorithms Kui-Feng Lee
2023-02-16 22:50 ` Andrii Nakryiko
2023-02-18 0:23 ` Kui-Feng Lee