* [PATCH bpf-next 0/5] bpf: Implement cgroup local storage available to non-cgroup-attached bpf progs
@ 2022-10-14 4:56 Yonghong Song
2022-10-14 4:56 ` [PATCH bpf-next 1/5] bpf: Make struct cgroup btf id global Yonghong Song
` (4 more replies)
0 siblings, 5 replies; 38+ messages in thread
From: Yonghong Song @ 2022-10-14 4:56 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
KP Singh, Martin KaFai Lau, Tejun Heo
There already exists a local storage implementation for cgroup-attached
bpf programs; see map type BPF_MAP_TYPE_CGROUP_STORAGE and helper
bpf_get_local_storage(). But there are use cases where non-cgroup-attached
bpf progs want to access cgroup local storage data. For example, a tc egress
prog has access to both the sk and the cgroup. It is possible to emulate
cgroup local storage with sk local storage by storing the data in each socket,
but this is wasteful since many sockets may belong to the same cgroup.
Alternatively, a separate map can be created with the cgroup id as the key,
but this introduces additional overhead to manipulate the new map.
A cgroup local storage, similar to the existing sk/inode/task storage,
serves this use case.
This patch set implements a new cgroup local storage available to
non-cgroup-attached bpf programs. In the patch series, Patch 1
is a preparation patch. Patch 2 implements the new cgroup local storage
kernel support. Patches 3 and 4 implement libbpf and bpftool support.
Patch 5 adds two tests to validate the kernel/libbpf implementations.
Yonghong Song (5):
bpf: Make struct cgroup btf id global
bpf: Implement cgroup storage available to non-cgroup-attached bpf
progs
libbpf: Support new cgroup local storage
bpftool: Support new cgroup local storage
selftests/bpf: Add selftests for cgroup local storage
include/linux/bpf.h | 3 +
include/linux/bpf_types.h | 1 +
include/linux/btf_ids.h | 1 +
include/linux/cgroup-defs.h | 4 +
include/uapi/linux/bpf.h | 39 +++
kernel/bpf/Makefile | 2 +-
kernel/bpf/bpf_cgroup_storage.c | 280 ++++++++++++++++++
kernel/bpf/cgroup_iter.c | 2 +-
kernel/bpf/helpers.c | 6 +
kernel/bpf/syscall.c | 3 +-
kernel/bpf/verifier.c | 14 +-
kernel/cgroup/cgroup.c | 4 +
kernel/trace/bpf_trace.c | 4 +
scripts/bpf_doc.py | 2 +
.../bpf/bpftool/Documentation/bpftool-map.rst | 2 +-
tools/bpf/bpftool/map.c | 2 +-
tools/include/uapi/linux/bpf.h | 39 +++
tools/lib/bpf/libbpf.c | 1 +
tools/lib/bpf/libbpf_probes.c | 1 +
.../bpf/prog_tests/cgroup_local_storage.c | 92 ++++++
.../bpf/progs/cgroup_local_storage.c | 88 ++++++
.../selftests/bpf/progs/cgroup_ls_recursion.c | 70 +++++
22 files changed, 654 insertions(+), 6 deletions(-)
create mode 100644 kernel/bpf/bpf_cgroup_storage.c
create mode 100644 tools/testing/selftests/bpf/prog_tests/cgroup_local_storage.c
create mode 100644 tools/testing/selftests/bpf/progs/cgroup_local_storage.c
create mode 100644 tools/testing/selftests/bpf/progs/cgroup_ls_recursion.c
--
2.30.2
^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH bpf-next 1/5] bpf: Make struct cgroup btf id global
  2022-10-14  4:56 [PATCH bpf-next 0/5] bpf: Implement cgroup local storage available to non-cgroup-attached bpf progs Yonghong Song
@ 2022-10-14  4:56 ` Yonghong Song
  2022-10-14  4:56 ` [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs Yonghong Song
  ` (3 subsequent siblings)
  4 siblings, 0 replies; 38+ messages in thread
From: Yonghong Song @ 2022-10-14  4:56 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	KP Singh, Martin KaFai Lau, Tejun Heo

Make struct cgroup btf id global so later patch can reuse
the same btf id.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 include/linux/btf_ids.h  | 1 +
 kernel/bpf/cgroup_iter.c | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/btf_ids.h b/include/linux/btf_ids.h
index 2aea877d644f..c9744efd202f 100644
--- a/include/linux/btf_ids.h
+++ b/include/linux/btf_ids.h
@@ -265,5 +265,6 @@ MAX_BTF_TRACING_TYPE,
 };

 extern u32 btf_tracing_ids[];
+extern u32 bpf_cgroup_btf_id[];

 #endif
diff --git a/kernel/bpf/cgroup_iter.c b/kernel/bpf/cgroup_iter.c
index 0d200a993489..c6ffc706d583 100644
--- a/kernel/bpf/cgroup_iter.c
+++ b/kernel/bpf/cgroup_iter.c
@@ -157,7 +157,7 @@ static const struct seq_operations cgroup_iter_seq_ops = {
 	.show  = cgroup_iter_seq_show,
 };

-BTF_ID_LIST_SINGLE(bpf_cgroup_btf_id, struct, cgroup)
+BTF_ID_LIST_GLOBAL_SINGLE(bpf_cgroup_btf_id, struct, cgroup)

 static int cgroup_iter_seq_init(void *priv, struct bpf_iter_aux_info *aux)
 {
--
2.30.2

^ permalink raw reply related	[flat|nested] 38+ messages in thread
* [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs
  2022-10-14  4:56 [PATCH bpf-next 0/5] bpf: Implement cgroup local storage available to non-cgroup-attached bpf progs Yonghong Song
  2022-10-14  4:56 ` [PATCH bpf-next 1/5] bpf: Make struct cgroup btf id global Yonghong Song
@ 2022-10-14  4:56 ` Yonghong Song
  2022-10-17 18:01   ` sdf
  2022-10-17 18:16   ` David Vernet
  2022-10-14  4:56 ` [PATCH bpf-next 3/5] libbpf: Support new cgroup local storage Yonghong Song
  ` (2 subsequent siblings)
  4 siblings, 2 replies; 38+ messages in thread
From: Yonghong Song @ 2022-10-14  4:56 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	KP Singh, Martin KaFai Lau, Tejun Heo

Similar to the existing sk/inode/task storage, implement a cgroup
local storage.

There already exists a local storage implementation for cgroup-attached
bpf programs; see map type BPF_MAP_TYPE_CGROUP_STORAGE and helper
bpf_get_local_storage(). But there are use cases where non-cgroup-attached
bpf progs want to access cgroup local storage data. For example, a tc egress
prog has access to both the sk and the cgroup. It is possible to emulate
cgroup local storage with sk local storage by storing the data in each
socket, but this is wasteful since many sockets may belong to the same
cgroup. Alternatively, a separate map can be created with the cgroup id as
the key, but this introduces additional overhead to manipulate the new map.
A cgroup local storage, similar to the existing sk/inode/task storage,
serves this use case.

The life-cycle of the storage is tied to the life-cycle of the cgroup
struct, i.e. the storage is destroyed along with the owning cgroup, via a
callback to bpf_cgroup_storage_free when the cgroup itself is deleted.

The userspace map operations can be done by using a cgroup fd as a key
passed to the lookup, update and delete operations.
Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old cgroup local storage support, the new map name BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE is used for cgroup storage available to non-cgroup-attached bpf programs. The two helpers are named as bpf_cgroup_local_storage_get() and bpf_cgroup_local_storage_delete(). Signed-off-by: Yonghong Song <yhs@fb.com> --- include/linux/bpf.h | 3 + include/linux/bpf_types.h | 1 + include/linux/cgroup-defs.h | 4 + include/uapi/linux/bpf.h | 39 +++++ kernel/bpf/Makefile | 2 +- kernel/bpf/bpf_cgroup_storage.c | 280 ++++++++++++++++++++++++++++++++ kernel/bpf/helpers.c | 6 + kernel/bpf/syscall.c | 3 +- kernel/bpf/verifier.c | 14 +- kernel/cgroup/cgroup.c | 4 + kernel/trace/bpf_trace.c | 4 + scripts/bpf_doc.py | 2 + tools/include/uapi/linux/bpf.h | 39 +++++ 13 files changed, 398 insertions(+), 3 deletions(-) create mode 100644 kernel/bpf/bpf_cgroup_storage.c diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 9e7d46d16032..1395a01c7f18 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -2045,6 +2045,7 @@ struct bpf_link *bpf_link_by_id(u32 id); const struct bpf_func_proto *bpf_base_func_proto(enum bpf_func_id func_id); void bpf_task_storage_free(struct task_struct *task); +void bpf_local_cgroup_storage_free(struct cgroup *cgroup); bool bpf_prog_has_kfunc_call(const struct bpf_prog *prog); const struct btf_func_model * bpf_jit_find_kfunc_model(const struct bpf_prog *prog, @@ -2537,6 +2538,8 @@ extern const struct bpf_func_proto bpf_copy_from_user_task_proto; extern const struct bpf_func_proto bpf_set_retval_proto; extern const struct bpf_func_proto bpf_get_retval_proto; extern const struct bpf_func_proto bpf_user_ringbuf_drain_proto; +extern const struct bpf_func_proto bpf_cgroup_storage_get_proto; +extern const struct bpf_func_proto bpf_cgroup_storage_delete_proto; const struct bpf_func_proto *tracing_prog_func_proto( enum bpf_func_id func_id, const struct bpf_prog *prog); diff --git a/include/linux/bpf_types.h 
b/include/linux/bpf_types.h index 2c6a4f2562a7..7a0362d7a0aa 100644 --- a/include/linux/bpf_types.h +++ b/include/linux/bpf_types.h @@ -90,6 +90,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_ARRAY, cgroup_array_map_ops) #ifdef CONFIG_CGROUP_BPF BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_STORAGE, cgroup_storage_map_ops) BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE, cgroup_storage_map_ops) +BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, cgroup_local_storage_map_ops) #endif BPF_MAP_TYPE(BPF_MAP_TYPE_HASH, htab_map_ops) BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_HASH, htab_percpu_map_ops) diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index 4bcf56b3491c..c6f4590dda68 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -504,6 +504,10 @@ struct cgroup { /* Used to store internal freezer state */ struct cgroup_freezer_state freezer; +#ifdef CONFIG_BPF_SYSCALL + struct bpf_local_storage __rcu *bpf_cgroup_storage; +#endif + /* ids of the ancestors at each level including self */ u64 ancestor_ids[]; }; diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 17f61338f8f8..d918b4054297 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -935,6 +935,7 @@ enum bpf_map_type { BPF_MAP_TYPE_TASK_STORAGE, BPF_MAP_TYPE_BLOOM_FILTER, BPF_MAP_TYPE_USER_RINGBUF, + BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, }; /* Note that tracing related programs such as @@ -5435,6 +5436,42 @@ union bpf_attr { * **-E2BIG** if user-space has tried to publish a sample which is * larger than the size of the ring buffer, or which cannot fit * within a struct bpf_dynptr. + * + * void *bpf_cgroup_local_storage_get(struct bpf_map *map, struct cgroup *cgroup, void *value, u64 flags) + * Description + * Get a bpf_local_storage from the *cgroup*. + * + * Logically, it could be thought of as getting the value from + * a *map* with *cgroup* as the **key**. 
From this + * perspective, the usage is not much different from + * **bpf_map_lookup_elem**\ (*map*, **&**\ *cgroup*) except this + * helper enforces the key must be a cgroup struct and the map must also + * be a **BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE**. + * + * Underneath, the value is stored locally at *cgroup* instead of + * the *map*. The *map* is used as the bpf-local-storage + * "type". The bpf-local-storage "type" (i.e. the *map*) is + * searched against all bpf_local_storage residing at *cgroup*. + * + * An optional *flags* (**BPF_LOCAL_STORAGE_GET_F_CREATE**) can be + * used such that a new bpf_local_storage will be + * created if one does not exist. *value* can be used + * together with **BPF_LOCAL_STORAGE_GET_F_CREATE** to specify + * the initial value of a bpf_local_storage. If *value* is + * **NULL**, the new bpf_local_storage will be zero initialized. + * Return + * A bpf_local_storage pointer is returned on success. + * + * **NULL** if not found or there was an error in adding + * a new bpf_local_storage. + * + * long bpf_cgroup_local_storage_delete(struct bpf_map *map, struct cgroup *cgroup) + * Description + * Delete a bpf_local_storage from a *cgroup*. + * Return + * 0 on success. + * + * **-ENOENT** if the bpf_local_storage cannot be found. */ #define ___BPF_FUNC_MAPPER(FN, ctx...) 
\ FN(unspec, 0, ##ctx) \ @@ -5647,6 +5684,8 @@ union bpf_attr { FN(tcp_raw_check_syncookie_ipv6, 207, ##ctx) \ FN(ktime_get_tai_ns, 208, ##ctx) \ FN(user_ringbuf_drain, 209, ##ctx) \ + FN(cgroup_local_storage_get, 210, ##ctx) \ + FN(cgroup_local_storage_delete, 211, ##ctx) \ /* */ /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that don't diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile index 341c94f208f4..b02693f51978 100644 --- a/kernel/bpf/Makefile +++ b/kernel/bpf/Makefile @@ -25,7 +25,7 @@ ifeq ($(CONFIG_PERF_EVENTS),y) obj-$(CONFIG_BPF_SYSCALL) += stackmap.o endif ifeq ($(CONFIG_CGROUPS),y) -obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o +obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o bpf_cgroup_storage.o endif obj-$(CONFIG_CGROUP_BPF) += cgroup.o ifeq ($(CONFIG_INET),y) diff --git a/kernel/bpf/bpf_cgroup_storage.c b/kernel/bpf/bpf_cgroup_storage.c new file mode 100644 index 000000000000..9974784822da --- /dev/null +++ b/kernel/bpf/bpf_cgroup_storage.c @@ -0,0 +1,280 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (c) 2022 Meta Platforms, Inc. and affiliates. 
+ */ + +#include <linux/types.h> +#include <linux/bpf.h> +#include <linux/bpf_local_storage.h> +#include <uapi/linux/btf.h> +#include <linux/btf_ids.h> + +DEFINE_BPF_STORAGE_CACHE(cgroup_cache); + +static DEFINE_PER_CPU(int, bpf_cgroup_storage_busy); + +static void bpf_cgroup_storage_lock(void) +{ + migrate_disable(); + this_cpu_inc(bpf_cgroup_storage_busy); +} + +static void bpf_cgroup_storage_unlock(void) +{ + this_cpu_dec(bpf_cgroup_storage_busy); + migrate_enable(); +} + +static bool bpf_cgroup_storage_trylock(void) +{ + migrate_disable(); + if (unlikely(this_cpu_inc_return(bpf_cgroup_storage_busy) != 1)) { + this_cpu_dec(bpf_cgroup_storage_busy); + migrate_enable(); + return false; + } + return true; +} + +static struct bpf_local_storage __rcu **cgroup_storage_ptr(void *owner) +{ + struct cgroup *cg = owner; + + return &cg->bpf_cgroup_storage; +} + +void bpf_local_cgroup_storage_free(struct cgroup *cgroup) +{ + struct bpf_local_storage *local_storage; + struct bpf_local_storage_elem *selem; + bool free_cgroup_storage = false; + struct hlist_node *n; + unsigned long flags; + + rcu_read_lock(); + local_storage = rcu_dereference(cgroup->bpf_cgroup_storage); + if (!local_storage) { + rcu_read_unlock(); + return; + } + + /* Neither the bpf_prog nor the bpf-map's syscall + * could be modifying the local_storage->list now. + * Thus, no elem can be added-to or deleted-from the + * local_storage->list by the bpf_prog or by the bpf-map's syscall. + * + * It is racing with bpf_local_storage_map_free() alone + * when unlinking elem from the local_storage->list and + * the map's bucket->list. 
+ */ + bpf_cgroup_storage_lock(); + raw_spin_lock_irqsave(&local_storage->lock, flags); + hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) { + bpf_selem_unlink_map(selem); + free_cgroup_storage = + bpf_selem_unlink_storage_nolock(local_storage, selem, false, false); + } + raw_spin_unlock_irqrestore(&local_storage->lock, flags); + bpf_cgroup_storage_unlock(); + rcu_read_unlock(); + + /* free_cgroup_storage should always be true as long as + * local_storage->list was non-empty. + */ + if (free_cgroup_storage) + kfree_rcu(local_storage, rcu); +} + +static struct bpf_local_storage_data * +cgroup_storage_lookup(struct cgroup *cgroup, struct bpf_map *map, bool cacheit_lockit) +{ + struct bpf_local_storage *cgroup_storage; + struct bpf_local_storage_map *smap; + + cgroup_storage = rcu_dereference_check(cgroup->bpf_cgroup_storage, + bpf_rcu_lock_held()); + if (!cgroup_storage) + return NULL; + + smap = (struct bpf_local_storage_map *)map; + return bpf_local_storage_lookup(cgroup_storage, smap, cacheit_lockit); +} + +static void *bpf_cgroup_storage_lookup_elem(struct bpf_map *map, void *key) +{ + struct bpf_local_storage_data *sdata; + struct cgroup *cgroup; + int fd; + + fd = *(int *)key; + cgroup = cgroup_get_from_fd(fd); + if (IS_ERR(cgroup)) + return ERR_CAST(cgroup); + + bpf_cgroup_storage_lock(); + sdata = cgroup_storage_lookup(cgroup, map, true); + bpf_cgroup_storage_unlock(); + cgroup_put(cgroup); + return sdata ? 
sdata->data : NULL; +} + +static int bpf_cgroup_storage_update_elem(struct bpf_map *map, void *key, + void *value, u64 map_flags) +{ + struct bpf_local_storage_data *sdata; + struct cgroup *cgroup; + int err, fd; + + fd = *(int *)key; + cgroup = cgroup_get_from_fd(fd); + if (IS_ERR(cgroup)) + return PTR_ERR(cgroup); + + bpf_cgroup_storage_lock(); + sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map *)map, + value, map_flags, GFP_ATOMIC); + bpf_cgroup_storage_unlock(); + err = PTR_ERR_OR_ZERO(sdata); + cgroup_put(cgroup); + return err; +} + +static int cgroup_storage_delete(struct cgroup *cgroup, struct bpf_map *map) +{ + struct bpf_local_storage_data *sdata; + + sdata = cgroup_storage_lookup(cgroup, map, false); + if (!sdata) + return -ENOENT; + + bpf_selem_unlink(SELEM(sdata), true); + return 0; +} + +static int bpf_cgroup_storage_delete_elem(struct bpf_map *map, void *key) +{ + struct cgroup *cgroup; + int err, fd; + + fd = *(int *)key; + cgroup = cgroup_get_from_fd(fd); + if (IS_ERR(cgroup)) + return PTR_ERR(cgroup); + + bpf_cgroup_storage_lock(); + err = cgroup_storage_delete(cgroup, map); + bpf_cgroup_storage_unlock(); + if (err) + return err; + + cgroup_put(cgroup); + return 0; +} + +static int notsupp_get_next_key(struct bpf_map *map, void *key, void *next_key) +{ + return -ENOTSUPP; +} + +static struct bpf_map *cgroup_storage_map_alloc(union bpf_attr *attr) +{ + struct bpf_local_storage_map *smap; + + smap = bpf_local_storage_map_alloc(attr); + if (IS_ERR(smap)) + return ERR_CAST(smap); + + smap->cache_idx = bpf_local_storage_cache_idx_get(&cgroup_cache); + return &smap->map; +} + +static void cgroup_storage_map_free(struct bpf_map *map) +{ + struct bpf_local_storage_map *smap; + + smap = (struct bpf_local_storage_map *)map; + bpf_local_storage_cache_idx_free(&cgroup_cache, smap->cache_idx); + bpf_local_storage_map_free(smap, NULL); +} + +/* *gfp_flags* is a hidden argument provided by the verifier */ +BPF_CALL_5(bpf_cgroup_storage_get, 
struct bpf_map *, map, struct cgroup *, cgroup, + void *, value, u64, flags, gfp_t, gfp_flags) +{ + struct bpf_local_storage_data *sdata; + + WARN_ON_ONCE(!bpf_rcu_lock_held()); + if (flags & ~(BPF_LOCAL_STORAGE_GET_F_CREATE)) + return (unsigned long)NULL; + + if (!cgroup) + return (unsigned long)NULL; + + if (!bpf_cgroup_storage_trylock()) + return (unsigned long)NULL; + + sdata = cgroup_storage_lookup(cgroup, map, true); + if (sdata) + goto unlock; + + /* only allocate new storage, when the cgroup is refcounted */ + if (!percpu_ref_is_dying(&cgroup->self.refcnt) && + (flags & BPF_LOCAL_STORAGE_GET_F_CREATE)) + sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map *)map, + value, BPF_NOEXIST, gfp_flags); + +unlock: + bpf_cgroup_storage_unlock(); + return IS_ERR_OR_NULL(sdata) ? (unsigned long)NULL : (unsigned long)sdata->data; +} + +BPF_CALL_2(bpf_cgroup_storage_delete, struct bpf_map *, map, struct cgroup *, cgroup) +{ + int ret; + + WARN_ON_ONCE(!bpf_rcu_lock_held()); + if (!cgroup) + return -EINVAL; + + if (!bpf_cgroup_storage_trylock()) + return -EBUSY; + + ret = cgroup_storage_delete(cgroup, map); + bpf_cgroup_storage_unlock(); + return ret; +} + +BTF_ID_LIST_SINGLE(cgroup_storage_map_btf_ids, struct, bpf_local_storage_map) +const struct bpf_map_ops cgroup_local_storage_map_ops = { + .map_meta_equal = bpf_map_meta_equal, + .map_alloc_check = bpf_local_storage_map_alloc_check, + .map_alloc = cgroup_storage_map_alloc, + .map_free = cgroup_storage_map_free, + .map_get_next_key = notsupp_get_next_key, + .map_lookup_elem = bpf_cgroup_storage_lookup_elem, + .map_update_elem = bpf_cgroup_storage_update_elem, + .map_delete_elem = bpf_cgroup_storage_delete_elem, + .map_check_btf = bpf_local_storage_map_check_btf, + .map_btf_id = &cgroup_storage_map_btf_ids[0], + .map_owner_storage_ptr = cgroup_storage_ptr, +}; + +const struct bpf_func_proto bpf_cgroup_storage_get_proto = { + .func = bpf_cgroup_storage_get, + .gpl_only = false, + .ret_type = 
RET_PTR_TO_MAP_VALUE_OR_NULL, + .arg1_type = ARG_CONST_MAP_PTR, + .arg2_type = ARG_PTR_TO_BTF_ID, + .arg2_btf_id = &bpf_cgroup_btf_id[0], + .arg3_type = ARG_PTR_TO_MAP_VALUE_OR_NULL, + .arg4_type = ARG_ANYTHING, +}; + +const struct bpf_func_proto bpf_cgroup_storage_delete_proto = { + .func = bpf_cgroup_storage_delete, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_CONST_MAP_PTR, + .arg2_type = ARG_PTR_TO_BTF_ID, + .arg2_btf_id = &bpf_cgroup_btf_id[0], +}; diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index a6b04faed282..5c5bb08832ec 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -1663,6 +1663,12 @@ bpf_base_func_proto(enum bpf_func_id func_id) return &bpf_dynptr_write_proto; case BPF_FUNC_dynptr_data: return &bpf_dynptr_data_proto; +#ifdef CONFIG_CGROUPS + case BPF_FUNC_cgroup_local_storage_get: + return &bpf_cgroup_storage_get_proto; + case BPF_FUNC_cgroup_local_storage_delete: + return &bpf_cgroup_storage_delete_proto; +#endif default: break; } diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 7b373a5e861f..e53c7fae6e22 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -1016,7 +1016,8 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf, map->map_type != BPF_MAP_TYPE_CGROUP_STORAGE && map->map_type != BPF_MAP_TYPE_SK_STORAGE && map->map_type != BPF_MAP_TYPE_INODE_STORAGE && - map->map_type != BPF_MAP_TYPE_TASK_STORAGE) + map->map_type != BPF_MAP_TYPE_TASK_STORAGE && + map->map_type != BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE) return -ENOTSUPP; if (map->spin_lock_off + sizeof(struct bpf_spin_lock) > map->value_size) { diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 6f6d2d511c06..f36f6a3c0d50 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -6360,6 +6360,11 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env, func_id != BPF_FUNC_task_storage_delete) goto error; break; + case BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE: + if (func_id != 
BPF_FUNC_cgroup_local_storage_get && + func_id != BPF_FUNC_cgroup_local_storage_delete) + goto error; + break; case BPF_MAP_TYPE_BLOOM_FILTER: if (func_id != BPF_FUNC_map_peek_elem && func_id != BPF_FUNC_map_push_elem) @@ -6472,6 +6477,11 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env, if (map->map_type != BPF_MAP_TYPE_TASK_STORAGE) goto error; break; + case BPF_FUNC_cgroup_local_storage_get: + case BPF_FUNC_cgroup_local_storage_delete: + if (map->map_type != BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE) + goto error; + break; default: break; } @@ -12713,6 +12723,7 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env, case BPF_MAP_TYPE_INODE_STORAGE: case BPF_MAP_TYPE_SK_STORAGE: case BPF_MAP_TYPE_TASK_STORAGE: + case BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE: break; default: verbose(env, @@ -14149,7 +14160,8 @@ static int do_misc_fixups(struct bpf_verifier_env *env) if (insn->imm == BPF_FUNC_task_storage_get || insn->imm == BPF_FUNC_sk_storage_get || - insn->imm == BPF_FUNC_inode_storage_get) { + insn->imm == BPF_FUNC_inode_storage_get || + insn->imm == BPF_FUNC_cgroup_local_storage_get) { if (env->prog->aux->sleepable) insn_buf[0] = BPF_MOV64_IMM(BPF_REG_5, (__force __s32)GFP_KERNEL); else diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 8ad2c267ff47..2fa2c950c7fb 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -985,6 +985,10 @@ void put_css_set_locked(struct css_set *cset) put_css_set_locked(cset->dom_cset); } +#ifdef CONFIG_BPF_SYSCALL + bpf_local_cgroup_storage_free(cset->dfl_cgrp); +#endif + kfree_rcu(cset, rcu_head); } diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index 688552df95ca..179adaae4a9f 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -1454,6 +1454,10 @@ bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return &bpf_get_current_cgroup_id_proto; case BPF_FUNC_get_current_ancestor_cgroup_id: return 
&bpf_get_current_ancestor_cgroup_id_proto; + case BPF_FUNC_cgroup_local_storage_get: + return &bpf_cgroup_storage_get_proto; + case BPF_FUNC_cgroup_local_storage_delete: + return &bpf_cgroup_storage_delete_proto; #endif case BPF_FUNC_send_signal: return &bpf_send_signal_proto; diff --git a/scripts/bpf_doc.py b/scripts/bpf_doc.py index c0e6690be82a..fdb0aff8cb5a 100755 --- a/scripts/bpf_doc.py +++ b/scripts/bpf_doc.py @@ -685,6 +685,7 @@ class PrinterHelpers(Printer): 'struct udp6_sock', 'struct unix_sock', 'struct task_struct', + 'struct cgroup', 'struct __sk_buff', 'struct sk_msg_md', @@ -742,6 +743,7 @@ class PrinterHelpers(Printer): 'struct udp6_sock', 'struct unix_sock', 'struct task_struct', + 'struct cgroup', 'struct path', 'struct btf_ptr', 'struct inode', diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 17f61338f8f8..d918b4054297 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -935,6 +935,7 @@ enum bpf_map_type { BPF_MAP_TYPE_TASK_STORAGE, BPF_MAP_TYPE_BLOOM_FILTER, BPF_MAP_TYPE_USER_RINGBUF, + BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, }; /* Note that tracing related programs such as @@ -5435,6 +5436,42 @@ union bpf_attr { * **-E2BIG** if user-space has tried to publish a sample which is * larger than the size of the ring buffer, or which cannot fit * within a struct bpf_dynptr. + * + * void *bpf_cgroup_local_storage_get(struct bpf_map *map, struct cgroup *cgroup, void *value, u64 flags) + * Description + * Get a bpf_local_storage from the *cgroup*. + * + * Logically, it could be thought of as getting the value from + * a *map* with *cgroup* as the **key**. From this + * perspective, the usage is not much different from + * **bpf_map_lookup_elem**\ (*map*, **&**\ *cgroup*) except this + * helper enforces the key must be a cgroup struct and the map must also + * be a **BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE**. + * + * Underneath, the value is stored locally at *cgroup* instead of + * the *map*. 
The *map* is used as the bpf-local-storage + * "type". The bpf-local-storage "type" (i.e. the *map*) is + * searched against all bpf_local_storage residing at *cgroup*. + * + * An optional *flags* (**BPF_LOCAL_STORAGE_GET_F_CREATE**) can be + * used such that a new bpf_local_storage will be + * created if one does not exist. *value* can be used + * together with **BPF_LOCAL_STORAGE_GET_F_CREATE** to specify + * the initial value of a bpf_local_storage. If *value* is + * **NULL**, the new bpf_local_storage will be zero initialized. + * Return + * A bpf_local_storage pointer is returned on success. + * + * **NULL** if not found or there was an error in adding + * a new bpf_local_storage. + * + * long bpf_cgroup_local_storage_delete(struct bpf_map *map, struct cgroup *cgroup) + * Description + * Delete a bpf_local_storage from a *cgroup*. + * Return + * 0 on success. + * + * **-ENOENT** if the bpf_local_storage cannot be found. */ #define ___BPF_FUNC_MAPPER(FN, ctx...) \ FN(unspec, 0, ##ctx) \ @@ -5647,6 +5684,8 @@ union bpf_attr { FN(tcp_raw_check_syncookie_ipv6, 207, ##ctx) \ FN(ktime_get_tai_ns, 208, ##ctx) \ FN(user_ringbuf_drain, 209, ##ctx) \ + FN(cgroup_local_storage_get, 210, ##ctx) \ + FN(cgroup_local_storage_delete, 211, ##ctx) \ /* */ /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that don't -- 2.30.2 ^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs
  2022-10-14  4:56 ` [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs Yonghong Song
@ 2022-10-17 18:01   ` sdf
  2022-10-17 18:25     ` Yosry Ahmed
  ` (2 more replies)
  2022-10-17 18:16   ` David Vernet
  1 sibling, 3 replies; 38+ messages in thread
From: sdf @ 2022-10-17 18:01 UTC (permalink / raw)
  To: Yonghong Song
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo

On 10/13, Yonghong Song wrote:
> Similar to sk/inode/task storage, implement similar cgroup local storage.
> There already exists a local storage implementation for cgroup-attached
> bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper
> bpf_get_local_storage(). But there are use cases such that non-cgroup
> attached bpf progs wants to access cgroup local storage data. For example,
> tc egress prog has access to sk and cgroup. It is possible to use
> sk local storage to emulate cgroup local storage by storing data in
> socket.
> But this is a waste as it could be lots of sockets belonging to a
> particular
> cgroup. Alternatively, a separate map can be created with cgroup id as
> the key.
> But this will introduce additional overhead to manipulate the new map.
> A cgroup local storage, similar to existing sk/inode/task storage,
> should help for this use case.
> The life-cycle of storage is managed with the life-cycle of the
> cgroup struct. i.e. the storage is destroyed along with the owning cgroup
> with a callback to the bpf_cgroup_storage_free when cgroup itself
> is deleted.
> The userspace map operations can be done by using a cgroup fd as a key
> passed to the lookup, update and delete operations.

[..]
> Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old cgroup > local > storage support, the new map name BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE is > used > for cgroup storage available to non-cgroup-attached bpf programs. The two > helpers are named as bpf_cgroup_local_storage_get() and > bpf_cgroup_local_storage_delete(). Have you considered doing something similar to 7d9c3427894f ("bpf: Make cgroup storages shared between programs on the same cgroup") where the map changes its behavior depending on the key size (see key_size checks in cgroup_storage_map_alloc)? Looks like sizeof(int) for fd still can be used so we can, in theory, reuse the name.. Pros: - no need for a new map name Cons: - existing BPF_MAP_TYPE_CGROUP_STORAGE is already messy; might be not a good idea to add more stuff to it? But, for the very least, should we also extend Documentation/bpf/map_cgroup_storage.rst to cover the new map? We've tried to keep some of the important details in there.. > Signed-off-by: Yonghong Song <yhs@fb.com> > --- > include/linux/bpf.h | 3 + > include/linux/bpf_types.h | 1 + > include/linux/cgroup-defs.h | 4 + > include/uapi/linux/bpf.h | 39 +++++ > kernel/bpf/Makefile | 2 +- > kernel/bpf/bpf_cgroup_storage.c | 280 ++++++++++++++++++++++++++++++++ > kernel/bpf/helpers.c | 6 + > kernel/bpf/syscall.c | 3 +- > kernel/bpf/verifier.c | 14 +- > kernel/cgroup/cgroup.c | 4 + > kernel/trace/bpf_trace.c | 4 + > scripts/bpf_doc.py | 2 + > tools/include/uapi/linux/bpf.h | 39 +++++ > 13 files changed, 398 insertions(+), 3 deletions(-) > create mode 100644 kernel/bpf/bpf_cgroup_storage.c > diff --git a/include/linux/bpf.h b/include/linux/bpf.h > index 9e7d46d16032..1395a01c7f18 100644 > --- a/include/linux/bpf.h > +++ b/include/linux/bpf.h > @@ -2045,6 +2045,7 @@ struct bpf_link *bpf_link_by_id(u32 id); > const struct bpf_func_proto *bpf_base_func_proto(enum bpf_func_id > func_id); > void bpf_task_storage_free(struct task_struct *task); > +void 
bpf_local_cgroup_storage_free(struct cgroup *cgroup); > bool bpf_prog_has_kfunc_call(const struct bpf_prog *prog); > const struct btf_func_model * > bpf_jit_find_kfunc_model(const struct bpf_prog *prog, > @@ -2537,6 +2538,8 @@ extern const struct bpf_func_proto > bpf_copy_from_user_task_proto; > extern const struct bpf_func_proto bpf_set_retval_proto; > extern const struct bpf_func_proto bpf_get_retval_proto; > extern const struct bpf_func_proto bpf_user_ringbuf_drain_proto; > +extern const struct bpf_func_proto bpf_cgroup_storage_get_proto; > +extern const struct bpf_func_proto bpf_cgroup_storage_delete_proto; > const struct bpf_func_proto *tracing_prog_func_proto( > enum bpf_func_id func_id, const struct bpf_prog *prog); > diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h > index 2c6a4f2562a7..7a0362d7a0aa 100644 > --- a/include/linux/bpf_types.h > +++ b/include/linux/bpf_types.h > @@ -90,6 +90,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_ARRAY, > cgroup_array_map_ops) > #ifdef CONFIG_CGROUP_BPF > BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_STORAGE, cgroup_storage_map_ops) > BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE, cgroup_storage_map_ops) > +BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, > cgroup_local_storage_map_ops) > #endif > BPF_MAP_TYPE(BPF_MAP_TYPE_HASH, htab_map_ops) > BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_HASH, htab_percpu_map_ops) > diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h > index 4bcf56b3491c..c6f4590dda68 100644 > --- a/include/linux/cgroup-defs.h > +++ b/include/linux/cgroup-defs.h > @@ -504,6 +504,10 @@ struct cgroup { > /* Used to store internal freezer state */ > struct cgroup_freezer_state freezer; > +#ifdef CONFIG_BPF_SYSCALL > + struct bpf_local_storage __rcu *bpf_cgroup_storage; > +#endif > + > /* ids of the ancestors at each level including self */ > u64 ancestor_ids[]; > }; > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > index 17f61338f8f8..d918b4054297 100644 > --- 
a/include/uapi/linux/bpf.h > +++ b/include/uapi/linux/bpf.h > @@ -935,6 +935,7 @@ enum bpf_map_type { > BPF_MAP_TYPE_TASK_STORAGE, > BPF_MAP_TYPE_BLOOM_FILTER, > BPF_MAP_TYPE_USER_RINGBUF, > + BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, > }; > /* Note that tracing related programs such as > @@ -5435,6 +5436,42 @@ union bpf_attr { > * **-E2BIG** if user-space has tried to publish a sample which is > * larger than the size of the ring buffer, or which cannot fit > * within a struct bpf_dynptr. > + * > + * void *bpf_cgroup_local_storage_get(struct bpf_map *map, struct cgroup > *cgroup, void *value, u64 flags) > + * Description > + * Get a bpf_local_storage from the *cgroup*. > + * > + * Logically, it could be thought of as getting the value from > + * a *map* with *cgroup* as the **key**. From this > + * perspective, the usage is not much different from > + * **bpf_map_lookup_elem**\ (*map*, **&**\ *cgroup*) except this > + * helper enforces the key must be a cgroup struct and the map must also > + * be a **BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE**. > + * > + * Underneath, the value is stored locally at *cgroup* instead of > + * the *map*. The *map* is used as the bpf-local-storage > + * "type". The bpf-local-storage "type" (i.e. the *map*) is > + * searched against all bpf_local_storage residing at *cgroup*. > + * > + * An optional *flags* (**BPF_LOCAL_STORAGE_GET_F_CREATE**) can be > + * used such that a new bpf_local_storage will be > + * created if one does not exist. *value* can be used > + * together with **BPF_LOCAL_STORAGE_GET_F_CREATE** to specify > + * the initial value of a bpf_local_storage. If *value* is > + * **NULL**, the new bpf_local_storage will be zero initialized. > + * Return > + * A bpf_local_storage pointer is returned on success. > + * > + * **NULL** if not found or there was an error in adding > + * a new bpf_local_storage. 
> + * > + * long bpf_cgroup_local_storage_delete(struct bpf_map *map, struct > cgroup *cgroup) > + * Description > + * Delete a bpf_local_storage from a *cgroup*. > + * Return > + * 0 on success. > + * > + * **-ENOENT** if the bpf_local_storage cannot be found. > */ > #define ___BPF_FUNC_MAPPER(FN, ctx...) \ > FN(unspec, 0, ##ctx) \ > @@ -5647,6 +5684,8 @@ union bpf_attr { > FN(tcp_raw_check_syncookie_ipv6, 207, ##ctx) \ > FN(ktime_get_tai_ns, 208, ##ctx) \ > FN(user_ringbuf_drain, 209, ##ctx) \ > + FN(cgroup_local_storage_get, 210, ##ctx) \ > + FN(cgroup_local_storage_delete, 211, ##ctx) \ > /* */ > /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that > don't > diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile > index 341c94f208f4..b02693f51978 100644 > --- a/kernel/bpf/Makefile > +++ b/kernel/bpf/Makefile > @@ -25,7 +25,7 @@ ifeq ($(CONFIG_PERF_EVENTS),y) > obj-$(CONFIG_BPF_SYSCALL) += stackmap.o > endif > ifeq ($(CONFIG_CGROUPS),y) > -obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o > +obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o bpf_cgroup_storage.o > endif > obj-$(CONFIG_CGROUP_BPF) += cgroup.o > ifeq ($(CONFIG_INET),y) > diff --git a/kernel/bpf/bpf_cgroup_storage.c > b/kernel/bpf/bpf_cgroup_storage.c > new file mode 100644 > index 000000000000..9974784822da > --- /dev/null > +++ b/kernel/bpf/bpf_cgroup_storage.c > @@ -0,0 +1,280 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/* > + * Copyright (c) 2022 Meta Platforms, Inc. and affiliates. 
> + */ > + > +#include <linux/types.h> > +#include <linux/bpf.h> > +#include <linux/bpf_local_storage.h> > +#include <uapi/linux/btf.h> > +#include <linux/btf_ids.h> > + > +DEFINE_BPF_STORAGE_CACHE(cgroup_cache); > + > +static DEFINE_PER_CPU(int, bpf_cgroup_storage_busy); > + > +static void bpf_cgroup_storage_lock(void) > +{ > + migrate_disable(); > + this_cpu_inc(bpf_cgroup_storage_busy); > +} > + > +static void bpf_cgroup_storage_unlock(void) > +{ > + this_cpu_dec(bpf_cgroup_storage_busy); > + migrate_enable(); > +} > + > +static bool bpf_cgroup_storage_trylock(void) > +{ > + migrate_disable(); > + if (unlikely(this_cpu_inc_return(bpf_cgroup_storage_busy) != 1)) { > + this_cpu_dec(bpf_cgroup_storage_busy); > + migrate_enable(); > + return false; > + } > + return true; > +} Task storage has lock/unlock/trylock; inode storage doesn't; why does cgroup need it as well? > +static struct bpf_local_storage __rcu **cgroup_storage_ptr(void *owner) > +{ > + struct cgroup *cg = owner; > + > + return &cg->bpf_cgroup_storage; > +} > + > +void bpf_local_cgroup_storage_free(struct cgroup *cgroup) > +{ > + struct bpf_local_storage *local_storage; > + struct bpf_local_storage_elem *selem; > + bool free_cgroup_storage = false; > + struct hlist_node *n; > + unsigned long flags; > + > + rcu_read_lock(); > + local_storage = rcu_dereference(cgroup->bpf_cgroup_storage); > + if (!local_storage) { > + rcu_read_unlock(); > + return; > + } > + > + /* Neither the bpf_prog nor the bpf-map's syscall > + * could be modifying the local_storage->list now. > + * Thus, no elem can be added-to or deleted-from the > + * local_storage->list by the bpf_prog or by the bpf-map's syscall. > + * > + * It is racing with bpf_local_storage_map_free() alone > + * when unlinking elem from the local_storage->list and > + * the map's bucket->list. 
> + */ > + bpf_cgroup_storage_lock(); > + raw_spin_lock_irqsave(&local_storage->lock, flags); > + hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) { > + bpf_selem_unlink_map(selem); > + free_cgroup_storage = > + bpf_selem_unlink_storage_nolock(local_storage, selem, false, false); > + } > + raw_spin_unlock_irqrestore(&local_storage->lock, flags); > + bpf_cgroup_storage_unlock(); > + rcu_read_unlock(); > + > + /* free_cgroup_storage should always be true as long as > + * local_storage->list was non-empty. > + */ > + if (free_cgroup_storage) > + kfree_rcu(local_storage, rcu); > +} > +static struct bpf_local_storage_data * > +cgroup_storage_lookup(struct cgroup *cgroup, struct bpf_map *map, bool > cacheit_lockit) > +{ > + struct bpf_local_storage *cgroup_storage; > + struct bpf_local_storage_map *smap; > + > + cgroup_storage = rcu_dereference_check(cgroup->bpf_cgroup_storage, > + bpf_rcu_lock_held()); > + if (!cgroup_storage) > + return NULL; > + > + smap = (struct bpf_local_storage_map *)map; > + return bpf_local_storage_lookup(cgroup_storage, smap, cacheit_lockit); > +} > + > +static void *bpf_cgroup_storage_lookup_elem(struct bpf_map *map, void > *key) > +{ > + struct bpf_local_storage_data *sdata; > + struct cgroup *cgroup; > + int fd; > + > + fd = *(int *)key; > + cgroup = cgroup_get_from_fd(fd); > + if (IS_ERR(cgroup)) > + return ERR_CAST(cgroup); > + > + bpf_cgroup_storage_lock(); > + sdata = cgroup_storage_lookup(cgroup, map, true); > + bpf_cgroup_storage_unlock(); > + cgroup_put(cgroup); > + return sdata ? sdata->data : NULL; > +} A lot of the above (free/lookup) seems to be copy-pasted from the task storage; any point in trying to generalize the common parts? 
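The duplication question above can be made concrete. In standalone C with stand-in types (none of these names exist in the kernel; this is only a sketch of the shape such a refactor could take), the per-map-type syscall handlers differ mainly in how the key resolves to an owner object, so that step could be passed in as a callback while the rest of the path is shared:

```c
#include <stddef.h>

/* Stand-in for the owner objects (task_struct, cgroup) and their
 * embedded local storage. Purely illustrative; not kernel API.
 */
struct owner {
	int storage;
};

static struct owner task_owner = { .storage = 100 };
static struct owner cgroup_owner = { .storage = 200 };

/* Per-map-type key resolution, mirroring the pid lookup in task
 * storage and cgroup_get_from_fd() in the proposed cgroup storage.
 */
static struct owner *owner_from_pid(const void *key)
{
	(void)key; /* a real resolver would look the pid up */
	return &task_owner;
}

static struct owner *owner_from_cgroup_fd(const void *key)
{
	(void)key; /* a real resolver would resolve the cgroup fd */
	return &cgroup_owner;
}

/* The shared path: everything after key resolution is identical,
 * so only the resolver callback differs per map type.
 */
static int *storage_lookup_elem(const void *key,
				struct owner *(*resolve)(const void *))
{
	struct owner *o = resolve(key);

	return o ? &o->storage : NULL;
}
```

With this shape, the task-storage and cgroup-storage lookup_elem handlers would become thin wrappers that pass their own resolver (and matching ref-put) to one shared routine.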
> +static int bpf_cgroup_storage_update_elem(struct bpf_map *map, void *key, > + void *value, u64 map_flags) > +{ > + struct bpf_local_storage_data *sdata; > + struct cgroup *cgroup; > + int err, fd; > + > + fd = *(int *)key; > + cgroup = cgroup_get_from_fd(fd); > + if (IS_ERR(cgroup)) > + return PTR_ERR(cgroup); > + > + bpf_cgroup_storage_lock(); > + sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map > *)map, > + value, map_flags, GFP_ATOMIC); > + bpf_cgroup_storage_unlock(); > + err = PTR_ERR_OR_ZERO(sdata); > + cgroup_put(cgroup); > + return err; > +} > + > +static int cgroup_storage_delete(struct cgroup *cgroup, struct bpf_map > *map) > +{ > + struct bpf_local_storage_data *sdata; > + > + sdata = cgroup_storage_lookup(cgroup, map, false); > + if (!sdata) > + return -ENOENT; > + > + bpf_selem_unlink(SELEM(sdata), true); > + return 0; > +} > + > +static int bpf_cgroup_storage_delete_elem(struct bpf_map *map, void *key) > +{ > + struct cgroup *cgroup; > + int err, fd; > + > + fd = *(int *)key; > + cgroup = cgroup_get_from_fd(fd); > + if (IS_ERR(cgroup)) > + return PTR_ERR(cgroup); > + > + bpf_cgroup_storage_lock(); > + err = cgroup_storage_delete(cgroup, map); > + bpf_cgroup_storage_unlock(); > + if (err) > + return err; > + > + cgroup_put(cgroup); > + return 0; > +} > + > +static int notsupp_get_next_key(struct bpf_map *map, void *key, void > *next_key) > +{ > + return -ENOTSUPP; > +} > + > +static struct bpf_map *cgroup_storage_map_alloc(union bpf_attr *attr) > +{ > + struct bpf_local_storage_map *smap; > + > + smap = bpf_local_storage_map_alloc(attr); > + if (IS_ERR(smap)) > + return ERR_CAST(smap); > + > + smap->cache_idx = bpf_local_storage_cache_idx_get(&cgroup_cache); > + return &smap->map; > +} > + > +static void cgroup_storage_map_free(struct bpf_map *map) > +{ > + struct bpf_local_storage_map *smap; > + > + smap = (struct bpf_local_storage_map *)map; > + bpf_local_storage_cache_idx_free(&cgroup_cache, smap->cache_idx); > + 
bpf_local_storage_map_free(smap, NULL); > +} > + > +/* *gfp_flags* is a hidden argument provided by the verifier */ > +BPF_CALL_5(bpf_cgroup_storage_get, struct bpf_map *, map, struct cgroup > *, cgroup, > + void *, value, u64, flags, gfp_t, gfp_flags) > +{ > + struct bpf_local_storage_data *sdata; > + > + WARN_ON_ONCE(!bpf_rcu_lock_held()); > + if (flags & ~(BPF_LOCAL_STORAGE_GET_F_CREATE)) > + return (unsigned long)NULL; > + > + if (!cgroup) > + return (unsigned long)NULL; > + > + if (!bpf_cgroup_storage_trylock()) > + return (unsigned long)NULL; > + > + sdata = cgroup_storage_lookup(cgroup, map, true); > + if (sdata) > + goto unlock; > + > + /* only allocate new storage, when the cgroup is refcounted */ > + if (!percpu_ref_is_dying(&cgroup->self.refcnt) && > + (flags & BPF_LOCAL_STORAGE_GET_F_CREATE)) > + sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map > *)map, > + value, BPF_NOEXIST, gfp_flags); > + > +unlock: > + bpf_cgroup_storage_unlock(); > + return IS_ERR_OR_NULL(sdata) ? 
(unsigned long)NULL : (unsigned > long)sdata->data; > +} > + > +BPF_CALL_2(bpf_cgroup_storage_delete, struct bpf_map *, map, struct > cgroup *, cgroup) > +{ > + int ret; > + > + WARN_ON_ONCE(!bpf_rcu_lock_held()); > + if (!cgroup) > + return -EINVAL; > + > + if (!bpf_cgroup_storage_trylock()) > + return -EBUSY; > + > + ret = cgroup_storage_delete(cgroup, map); > + bpf_cgroup_storage_unlock(); > + return ret; > +} > + > +BTF_ID_LIST_SINGLE(cgroup_storage_map_btf_ids, struct, > bpf_local_storage_map) > +const struct bpf_map_ops cgroup_local_storage_map_ops = { > + .map_meta_equal = bpf_map_meta_equal, > + .map_alloc_check = bpf_local_storage_map_alloc_check, > + .map_alloc = cgroup_storage_map_alloc, > + .map_free = cgroup_storage_map_free, > + .map_get_next_key = notsupp_get_next_key, > + .map_lookup_elem = bpf_cgroup_storage_lookup_elem, > + .map_update_elem = bpf_cgroup_storage_update_elem, > + .map_delete_elem = bpf_cgroup_storage_delete_elem, > + .map_check_btf = bpf_local_storage_map_check_btf, > + .map_btf_id = &cgroup_storage_map_btf_ids[0], > + .map_owner_storage_ptr = cgroup_storage_ptr, > +}; > + > +const struct bpf_func_proto bpf_cgroup_storage_get_proto = { > + .func = bpf_cgroup_storage_get, > + .gpl_only = false, > + .ret_type = RET_PTR_TO_MAP_VALUE_OR_NULL, > + .arg1_type = ARG_CONST_MAP_PTR, > + .arg2_type = ARG_PTR_TO_BTF_ID, > + .arg2_btf_id = &bpf_cgroup_btf_id[0], > + .arg3_type = ARG_PTR_TO_MAP_VALUE_OR_NULL, > + .arg4_type = ARG_ANYTHING, > +}; > + > +const struct bpf_func_proto bpf_cgroup_storage_delete_proto = { > + .func = bpf_cgroup_storage_delete, > + .gpl_only = false, > + .ret_type = RET_INTEGER, > + .arg1_type = ARG_CONST_MAP_PTR, > + .arg2_type = ARG_PTR_TO_BTF_ID, > + .arg2_btf_id = &bpf_cgroup_btf_id[0], > +}; > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c > index a6b04faed282..5c5bb08832ec 100644 > --- a/kernel/bpf/helpers.c > +++ b/kernel/bpf/helpers.c > @@ -1663,6 +1663,12 @@ bpf_base_func_proto(enum bpf_func_id 
func_id) > return &bpf_dynptr_write_proto; > case BPF_FUNC_dynptr_data: > return &bpf_dynptr_data_proto; > +#ifdef CONFIG_CGROUPS > + case BPF_FUNC_cgroup_local_storage_get: > + return &bpf_cgroup_storage_get_proto; > + case BPF_FUNC_cgroup_local_storage_delete: > + return &bpf_cgroup_storage_delete_proto; > +#endif > default: > break; > } > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c > index 7b373a5e861f..e53c7fae6e22 100644 > --- a/kernel/bpf/syscall.c > +++ b/kernel/bpf/syscall.c > @@ -1016,7 +1016,8 @@ static int map_check_btf(struct bpf_map *map, const > struct btf *btf, > map->map_type != BPF_MAP_TYPE_CGROUP_STORAGE && > map->map_type != BPF_MAP_TYPE_SK_STORAGE && > map->map_type != BPF_MAP_TYPE_INODE_STORAGE && > - map->map_type != BPF_MAP_TYPE_TASK_STORAGE) > + map->map_type != BPF_MAP_TYPE_TASK_STORAGE && > + map->map_type != BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE) > return -ENOTSUPP; > if (map->spin_lock_off + sizeof(struct bpf_spin_lock) > > map->value_size) { > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c > index 6f6d2d511c06..f36f6a3c0d50 100644 > --- a/kernel/bpf/verifier.c > +++ b/kernel/bpf/verifier.c > @@ -6360,6 +6360,11 @@ static int check_map_func_compatibility(struct > bpf_verifier_env *env, > func_id != BPF_FUNC_task_storage_delete) > goto error; > break; > + case BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE: > + if (func_id != BPF_FUNC_cgroup_local_storage_get && > + func_id != BPF_FUNC_cgroup_local_storage_delete) > + goto error; > + break; > case BPF_MAP_TYPE_BLOOM_FILTER: > if (func_id != BPF_FUNC_map_peek_elem && > func_id != BPF_FUNC_map_push_elem) > @@ -6472,6 +6477,11 @@ static int check_map_func_compatibility(struct > bpf_verifier_env *env, > if (map->map_type != BPF_MAP_TYPE_TASK_STORAGE) > goto error; > break; > + case BPF_FUNC_cgroup_local_storage_get: > + case BPF_FUNC_cgroup_local_storage_delete: > + if (map->map_type != BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE) > + goto error; > + break; > default: > break; > } > @@ -12713,6 
+12723,7 @@ static int check_map_prog_compatibility(struct > bpf_verifier_env *env, > case BPF_MAP_TYPE_INODE_STORAGE: > case BPF_MAP_TYPE_SK_STORAGE: > case BPF_MAP_TYPE_TASK_STORAGE: > + case BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE: > break; > default: > verbose(env, > @@ -14149,7 +14160,8 @@ static int do_misc_fixups(struct bpf_verifier_env > *env) > if (insn->imm == BPF_FUNC_task_storage_get || > insn->imm == BPF_FUNC_sk_storage_get || > - insn->imm == BPF_FUNC_inode_storage_get) { > + insn->imm == BPF_FUNC_inode_storage_get || > + insn->imm == BPF_FUNC_cgroup_local_storage_get) { > if (env->prog->aux->sleepable) > insn_buf[0] = BPF_MOV64_IMM(BPF_REG_5, (__force __s32)GFP_KERNEL); > else > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c > index 8ad2c267ff47..2fa2c950c7fb 100644 > --- a/kernel/cgroup/cgroup.c > +++ b/kernel/cgroup/cgroup.c > @@ -985,6 +985,10 @@ void put_css_set_locked(struct css_set *cset) > put_css_set_locked(cset->dom_cset); > } > +#ifdef CONFIG_BPF_SYSCALL > + bpf_local_cgroup_storage_free(cset->dfl_cgrp); > +#endif > + > kfree_rcu(cset, rcu_head); > } > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c > index 688552df95ca..179adaae4a9f 100644 > --- a/kernel/trace/bpf_trace.c > +++ b/kernel/trace/bpf_trace.c > @@ -1454,6 +1454,10 @@ bpf_tracing_func_proto(enum bpf_func_id func_id, > const struct bpf_prog *prog) > return &bpf_get_current_cgroup_id_proto; > case BPF_FUNC_get_current_ancestor_cgroup_id: > return &bpf_get_current_ancestor_cgroup_id_proto; > + case BPF_FUNC_cgroup_local_storage_get: > + return &bpf_cgroup_storage_get_proto; > + case BPF_FUNC_cgroup_local_storage_delete: > + return &bpf_cgroup_storage_delete_proto; > #endif > case BPF_FUNC_send_signal: > return &bpf_send_signal_proto; > diff --git a/scripts/bpf_doc.py b/scripts/bpf_doc.py > index c0e6690be82a..fdb0aff8cb5a 100755 > --- a/scripts/bpf_doc.py > +++ b/scripts/bpf_doc.py > @@ -685,6 +685,7 @@ class PrinterHelpers(Printer): > 'struct udp6_sock', > 
'struct unix_sock', > 'struct task_struct', > + 'struct cgroup', > 'struct __sk_buff', > 'struct sk_msg_md', > @@ -742,6 +743,7 @@ class PrinterHelpers(Printer): > 'struct udp6_sock', > 'struct unix_sock', > 'struct task_struct', > + 'struct cgroup', > 'struct path', > 'struct btf_ptr', > 'struct inode', > diff --git a/tools/include/uapi/linux/bpf.h > b/tools/include/uapi/linux/bpf.h > index 17f61338f8f8..d918b4054297 100644 > --- a/tools/include/uapi/linux/bpf.h > +++ b/tools/include/uapi/linux/bpf.h > @@ -935,6 +935,7 @@ enum bpf_map_type { > BPF_MAP_TYPE_TASK_STORAGE, > BPF_MAP_TYPE_BLOOM_FILTER, > BPF_MAP_TYPE_USER_RINGBUF, > + BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, > }; > /* Note that tracing related programs such as > @@ -5435,6 +5436,42 @@ union bpf_attr { > * **-E2BIG** if user-space has tried to publish a sample which is > * larger than the size of the ring buffer, or which cannot fit > * within a struct bpf_dynptr. > + * > + * void *bpf_cgroup_local_storage_get(struct bpf_map *map, struct cgroup > *cgroup, void *value, u64 flags) > + * Description > + * Get a bpf_local_storage from the *cgroup*. > + * > + * Logically, it could be thought of as getting the value from > + * a *map* with *cgroup* as the **key**. From this > + * perspective, the usage is not much different from > + * **bpf_map_lookup_elem**\ (*map*, **&**\ *cgroup*) except this > + * helper enforces the key must be a cgroup struct and the map must also > + * be a **BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE**. > + * > + * Underneath, the value is stored locally at *cgroup* instead of > + * the *map*. The *map* is used as the bpf-local-storage > + * "type". The bpf-local-storage "type" (i.e. the *map*) is > + * searched against all bpf_local_storage residing at *cgroup*. > + * > + * An optional *flags* (**BPF_LOCAL_STORAGE_GET_F_CREATE**) can be > + * used such that a new bpf_local_storage will be > + * created if one does not exist. 
*value* can be used > + * together with **BPF_LOCAL_STORAGE_GET_F_CREATE** to specify > + * the initial value of a bpf_local_storage. If *value* is > + * **NULL**, the new bpf_local_storage will be zero initialized. > + * Return > + * A bpf_local_storage pointer is returned on success. > + * > + * **NULL** if not found or there was an error in adding > + * a new bpf_local_storage. > + * > + * long bpf_cgroup_local_storage_delete(struct bpf_map *map, struct > cgroup *cgroup) > + * Description > + * Delete a bpf_local_storage from a *cgroup*. > + * Return > + * 0 on success. > + * > + * **-ENOENT** if the bpf_local_storage cannot be found. > */ > #define ___BPF_FUNC_MAPPER(FN, ctx...) \ > FN(unspec, 0, ##ctx) \ > @@ -5647,6 +5684,8 @@ union bpf_attr { > FN(tcp_raw_check_syncookie_ipv6, 207, ##ctx) \ > FN(ktime_get_tai_ns, 208, ##ctx) \ > FN(user_ringbuf_drain, 209, ##ctx) \ > + FN(cgroup_local_storage_get, 210, ##ctx) \ > + FN(cgroup_local_storage_delete, 211, ##ctx) \ > /* */ > /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that > don't > -- > 2.30.2 ^ permalink raw reply [flat|nested] 38+ messages in thread
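One way to see what the trylock in this patch buys (compared to inode storage, which has no such guard) is a minimal userspace analog. The kernel version pairs a per-CPU counter with migrate_disable(); the plain counter below only approximates that for illustration, and none of these names are kernel API:

```c
#include <stdbool.h>

/* Analog of the bpf_cgroup_storage_busy guard: trylock fails
 * instead of deadlocking when the same context re-enters, e.g.
 * a tracing program firing while a storage operation already
 * holds the "lock".
 */
static int storage_busy;

static bool storage_trylock(void)
{
	if (++storage_busy != 1) {
		/* re-entry detected: back out and report failure */
		--storage_busy;
		return false;
	}
	return true;
}

static void storage_unlock(void)
{
	--storage_busy;
}
```

A nested entry observes the counter already above 1 and fails, which is why the helpers return NULL/-EBUSY rather than blocking on local_storage->lock.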
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs 2022-10-17 18:01 ` sdf @ 2022-10-17 18:25 ` Yosry Ahmed 2022-10-17 18:43 ` Stanislav Fomichev 2022-10-17 20:10 ` Yonghong Song 2022-10-17 19:23 ` Yonghong Song 2022-10-17 22:26 ` Martin KaFai Lau 2 siblings, 2 replies; 38+ messages in thread From: Yosry Ahmed @ 2022-10-17 18:25 UTC (permalink / raw) To: sdf Cc: Yonghong Song, bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo On Mon, Oct 17, 2022 at 11:02 AM <sdf@google.com> wrote: > > On 10/13, Yonghong Song wrote: > > Similar to sk/inode/task storage, implement similar cgroup local storage. > > > There already exists a local storage implementation for cgroup-attached > > bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper > > bpf_get_local_storage(). But there are use cases such that non-cgroup > > attached bpf progs wants to access cgroup local storage data. For example, > > tc egress prog has access to sk and cgroup. It is possible to use > > sk local storage to emulate cgroup local storage by storing data in > > socket. > > But this is a waste as it could be lots of sockets belonging to a > > particular > > cgroup. Alternatively, a separate map can be created with cgroup id as > > the key. > > But this will introduce additional overhead to manipulate the new map. > > A cgroup local storage, similar to existing sk/inode/task storage, > > should help for this use case. > > > The life-cycle of storage is managed with the life-cycle of the > > cgroup struct. i.e. the storage is destroyed along with the owning cgroup > > with a callback to the bpf_cgroup_storage_free when cgroup itself > > is deleted. > > > The userspace map operations can be done by using a cgroup fd as a key > > passed to the lookup, update and delete operations. > > > [..] 
> > > Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old cgroup > > local > > storage support, the new map name BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE is > > used > > for cgroup storage available to non-cgroup-attached bpf programs. The two > > helpers are named as bpf_cgroup_local_storage_get() and > > bpf_cgroup_local_storage_delete(). > > Have you considered doing something similar to 7d9c3427894f ("bpf: Make > cgroup storages shared between programs on the same cgroup") where > the map changes its behavior depending on the key size (see key_size checks > in cgroup_storage_map_alloc)? Looks like sizeof(int) for fd still > can be used so we can, in theory, reuse the name.. > > Pros: > - no need for a new map name > > Cons: > - existing BPF_MAP_TYPE_CGROUP_STORAGE is already messy; might be not a > good idea to add more stuff to it? > > But, for the very least, should we also extend > Documentation/bpf/map_cgroup_storage.rst to cover the new map? We've > tried to keep some of the important details in there.. This might be a long shot, but is it possible to switch completely to this new generic cgroup storage, and for programs that attach to cgroups we can still do lookups/allocations during attachment like we do today? IOW, maintain the current API for cgroup progs but switch it to use this new map type instead. It feels like this map type is more generic and can be a superset of the existing cgroup storage, but I feel like I am missing something. 
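For orientation, the cover letter says userspace map operations on the new map take a cgroup fd as the key. A rough sketch of that from userspace follows; the map type and key convention come from the patch, but the function name, cgroup path, value layout, and error handling are illustrative guesses, and running this would require a kernel carrying this series:

```c
#include <bpf/bpf.h>
#include <fcntl.h>
#include <unistd.h>

struct value {
	long packets; /* illustrative value layout, not from the patch */
};

/* Update the map entry for one cgroup: the cgroup fd is the key,
 * mirroring how task storage uses a pid from userspace.
 */
int update_cgroup_value(int map_fd, const char *cgrp_path)
{
	struct value v = { .packets = 0 };
	int err, cg_fd;

	cg_fd = open(cgrp_path, O_RDONLY | O_DIRECTORY);
	if (cg_fd < 0)
		return -1;

	err = bpf_map_update_elem(map_fd, &cg_fd, &v, BPF_ANY);
	close(cg_fd);
	return err;
}
```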
> > > Signed-off-by: Yonghong Song <yhs@fb.com> > > --- > > include/linux/bpf.h | 3 + > > include/linux/bpf_types.h | 1 + > > include/linux/cgroup-defs.h | 4 + > > include/uapi/linux/bpf.h | 39 +++++ > > kernel/bpf/Makefile | 2 +- > > kernel/bpf/bpf_cgroup_storage.c | 280 ++++++++++++++++++++++++++++++++ > > kernel/bpf/helpers.c | 6 + > > kernel/bpf/syscall.c | 3 +- > > kernel/bpf/verifier.c | 14 +- > > kernel/cgroup/cgroup.c | 4 + > > kernel/trace/bpf_trace.c | 4 + > > scripts/bpf_doc.py | 2 + > > tools/include/uapi/linux/bpf.h | 39 +++++ > > 13 files changed, 398 insertions(+), 3 deletions(-) > > create mode 100644 kernel/bpf/bpf_cgroup_storage.c > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h > > index 9e7d46d16032..1395a01c7f18 100644 > > --- a/include/linux/bpf.h > > +++ b/include/linux/bpf.h > > @@ -2045,6 +2045,7 @@ struct bpf_link *bpf_link_by_id(u32 id); > > > const struct bpf_func_proto *bpf_base_func_proto(enum bpf_func_id > > func_id); > > void bpf_task_storage_free(struct task_struct *task); > > +void bpf_local_cgroup_storage_free(struct cgroup *cgroup); > > bool bpf_prog_has_kfunc_call(const struct bpf_prog *prog); > > const struct btf_func_model * > > bpf_jit_find_kfunc_model(const struct bpf_prog *prog, > > @@ -2537,6 +2538,8 @@ extern const struct bpf_func_proto > > bpf_copy_from_user_task_proto; > > extern const struct bpf_func_proto bpf_set_retval_proto; > > extern const struct bpf_func_proto bpf_get_retval_proto; > > extern const struct bpf_func_proto bpf_user_ringbuf_drain_proto; > > +extern const struct bpf_func_proto bpf_cgroup_storage_get_proto; > > +extern const struct bpf_func_proto bpf_cgroup_storage_delete_proto; > > > const struct bpf_func_proto *tracing_prog_func_proto( > > enum bpf_func_id func_id, const struct bpf_prog *prog); > > diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h > > index 2c6a4f2562a7..7a0362d7a0aa 100644 > > --- a/include/linux/bpf_types.h > > +++ b/include/linux/bpf_types.h > > @@ 
-90,6 +90,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_ARRAY, > > cgroup_array_map_ops) > > #ifdef CONFIG_CGROUP_BPF > > BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_STORAGE, cgroup_storage_map_ops) > > BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE, cgroup_storage_map_ops) > > +BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, > > cgroup_local_storage_map_ops) > > #endif > > BPF_MAP_TYPE(BPF_MAP_TYPE_HASH, htab_map_ops) > > BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_HASH, htab_percpu_map_ops) > > diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h > > index 4bcf56b3491c..c6f4590dda68 100644 > > --- a/include/linux/cgroup-defs.h > > +++ b/include/linux/cgroup-defs.h > > @@ -504,6 +504,10 @@ struct cgroup { > > /* Used to store internal freezer state */ > > struct cgroup_freezer_state freezer; > > > +#ifdef CONFIG_BPF_SYSCALL > > + struct bpf_local_storage __rcu *bpf_cgroup_storage; > > +#endif > > + > > /* ids of the ancestors at each level including self */ > > u64 ancestor_ids[]; > > }; > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > > index 17f61338f8f8..d918b4054297 100644 > > --- a/include/uapi/linux/bpf.h > > +++ b/include/uapi/linux/bpf.h > > @@ -935,6 +935,7 @@ enum bpf_map_type { > > BPF_MAP_TYPE_TASK_STORAGE, > > BPF_MAP_TYPE_BLOOM_FILTER, > > BPF_MAP_TYPE_USER_RINGBUF, > > + BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, > > }; > > > /* Note that tracing related programs such as > > @@ -5435,6 +5436,42 @@ union bpf_attr { > > * **-E2BIG** if user-space has tried to publish a sample which is > > * larger than the size of the ring buffer, or which cannot fit > > * within a struct bpf_dynptr. > > + * > > + * void *bpf_cgroup_local_storage_get(struct bpf_map *map, struct cgroup > > *cgroup, void *value, u64 flags) > > + * Description > > + * Get a bpf_local_storage from the *cgroup*. > > + * > > + * Logically, it could be thought of as getting the value from > > + * a *map* with *cgroup* as the **key**. 
From this > > + * perspective, the usage is not much different from > > + * **bpf_map_lookup_elem**\ (*map*, **&**\ *cgroup*) except this > > + * helper enforces the key must be a cgroup struct and the map must also > > + * be a **BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE**. > > + * > > + * Underneath, the value is stored locally at *cgroup* instead of > > + * the *map*. The *map* is used as the bpf-local-storage > > + * "type". The bpf-local-storage "type" (i.e. the *map*) is > > + * searched against all bpf_local_storage residing at *cgroup*. > > + * > > + * An optional *flags* (**BPF_LOCAL_STORAGE_GET_F_CREATE**) can be > > + * used such that a new bpf_local_storage will be > > + * created if one does not exist. *value* can be used > > + * together with **BPF_LOCAL_STORAGE_GET_F_CREATE** to specify > > + * the initial value of a bpf_local_storage. If *value* is > > + * **NULL**, the new bpf_local_storage will be zero initialized. > > + * Return > > + * A bpf_local_storage pointer is returned on success. > > + * > > + * **NULL** if not found or there was an error in adding > > + * a new bpf_local_storage. > > + * > > + * long bpf_cgroup_local_storage_delete(struct bpf_map *map, struct > > cgroup *cgroup) > > + * Description > > + * Delete a bpf_local_storage from a *cgroup*. > > + * Return > > + * 0 on success. > > + * > > + * **-ENOENT** if the bpf_local_storage cannot be found. > > */ > > #define ___BPF_FUNC_MAPPER(FN, ctx...) 
\ > > FN(unspec, 0, ##ctx) \ > > @@ -5647,6 +5684,8 @@ union bpf_attr { > > FN(tcp_raw_check_syncookie_ipv6, 207, ##ctx) \ > > FN(ktime_get_tai_ns, 208, ##ctx) \ > > FN(user_ringbuf_drain, 209, ##ctx) \ > > + FN(cgroup_local_storage_get, 210, ##ctx) \ > > + FN(cgroup_local_storage_delete, 211, ##ctx) \ > > /* */ > > > /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that > > don't > > diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile > > index 341c94f208f4..b02693f51978 100644 > > --- a/kernel/bpf/Makefile > > +++ b/kernel/bpf/Makefile > > @@ -25,7 +25,7 @@ ifeq ($(CONFIG_PERF_EVENTS),y) > > obj-$(CONFIG_BPF_SYSCALL) += stackmap.o > > endif > > ifeq ($(CONFIG_CGROUPS),y) > > -obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o > > +obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o bpf_cgroup_storage.o > > endif > > obj-$(CONFIG_CGROUP_BPF) += cgroup.o > > ifeq ($(CONFIG_INET),y) > > diff --git a/kernel/bpf/bpf_cgroup_storage.c > > b/kernel/bpf/bpf_cgroup_storage.c > > new file mode 100644 > > index 000000000000..9974784822da > > --- /dev/null > > +++ b/kernel/bpf/bpf_cgroup_storage.c > > @@ -0,0 +1,280 @@ > > +// SPDX-License-Identifier: GPL-2.0 > > +/* > > + * Copyright (c) 2022 Meta Platforms, Inc. and affiliates. 
> > + */ > > + > > +#include <linux/types.h> > > +#include <linux/bpf.h> > > +#include <linux/bpf_local_storage.h> > > +#include <uapi/linux/btf.h> > > +#include <linux/btf_ids.h> > > + > > +DEFINE_BPF_STORAGE_CACHE(cgroup_cache); > > + > > +static DEFINE_PER_CPU(int, bpf_cgroup_storage_busy); > > + > > +static void bpf_cgroup_storage_lock(void) > > +{ > > + migrate_disable(); > > + this_cpu_inc(bpf_cgroup_storage_busy); > > +} > > + > > +static void bpf_cgroup_storage_unlock(void) > > +{ > > + this_cpu_dec(bpf_cgroup_storage_busy); > > + migrate_enable(); > > +} > > + > > +static bool bpf_cgroup_storage_trylock(void) > > +{ > > + migrate_disable(); > > + if (unlikely(this_cpu_inc_return(bpf_cgroup_storage_busy) != 1)) { > > + this_cpu_dec(bpf_cgroup_storage_busy); > > + migrate_enable(); > > + return false; > > + } > > + return true; > > +} > > Task storage has lock/unlock/trylock; inode storage doesn't; why does > cgroup need it as well? > > > +static struct bpf_local_storage __rcu **cgroup_storage_ptr(void *owner) > > +{ > > + struct cgroup *cg = owner; > > + > > + return &cg->bpf_cgroup_storage; > > +} > > + > > +void bpf_local_cgroup_storage_free(struct cgroup *cgroup) > > +{ > > + struct bpf_local_storage *local_storage; > > + struct bpf_local_storage_elem *selem; > > + bool free_cgroup_storage = false; > > + struct hlist_node *n; > > + unsigned long flags; > > + > > + rcu_read_lock(); > > + local_storage = rcu_dereference(cgroup->bpf_cgroup_storage); > > + if (!local_storage) { > > + rcu_read_unlock(); > > + return; > > + } > > + > > + /* Neither the bpf_prog nor the bpf-map's syscall > > + * could be modifying the local_storage->list now. > > + * Thus, no elem can be added-to or deleted-from the > > + * local_storage->list by the bpf_prog or by the bpf-map's syscall. > > + * > > + * It is racing with bpf_local_storage_map_free() alone > > + * when unlinking elem from the local_storage->list and > > + * the map's bucket->list. 
> > + */ > > + bpf_cgroup_storage_lock(); > > + raw_spin_lock_irqsave(&local_storage->lock, flags); > > + hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) { > > + bpf_selem_unlink_map(selem); > > + free_cgroup_storage = > > + bpf_selem_unlink_storage_nolock(local_storage, selem, false, false); > > + } > > + raw_spin_unlock_irqrestore(&local_storage->lock, flags); > > + bpf_cgroup_storage_unlock(); > > + rcu_read_unlock(); > > + > > + /* free_cgroup_storage should always be true as long as > > + * local_storage->list was non-empty. > > + */ > > + if (free_cgroup_storage) > > + kfree_rcu(local_storage, rcu); > > +} > > > +static struct bpf_local_storage_data * > > +cgroup_storage_lookup(struct cgroup *cgroup, struct bpf_map *map, bool > > cacheit_lockit) > > +{ > > + struct bpf_local_storage *cgroup_storage; > > + struct bpf_local_storage_map *smap; > > + > > + cgroup_storage = rcu_dereference_check(cgroup->bpf_cgroup_storage, > > + bpf_rcu_lock_held()); > > + if (!cgroup_storage) > > + return NULL; > > + > > + smap = (struct bpf_local_storage_map *)map; > > + return bpf_local_storage_lookup(cgroup_storage, smap, cacheit_lockit); > > +} > > + > > +static void *bpf_cgroup_storage_lookup_elem(struct bpf_map *map, void > > *key) > > +{ > > + struct bpf_local_storage_data *sdata; > > + struct cgroup *cgroup; > > + int fd; > > + > > + fd = *(int *)key; > > + cgroup = cgroup_get_from_fd(fd); > > + if (IS_ERR(cgroup)) > > + return ERR_CAST(cgroup); > > + > > + bpf_cgroup_storage_lock(); > > + sdata = cgroup_storage_lookup(cgroup, map, true); > > + bpf_cgroup_storage_unlock(); > > + cgroup_put(cgroup); > > + return sdata ? sdata->data : NULL; > > +} > > A lot of the above (free/lookup) seems to be copy-pasted from the task > storage; > any point in trying to generalize the common parts? 
>
> > +static int bpf_cgroup_storage_update_elem(struct bpf_map *map, void *key,
> > +					  void *value, u64 map_flags)
> > +{
> > +	struct bpf_local_storage_data *sdata;
> > +	struct cgroup *cgroup;
> > +	int err, fd;
> > +
> > +	fd = *(int *)key;
> > +	cgroup = cgroup_get_from_fd(fd);
> > +	if (IS_ERR(cgroup))
> > +		return PTR_ERR(cgroup);
> > +
> > +	bpf_cgroup_storage_lock();
> > +	sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map *)map,
> > +					 value, map_flags, GFP_ATOMIC);
> > +	bpf_cgroup_storage_unlock();
> > +	err = PTR_ERR_OR_ZERO(sdata);
> > +	cgroup_put(cgroup);
> > +	return err;
> > +}
> > +
> > +static int cgroup_storage_delete(struct cgroup *cgroup, struct bpf_map *map)
> > +{
> > +	struct bpf_local_storage_data *sdata;
> > +
> > +	sdata = cgroup_storage_lookup(cgroup, map, false);
> > +	if (!sdata)
> > +		return -ENOENT;
> > +
> > +	bpf_selem_unlink(SELEM(sdata), true);
> > +	return 0;
> > +}
> > +
> > +static int bpf_cgroup_storage_delete_elem(struct bpf_map *map, void *key)
> > +{
> > +	struct cgroup *cgroup;
> > +	int err, fd;
> > +
> > +	fd = *(int *)key;
> > +	cgroup = cgroup_get_from_fd(fd);
> > +	if (IS_ERR(cgroup))
> > +		return PTR_ERR(cgroup);
> > +
> > +	bpf_cgroup_storage_lock();
> > +	err = cgroup_storage_delete(cgroup, map);
> > +	bpf_cgroup_storage_unlock();
> > +	if (err)
> > +		return err;
> > +
> > +	cgroup_put(cgroup);
> > +	return 0;
> > +}
> > +
> > +static int notsupp_get_next_key(struct bpf_map *map, void *key, void *next_key)
> > +{
> > +	return -ENOTSUPP;
> > +}
> > +
> > +static struct bpf_map *cgroup_storage_map_alloc(union bpf_attr *attr)
> > +{
> > +	struct bpf_local_storage_map *smap;
> > +
> > +	smap = bpf_local_storage_map_alloc(attr);
> > +	if (IS_ERR(smap))
> > +		return ERR_CAST(smap);
> > +
> > +	smap->cache_idx = bpf_local_storage_cache_idx_get(&cgroup_cache);
> > +	return &smap->map;
> > +}
> > +
> > +static void cgroup_storage_map_free(struct bpf_map *map)
> > +{
> > +	struct bpf_local_storage_map *smap;
> > +
> > +	smap = (struct bpf_local_storage_map *)map;
> > +	bpf_local_storage_cache_idx_free(&cgroup_cache, smap->cache_idx);
> > +	bpf_local_storage_map_free(smap, NULL);
> > +}
> > +
> > +/* *gfp_flags* is a hidden argument provided by the verifier */
> > +BPF_CALL_5(bpf_cgroup_storage_get, struct bpf_map *, map, struct cgroup *, cgroup,
> > +	   void *, value, u64, flags, gfp_t, gfp_flags)
> > +{
> > +	struct bpf_local_storage_data *sdata;
> > +
> > +	WARN_ON_ONCE(!bpf_rcu_lock_held());
> > +	if (flags & ~(BPF_LOCAL_STORAGE_GET_F_CREATE))
> > +		return (unsigned long)NULL;
> > +
> > +	if (!cgroup)
> > +		return (unsigned long)NULL;
> > +
> > +	if (!bpf_cgroup_storage_trylock())
> > +		return (unsigned long)NULL;
> > +
> > +	sdata = cgroup_storage_lookup(cgroup, map, true);
> > +	if (sdata)
> > +		goto unlock;
> > +
> > +	/* only allocate new storage, when the cgroup is refcounted */
> > +	if (!percpu_ref_is_dying(&cgroup->self.refcnt) &&
> > +	    (flags & BPF_LOCAL_STORAGE_GET_F_CREATE))
> > +		sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map *)map,
> > +						 value, BPF_NOEXIST, gfp_flags);
> > +
> > +unlock:
> > +	bpf_cgroup_storage_unlock();
> > +	return IS_ERR_OR_NULL(sdata) ? (unsigned long)NULL : (unsigned long)sdata->data;
> > +}
> > +
> > +BPF_CALL_2(bpf_cgroup_storage_delete, struct bpf_map *, map, struct cgroup *, cgroup)
> > +{
> > +	int ret;
> > +
> > +	WARN_ON_ONCE(!bpf_rcu_lock_held());
> > +	if (!cgroup)
> > +		return -EINVAL;
> > +
> > +	if (!bpf_cgroup_storage_trylock())
> > +		return -EBUSY;
> > +
> > +	ret = cgroup_storage_delete(cgroup, map);
> > +	bpf_cgroup_storage_unlock();
> > +	return ret;
> > +}
> > +
> > +BTF_ID_LIST_SINGLE(cgroup_storage_map_btf_ids, struct, bpf_local_storage_map)
> > +const struct bpf_map_ops cgroup_local_storage_map_ops = {
> > +	.map_meta_equal = bpf_map_meta_equal,
> > +	.map_alloc_check = bpf_local_storage_map_alloc_check,
> > +	.map_alloc = cgroup_storage_map_alloc,
> > +	.map_free = cgroup_storage_map_free,
> > +	.map_get_next_key = notsupp_get_next_key,
> > +	.map_lookup_elem = bpf_cgroup_storage_lookup_elem,
> > +	.map_update_elem = bpf_cgroup_storage_update_elem,
> > +	.map_delete_elem = bpf_cgroup_storage_delete_elem,
> > +	.map_check_btf = bpf_local_storage_map_check_btf,
> > +	.map_btf_id = &cgroup_storage_map_btf_ids[0],
> > +	.map_owner_storage_ptr = cgroup_storage_ptr,
> > +};
> > +
> > +const struct bpf_func_proto bpf_cgroup_storage_get_proto = {
> > +	.func		= bpf_cgroup_storage_get,
> > +	.gpl_only	= false,
> > +	.ret_type	= RET_PTR_TO_MAP_VALUE_OR_NULL,
> > +	.arg1_type	= ARG_CONST_MAP_PTR,
> > +	.arg2_type	= ARG_PTR_TO_BTF_ID,
> > +	.arg2_btf_id	= &bpf_cgroup_btf_id[0],
> > +	.arg3_type	= ARG_PTR_TO_MAP_VALUE_OR_NULL,
> > +	.arg4_type	= ARG_ANYTHING,
> > +};
> > +
> > +const struct bpf_func_proto bpf_cgroup_storage_delete_proto = {
> > +	.func		= bpf_cgroup_storage_delete,
> > +	.gpl_only	= false,
> > +	.ret_type	= RET_INTEGER,
> > +	.arg1_type	= ARG_CONST_MAP_PTR,
> > +	.arg2_type	= ARG_PTR_TO_BTF_ID,
> > +	.arg2_btf_id	= &bpf_cgroup_btf_id[0],
> > +};
> > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> > index a6b04faed282..5c5bb08832ec 100644
> > --- a/kernel/bpf/helpers.c
> > +++ b/kernel/bpf/helpers.c
> > @@ -1663,6 +1663,12 @@ bpf_base_func_proto(enum bpf_func_id func_id)
> > 		return &bpf_dynptr_write_proto;
> > 	case BPF_FUNC_dynptr_data:
> > 		return &bpf_dynptr_data_proto;
> > +#ifdef CONFIG_CGROUPS
> > +	case BPF_FUNC_cgroup_local_storage_get:
> > +		return &bpf_cgroup_storage_get_proto;
> > +	case BPF_FUNC_cgroup_local_storage_delete:
> > +		return &bpf_cgroup_storage_delete_proto;
> > +#endif
> > 	default:
> > 		break;
> > 	}
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index 7b373a5e861f..e53c7fae6e22 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -1016,7 +1016,8 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
> > 	    map->map_type != BPF_MAP_TYPE_CGROUP_STORAGE &&
> > 	    map->map_type != BPF_MAP_TYPE_SK_STORAGE &&
> > 	    map->map_type != BPF_MAP_TYPE_INODE_STORAGE &&
> > -	    map->map_type != BPF_MAP_TYPE_TASK_STORAGE)
> > +	    map->map_type != BPF_MAP_TYPE_TASK_STORAGE &&
> > +	    map->map_type != BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE)
> > 		return -ENOTSUPP;
> > 	if (map->spin_lock_off + sizeof(struct bpf_spin_lock) >
> > 	    map->value_size) {
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index 6f6d2d511c06..f36f6a3c0d50 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -6360,6 +6360,11 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
> > 		    func_id != BPF_FUNC_task_storage_delete)
> > 			goto error;
> > 		break;
> > +	case BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE:
> > +		if (func_id != BPF_FUNC_cgroup_local_storage_get &&
> > +		    func_id != BPF_FUNC_cgroup_local_storage_delete)
> > +			goto error;
> > +		break;
> > 	case BPF_MAP_TYPE_BLOOM_FILTER:
> > 		if (func_id != BPF_FUNC_map_peek_elem &&
> > 		    func_id != BPF_FUNC_map_push_elem)
> > @@ -6472,6 +6477,11 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
> > 		if (map->map_type != BPF_MAP_TYPE_TASK_STORAGE)
> > 			goto error;
> > 		break;
> > +	case BPF_FUNC_cgroup_local_storage_get:
> > +	case BPF_FUNC_cgroup_local_storage_delete:
> > +		if (map->map_type != BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE)
> > +			goto error;
> > +		break;
> > 	default:
> > 		break;
> > 	}
> > @@ -12713,6 +12723,7 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
> > 	case BPF_MAP_TYPE_INODE_STORAGE:
> > 	case BPF_MAP_TYPE_SK_STORAGE:
> > 	case BPF_MAP_TYPE_TASK_STORAGE:
> > +	case BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE:
> > 		break;
> > 	default:
> > 		verbose(env,
> > @@ -14149,7 +14160,8 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
> >
> > 		if (insn->imm == BPF_FUNC_task_storage_get ||
> > 		    insn->imm == BPF_FUNC_sk_storage_get ||
> > -		    insn->imm == BPF_FUNC_inode_storage_get) {
> > +		    insn->imm == BPF_FUNC_inode_storage_get ||
> > +		    insn->imm == BPF_FUNC_cgroup_local_storage_get) {
> > 			if (env->prog->aux->sleepable)
> > 				insn_buf[0] = BPF_MOV64_IMM(BPF_REG_5, (__force __s32)GFP_KERNEL);
> > 			else
> > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> > index 8ad2c267ff47..2fa2c950c7fb 100644
> > --- a/kernel/cgroup/cgroup.c
> > +++ b/kernel/cgroup/cgroup.c
> > @@ -985,6 +985,10 @@ void put_css_set_locked(struct css_set *cset)
> > 		put_css_set_locked(cset->dom_cset);
> > 	}
> >
> > +#ifdef CONFIG_BPF_SYSCALL
> > +	bpf_local_cgroup_storage_free(cset->dfl_cgrp);
> > +#endif
> > +

I am confused about this freeing site. It seems like this path is for
freeing css_set's of task_structs, not for freeing the cgroup itself.
Wouldn't we want to free the local storage when we free the cgroup
itself? Somewhere like css_free_rwork_fn()? or did I completely miss
the point here?
> > 	kfree_rcu(cset, rcu_head);
> > }
> >
> > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> > index 688552df95ca..179adaae4a9f 100644
> > --- a/kernel/trace/bpf_trace.c
> > +++ b/kernel/trace/bpf_trace.c
> > @@ -1454,6 +1454,10 @@ bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> > 		return &bpf_get_current_cgroup_id_proto;
> > 	case BPF_FUNC_get_current_ancestor_cgroup_id:
> > 		return &bpf_get_current_ancestor_cgroup_id_proto;
> > +	case BPF_FUNC_cgroup_local_storage_get:
> > +		return &bpf_cgroup_storage_get_proto;
> > +	case BPF_FUNC_cgroup_local_storage_delete:
> > +		return &bpf_cgroup_storage_delete_proto;
> > #endif
> > 	case BPF_FUNC_send_signal:
> > 		return &bpf_send_signal_proto;
> > diff --git a/scripts/bpf_doc.py b/scripts/bpf_doc.py
> > index c0e6690be82a..fdb0aff8cb5a 100755
> > --- a/scripts/bpf_doc.py
> > +++ b/scripts/bpf_doc.py
> > @@ -685,6 +685,7 @@ class PrinterHelpers(Printer):
> > 	    'struct udp6_sock',
> > 	    'struct unix_sock',
> > 	    'struct task_struct',
> > +	    'struct cgroup',
> >
> > 	    'struct __sk_buff',
> > 	    'struct sk_msg_md',
> > @@ -742,6 +743,7 @@ class PrinterHelpers(Printer):
> > 	    'struct udp6_sock',
> > 	    'struct unix_sock',
> > 	    'struct task_struct',
> > +	    'struct cgroup',
> > 	    'struct path',
> > 	    'struct btf_ptr',
> > 	    'struct inode',
> > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> > index 17f61338f8f8..d918b4054297 100644
> > --- a/tools/include/uapi/linux/bpf.h
> > +++ b/tools/include/uapi/linux/bpf.h
> > @@ -935,6 +935,7 @@ enum bpf_map_type {
> > 	BPF_MAP_TYPE_TASK_STORAGE,
> > 	BPF_MAP_TYPE_BLOOM_FILTER,
> > 	BPF_MAP_TYPE_USER_RINGBUF,
> > +	BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE,
> > };
> >
> > /* Note that tracing related programs such as
> > @@ -5435,6 +5436,42 @@ union bpf_attr {
> >  *		**-E2BIG** if user-space has tried to publish a sample which is
> >  *		larger than the size of the ring buffer, or which cannot fit
> >  *		within a struct bpf_dynptr.
> > + *
> > + * void *bpf_cgroup_local_storage_get(struct bpf_map *map, struct cgroup *cgroup, void *value, u64 flags)
> > + *	Description
> > + *		Get a bpf_local_storage from the *cgroup*.
> > + *
> > + *		Logically, it could be thought of as getting the value from
> > + *		a *map* with *cgroup* as the **key**. From this
> > + *		perspective, the usage is not much different from
> > + *		**bpf_map_lookup_elem**\ (*map*, **&**\ *cgroup*) except this
> > + *		helper enforces the key must be a cgroup struct and the map must also
> > + *		be a **BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE**.
> > + *
> > + *		Underneath, the value is stored locally at *cgroup* instead of
> > + *		the *map*. The *map* is used as the bpf-local-storage
> > + *		"type". The bpf-local-storage "type" (i.e. the *map*) is
> > + *		searched against all bpf_local_storage residing at *cgroup*.
> > + *
> > + *		An optional *flags* (**BPF_LOCAL_STORAGE_GET_F_CREATE**) can be
> > + *		used such that a new bpf_local_storage will be
> > + *		created if one does not exist. *value* can be used
> > + *		together with **BPF_LOCAL_STORAGE_GET_F_CREATE** to specify
> > + *		the initial value of a bpf_local_storage. If *value* is
> > + *		**NULL**, the new bpf_local_storage will be zero initialized.
> > + *	Return
> > + *		A bpf_local_storage pointer is returned on success.
> > + *
> > + *		**NULL** if not found or there was an error in adding
> > + *		a new bpf_local_storage.
> > + *
> > + * long bpf_cgroup_local_storage_delete(struct bpf_map *map, struct cgroup *cgroup)
> > + *	Description
> > + *		Delete a bpf_local_storage from a *cgroup*.
> > + *	Return
> > + *		0 on success.
> > + *
> > + *		**-ENOENT** if the bpf_local_storage cannot be found.
> >  */
> > #define ___BPF_FUNC_MAPPER(FN, ctx...)			\
> > 	FN(unspec, 0, ##ctx)				\
> > @@ -5647,6 +5684,8 @@ union bpf_attr {
> > 	FN(tcp_raw_check_syncookie_ipv6, 207, ##ctx)	\
> > 	FN(ktime_get_tai_ns, 208, ##ctx)		\
> > 	FN(user_ringbuf_drain, 209, ##ctx)		\
> > +	FN(cgroup_local_storage_get, 210, ##ctx)	\
> > +	FN(cgroup_local_storage_delete, 211, ##ctx)	\
> > /* */
> >
> > /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that don't
> > --
> > 2.30.2
>

^ permalink raw reply	[flat|nested] 38+ messages in thread
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs
  2022-10-17 18:25           ` Yosry Ahmed
@ 2022-10-17 18:43             ` Stanislav Fomichev
  2022-10-17 18:47               ` Yosry Ahmed
  2022-10-17 20:13               ` Yonghong Song
  2022-10-17 20:10             ` Yonghong Song
  1 sibling, 2 replies; 38+ messages in thread
From: Stanislav Fomichev @ 2022-10-17 18:43 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Yonghong Song, bpf, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo

On Mon, Oct 17, 2022 at 11:26 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Mon, Oct 17, 2022 at 11:02 AM <sdf@google.com> wrote:
> >
> > On 10/13, Yonghong Song wrote:
> > > Similar to sk/inode/task storage, implement similar cgroup local storage.
> > >
> > > There already exists a local storage implementation for cgroup-attached
> > > bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper
> > > bpf_get_local_storage(). But there are use cases such that non-cgroup
> > > attached bpf progs wants to access cgroup local storage data. For example,
> > > tc egress prog has access to sk and cgroup. It is possible to use
> > > sk local storage to emulate cgroup local storage by storing data in socket.
> > > But this is a waste as it could be lots of sockets belonging to a particular
> > > cgroup. Alternatively, a separate map can be created with cgroup id as the key.
> > > But this will introduce additional overhead to manipulate the new map.
> > > A cgroup local storage, similar to existing sk/inode/task storage,
> > > should help for this use case.
> > >
> > > The life-cycle of storage is managed with the life-cycle of the
> > > cgroup struct. i.e. the storage is destroyed along with the owning cgroup
> > > with a callback to the bpf_cgroup_storage_free when cgroup itself
> > > is deleted.
> > >
> > > The userspace map operations can be done by using a cgroup fd as a key
> > > passed to the lookup, update and delete operations.
> >
> > [..]
> >
> > > Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old cgroup local
> > > storage support, the new map name BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE is used
> > > for cgroup storage available to non-cgroup-attached bpf programs. The two
> > > helpers are named as bpf_cgroup_local_storage_get() and
> > > bpf_cgroup_local_storage_delete().
> >
> > Have you considered doing something similar to 7d9c3427894f ("bpf: Make
> > cgroup storages shared between programs on the same cgroup") where
> > the map changes its behavior depending on the key size (see key_size checks
> > in cgroup_storage_map_alloc)? Looks like sizeof(int) for fd still
> > can be used so we can, in theory, reuse the name..
> >
> > Pros:
> > - no need for a new map name
> >
> > Cons:
> > - existing BPF_MAP_TYPE_CGROUP_STORAGE is already messy; might be not a
> >   good idea to add more stuff to it?
> >
> > But, for the very least, should we also extend
> > Documentation/bpf/map_cgroup_storage.rst to cover the new map? We've
> > tried to keep some of the important details in there..
>
> This might be a long shot, but is it possible to switch completely to
> this new generic cgroup storage, and for programs that attach to
> cgroups we can still do lookups/allocations during attachment like we
> do today? IOW, maintain the current API for cgroup progs but switch it
> to use this new map type instead.
>
> It feels like this map type is more generic and can be a superset of
> the existing cgroup storage, but I feel like I am missing something.

I feel like the biggest issue is that the existing bpf_get_local_storage
helper is guaranteed to always return non-null and the verifier doesn't
require the programs to do null checks on it; the new helper might
return NULL making all existing programs fail the verifier.
There might be something else I don't remember at this point (besides that weird per-prog_type that we'd have to emulate as well).. > > > > > Signed-off-by: Yonghong Song <yhs@fb.com> > > > --- > > > include/linux/bpf.h | 3 + > > > include/linux/bpf_types.h | 1 + > > > include/linux/cgroup-defs.h | 4 + > > > include/uapi/linux/bpf.h | 39 +++++ > > > kernel/bpf/Makefile | 2 +- > > > kernel/bpf/bpf_cgroup_storage.c | 280 ++++++++++++++++++++++++++++++++ > > > kernel/bpf/helpers.c | 6 + > > > kernel/bpf/syscall.c | 3 +- > > > kernel/bpf/verifier.c | 14 +- > > > kernel/cgroup/cgroup.c | 4 + > > > kernel/trace/bpf_trace.c | 4 + > > > scripts/bpf_doc.py | 2 + > > > tools/include/uapi/linux/bpf.h | 39 +++++ > > > 13 files changed, 398 insertions(+), 3 deletions(-) > > > create mode 100644 kernel/bpf/bpf_cgroup_storage.c > > > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h > > > index 9e7d46d16032..1395a01c7f18 100644 > > > --- a/include/linux/bpf.h > > > +++ b/include/linux/bpf.h > > > @@ -2045,6 +2045,7 @@ struct bpf_link *bpf_link_by_id(u32 id); > > > > > const struct bpf_func_proto *bpf_base_func_proto(enum bpf_func_id > > > func_id); > > > void bpf_task_storage_free(struct task_struct *task); > > > +void bpf_local_cgroup_storage_free(struct cgroup *cgroup); > > > bool bpf_prog_has_kfunc_call(const struct bpf_prog *prog); > > > const struct btf_func_model * > > > bpf_jit_find_kfunc_model(const struct bpf_prog *prog, > > > @@ -2537,6 +2538,8 @@ extern const struct bpf_func_proto > > > bpf_copy_from_user_task_proto; > > > extern const struct bpf_func_proto bpf_set_retval_proto; > > > extern const struct bpf_func_proto bpf_get_retval_proto; > > > extern const struct bpf_func_proto bpf_user_ringbuf_drain_proto; > > > +extern const struct bpf_func_proto bpf_cgroup_storage_get_proto; > > > +extern const struct bpf_func_proto bpf_cgroup_storage_delete_proto; > > > > > const struct bpf_func_proto *tracing_prog_func_proto( > > > enum bpf_func_id func_id, const 
struct bpf_prog *prog); > > > diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h > > > index 2c6a4f2562a7..7a0362d7a0aa 100644 > > > --- a/include/linux/bpf_types.h > > > +++ b/include/linux/bpf_types.h > > > @@ -90,6 +90,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_ARRAY, > > > cgroup_array_map_ops) > > > #ifdef CONFIG_CGROUP_BPF > > > BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_STORAGE, cgroup_storage_map_ops) > > > BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE, cgroup_storage_map_ops) > > > +BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, > > > cgroup_local_storage_map_ops) > > > #endif > > > BPF_MAP_TYPE(BPF_MAP_TYPE_HASH, htab_map_ops) > > > BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_HASH, htab_percpu_map_ops) > > > diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h > > > index 4bcf56b3491c..c6f4590dda68 100644 > > > --- a/include/linux/cgroup-defs.h > > > +++ b/include/linux/cgroup-defs.h > > > @@ -504,6 +504,10 @@ struct cgroup { > > > /* Used to store internal freezer state */ > > > struct cgroup_freezer_state freezer; > > > > > +#ifdef CONFIG_BPF_SYSCALL > > > + struct bpf_local_storage __rcu *bpf_cgroup_storage; > > > +#endif > > > + > > > /* ids of the ancestors at each level including self */ > > > u64 ancestor_ids[]; > > > }; > > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > > > index 17f61338f8f8..d918b4054297 100644 > > > --- a/include/uapi/linux/bpf.h > > > +++ b/include/uapi/linux/bpf.h > > > @@ -935,6 +935,7 @@ enum bpf_map_type { > > > BPF_MAP_TYPE_TASK_STORAGE, > > > BPF_MAP_TYPE_BLOOM_FILTER, > > > BPF_MAP_TYPE_USER_RINGBUF, > > > + BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, > > > }; > > > > > /* Note that tracing related programs such as > > > @@ -5435,6 +5436,42 @@ union bpf_attr { > > > * **-E2BIG** if user-space has tried to publish a sample which is > > > * larger than the size of the ring buffer, or which cannot fit > > > * within a struct bpf_dynptr. 
> > > + * > > > + * void *bpf_cgroup_local_storage_get(struct bpf_map *map, struct cgroup > > > *cgroup, void *value, u64 flags) > > > + * Description > > > + * Get a bpf_local_storage from the *cgroup*. > > > + * > > > + * Logically, it could be thought of as getting the value from > > > + * a *map* with *cgroup* as the **key**. From this > > > + * perspective, the usage is not much different from > > > + * **bpf_map_lookup_elem**\ (*map*, **&**\ *cgroup*) except this > > > + * helper enforces the key must be a cgroup struct and the map must also > > > + * be a **BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE**. > > > + * > > > + * Underneath, the value is stored locally at *cgroup* instead of > > > + * the *map*. The *map* is used as the bpf-local-storage > > > + * "type". The bpf-local-storage "type" (i.e. the *map*) is > > > + * searched against all bpf_local_storage residing at *cgroup*. > > > + * > > > + * An optional *flags* (**BPF_LOCAL_STORAGE_GET_F_CREATE**) can be > > > + * used such that a new bpf_local_storage will be > > > + * created if one does not exist. *value* can be used > > > + * together with **BPF_LOCAL_STORAGE_GET_F_CREATE** to specify > > > + * the initial value of a bpf_local_storage. If *value* is > > > + * **NULL**, the new bpf_local_storage will be zero initialized. > > > + * Return > > > + * A bpf_local_storage pointer is returned on success. > > > + * > > > + * **NULL** if not found or there was an error in adding > > > + * a new bpf_local_storage. > > > + * > > > + * long bpf_cgroup_local_storage_delete(struct bpf_map *map, struct > > > cgroup *cgroup) > > > + * Description > > > + * Delete a bpf_local_storage from a *cgroup*. > > > + * Return > > > + * 0 on success. > > > + * > > > + * **-ENOENT** if the bpf_local_storage cannot be found. > > > */ > > > #define ___BPF_FUNC_MAPPER(FN, ctx...) 
\ > > > FN(unspec, 0, ##ctx) \ > > > @@ -5647,6 +5684,8 @@ union bpf_attr { > > > FN(tcp_raw_check_syncookie_ipv6, 207, ##ctx) \ > > > FN(ktime_get_tai_ns, 208, ##ctx) \ > > > FN(user_ringbuf_drain, 209, ##ctx) \ > > > + FN(cgroup_local_storage_get, 210, ##ctx) \ > > > + FN(cgroup_local_storage_delete, 211, ##ctx) \ > > > /* */ > > > > > /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that > > > don't > > > diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile > > > index 341c94f208f4..b02693f51978 100644 > > > --- a/kernel/bpf/Makefile > > > +++ b/kernel/bpf/Makefile > > > @@ -25,7 +25,7 @@ ifeq ($(CONFIG_PERF_EVENTS),y) > > > obj-$(CONFIG_BPF_SYSCALL) += stackmap.o > > > endif > > > ifeq ($(CONFIG_CGROUPS),y) > > > -obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o > > > +obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o bpf_cgroup_storage.o > > > endif > > > obj-$(CONFIG_CGROUP_BPF) += cgroup.o > > > ifeq ($(CONFIG_INET),y) > > > diff --git a/kernel/bpf/bpf_cgroup_storage.c > > > b/kernel/bpf/bpf_cgroup_storage.c > > > new file mode 100644 > > > index 000000000000..9974784822da > > > --- /dev/null > > > +++ b/kernel/bpf/bpf_cgroup_storage.c > > > @@ -0,0 +1,280 @@ > > > +// SPDX-License-Identifier: GPL-2.0 > > > +/* > > > + * Copyright (c) 2022 Meta Platforms, Inc. and affiliates. 
> > > + */ > > > + > > > +#include <linux/types.h> > > > +#include <linux/bpf.h> > > > +#include <linux/bpf_local_storage.h> > > > +#include <uapi/linux/btf.h> > > > +#include <linux/btf_ids.h> > > > + > > > +DEFINE_BPF_STORAGE_CACHE(cgroup_cache); > > > + > > > +static DEFINE_PER_CPU(int, bpf_cgroup_storage_busy); > > > + > > > +static void bpf_cgroup_storage_lock(void) > > > +{ > > > + migrate_disable(); > > > + this_cpu_inc(bpf_cgroup_storage_busy); > > > +} > > > + > > > +static void bpf_cgroup_storage_unlock(void) > > > +{ > > > + this_cpu_dec(bpf_cgroup_storage_busy); > > > + migrate_enable(); > > > +} > > > + > > > +static bool bpf_cgroup_storage_trylock(void) > > > +{ > > > + migrate_disable(); > > > + if (unlikely(this_cpu_inc_return(bpf_cgroup_storage_busy) != 1)) { > > > + this_cpu_dec(bpf_cgroup_storage_busy); > > > + migrate_enable(); > > > + return false; > > > + } > > > + return true; > > > +} > > > > Task storage has lock/unlock/trylock; inode storage doesn't; why does > > cgroup need it as well? > > > > > +static struct bpf_local_storage __rcu **cgroup_storage_ptr(void *owner) > > > +{ > > > + struct cgroup *cg = owner; > > > + > > > + return &cg->bpf_cgroup_storage; > > > +} > > > + > > > +void bpf_local_cgroup_storage_free(struct cgroup *cgroup) > > > +{ > > > + struct bpf_local_storage *local_storage; > > > + struct bpf_local_storage_elem *selem; > > > + bool free_cgroup_storage = false; > > > + struct hlist_node *n; > > > + unsigned long flags; > > > + > > > + rcu_read_lock(); > > > + local_storage = rcu_dereference(cgroup->bpf_cgroup_storage); > > > + if (!local_storage) { > > > + rcu_read_unlock(); > > > + return; > > > + } > > > + > > > + /* Neither the bpf_prog nor the bpf-map's syscall > > > + * could be modifying the local_storage->list now. > > > + * Thus, no elem can be added-to or deleted-from the > > > + * local_storage->list by the bpf_prog or by the bpf-map's syscall. 
> > > + * > > > + * It is racing with bpf_local_storage_map_free() alone > > > + * when unlinking elem from the local_storage->list and > > > + * the map's bucket->list. > > > + */ > > > + bpf_cgroup_storage_lock(); > > > + raw_spin_lock_irqsave(&local_storage->lock, flags); > > > + hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) { > > > + bpf_selem_unlink_map(selem); > > > + free_cgroup_storage = > > > + bpf_selem_unlink_storage_nolock(local_storage, selem, false, false); > > > + } > > > + raw_spin_unlock_irqrestore(&local_storage->lock, flags); > > > + bpf_cgroup_storage_unlock(); > > > + rcu_read_unlock(); > > > + > > > + /* free_cgroup_storage should always be true as long as > > > + * local_storage->list was non-empty. > > > + */ > > > + if (free_cgroup_storage) > > > + kfree_rcu(local_storage, rcu); > > > +} > > > > > +static struct bpf_local_storage_data * > > > +cgroup_storage_lookup(struct cgroup *cgroup, struct bpf_map *map, bool > > > cacheit_lockit) > > > +{ > > > + struct bpf_local_storage *cgroup_storage; > > > + struct bpf_local_storage_map *smap; > > > + > > > + cgroup_storage = rcu_dereference_check(cgroup->bpf_cgroup_storage, > > > + bpf_rcu_lock_held()); > > > + if (!cgroup_storage) > > > + return NULL; > > > + > > > + smap = (struct bpf_local_storage_map *)map; > > > + return bpf_local_storage_lookup(cgroup_storage, smap, cacheit_lockit); > > > +} > > > + > > > +static void *bpf_cgroup_storage_lookup_elem(struct bpf_map *map, void > > > *key) > > > +{ > > > + struct bpf_local_storage_data *sdata; > > > + struct cgroup *cgroup; > > > + int fd; > > > + > > > + fd = *(int *)key; > > > + cgroup = cgroup_get_from_fd(fd); > > > + if (IS_ERR(cgroup)) > > > + return ERR_CAST(cgroup); > > > + > > > + bpf_cgroup_storage_lock(); > > > + sdata = cgroup_storage_lookup(cgroup, map, true); > > > + bpf_cgroup_storage_unlock(); > > > + cgroup_put(cgroup); > > > + return sdata ? 
sdata->data : NULL; > > > +} > > > > A lot of the above (free/lookup) seems to be copy-pasted from the task > > storage; > > any point in trying to generalize the common parts? > > > > > +static int bpf_cgroup_storage_update_elem(struct bpf_map *map, void *key, > > > + void *value, u64 map_flags) > > > +{ > > > + struct bpf_local_storage_data *sdata; > > > + struct cgroup *cgroup; > > > + int err, fd; > > > + > > > + fd = *(int *)key; > > > + cgroup = cgroup_get_from_fd(fd); > > > + if (IS_ERR(cgroup)) > > > + return PTR_ERR(cgroup); > > > + > > > + bpf_cgroup_storage_lock(); > > > + sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map > > > *)map, > > > + value, map_flags, GFP_ATOMIC); > > > + bpf_cgroup_storage_unlock(); > > > + err = PTR_ERR_OR_ZERO(sdata); > > > + cgroup_put(cgroup); > > > + return err; > > > +} > > > + > > > +static int cgroup_storage_delete(struct cgroup *cgroup, struct bpf_map > > > *map) > > > +{ > > > + struct bpf_local_storage_data *sdata; > > > + > > > + sdata = cgroup_storage_lookup(cgroup, map, false); > > > + if (!sdata) > > > + return -ENOENT; > > > + > > > + bpf_selem_unlink(SELEM(sdata), true); > > > + return 0; > > > +} > > > + > > > +static int bpf_cgroup_storage_delete_elem(struct bpf_map *map, void *key) > > > +{ > > > + struct cgroup *cgroup; > > > + int err, fd; > > > + > > > + fd = *(int *)key; > > > + cgroup = cgroup_get_from_fd(fd); > > > + if (IS_ERR(cgroup)) > > > + return PTR_ERR(cgroup); > > > + > > > + bpf_cgroup_storage_lock(); > > > + err = cgroup_storage_delete(cgroup, map); > > > + bpf_cgroup_storage_unlock(); > > > + if (err) > > > + return err; > > > + > > > + cgroup_put(cgroup); > > > + return 0; > > > +} > > > + > > > +static int notsupp_get_next_key(struct bpf_map *map, void *key, void > > > *next_key) > > > +{ > > > + return -ENOTSUPP; > > > +} > > > + > > > +static struct bpf_map *cgroup_storage_map_alloc(union bpf_attr *attr) > > > +{ > > > + struct bpf_local_storage_map *smap; > > > + > 
> > + smap = bpf_local_storage_map_alloc(attr); > > > + if (IS_ERR(smap)) > > > + return ERR_CAST(smap); > > > + > > > + smap->cache_idx = bpf_local_storage_cache_idx_get(&cgroup_cache); > > > + return &smap->map; > > > +} > > > + > > > +static void cgroup_storage_map_free(struct bpf_map *map) > > > +{ > > > + struct bpf_local_storage_map *smap; > > > + > > > + smap = (struct bpf_local_storage_map *)map; > > > + bpf_local_storage_cache_idx_free(&cgroup_cache, smap->cache_idx); > > > + bpf_local_storage_map_free(smap, NULL); > > > +} > > > + > > > +/* *gfp_flags* is a hidden argument provided by the verifier */ > > > +BPF_CALL_5(bpf_cgroup_storage_get, struct bpf_map *, map, struct cgroup > > > *, cgroup, > > > + void *, value, u64, flags, gfp_t, gfp_flags) > > > +{ > > > + struct bpf_local_storage_data *sdata; > > > + > > > + WARN_ON_ONCE(!bpf_rcu_lock_held()); > > > + if (flags & ~(BPF_LOCAL_STORAGE_GET_F_CREATE)) > > > + return (unsigned long)NULL; > > > + > > > + if (!cgroup) > > > + return (unsigned long)NULL; > > > + > > > + if (!bpf_cgroup_storage_trylock()) > > > + return (unsigned long)NULL; > > > + > > > + sdata = cgroup_storage_lookup(cgroup, map, true); > > > + if (sdata) > > > + goto unlock; > > > + > > > + /* only allocate new storage, when the cgroup is refcounted */ > > > + if (!percpu_ref_is_dying(&cgroup->self.refcnt) && > > > + (flags & BPF_LOCAL_STORAGE_GET_F_CREATE)) > > > + sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map > > > *)map, > > > + value, BPF_NOEXIST, gfp_flags); > > > + > > > +unlock: > > > + bpf_cgroup_storage_unlock(); > > > + return IS_ERR_OR_NULL(sdata) ? 
(unsigned long)NULL : (unsigned > > > long)sdata->data; > > > +} > > > + > > > +BPF_CALL_2(bpf_cgroup_storage_delete, struct bpf_map *, map, struct > > > cgroup *, cgroup) > > > +{ > > > + int ret; > > > + > > > + WARN_ON_ONCE(!bpf_rcu_lock_held()); > > > + if (!cgroup) > > > + return -EINVAL; > > > + > > > + if (!bpf_cgroup_storage_trylock()) > > > + return -EBUSY; > > > + > > > + ret = cgroup_storage_delete(cgroup, map); > > > + bpf_cgroup_storage_unlock(); > > > + return ret; > > > +} > > > + > > > +BTF_ID_LIST_SINGLE(cgroup_storage_map_btf_ids, struct, > > > bpf_local_storage_map) > > > +const struct bpf_map_ops cgroup_local_storage_map_ops = { > > > + .map_meta_equal = bpf_map_meta_equal, > > > + .map_alloc_check = bpf_local_storage_map_alloc_check, > > > + .map_alloc = cgroup_storage_map_alloc, > > > + .map_free = cgroup_storage_map_free, > > > + .map_get_next_key = notsupp_get_next_key, > > > + .map_lookup_elem = bpf_cgroup_storage_lookup_elem, > > > + .map_update_elem = bpf_cgroup_storage_update_elem, > > > + .map_delete_elem = bpf_cgroup_storage_delete_elem, > > > + .map_check_btf = bpf_local_storage_map_check_btf, > > > + .map_btf_id = &cgroup_storage_map_btf_ids[0], > > > + .map_owner_storage_ptr = cgroup_storage_ptr, > > > +}; > > > + > > > +const struct bpf_func_proto bpf_cgroup_storage_get_proto = { > > > + .func = bpf_cgroup_storage_get, > > > + .gpl_only = false, > > > + .ret_type = RET_PTR_TO_MAP_VALUE_OR_NULL, > > > + .arg1_type = ARG_CONST_MAP_PTR, > > > + .arg2_type = ARG_PTR_TO_BTF_ID, > > > + .arg2_btf_id = &bpf_cgroup_btf_id[0], > > > + .arg3_type = ARG_PTR_TO_MAP_VALUE_OR_NULL, > > > + .arg4_type = ARG_ANYTHING, > > > +}; > > > + > > > +const struct bpf_func_proto bpf_cgroup_storage_delete_proto = { > > > + .func = bpf_cgroup_storage_delete, > > > + .gpl_only = false, > > > + .ret_type = RET_INTEGER, > > > + .arg1_type = ARG_CONST_MAP_PTR, > > > + .arg2_type = ARG_PTR_TO_BTF_ID, > > > + .arg2_btf_id = &bpf_cgroup_btf_id[0], > > > +}; > > > 
> > > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> > > index a6b04faed282..5c5bb08832ec 100644
> > > --- a/kernel/bpf/helpers.c
> > > +++ b/kernel/bpf/helpers.c
> > > @@ -1663,6 +1663,12 @@ bpf_base_func_proto(enum bpf_func_id func_id)
> > > 		return &bpf_dynptr_write_proto;
> > > 	case BPF_FUNC_dynptr_data:
> > > 		return &bpf_dynptr_data_proto;
> > > +#ifdef CONFIG_CGROUPS
> > > +	case BPF_FUNC_cgroup_local_storage_get:
> > > +		return &bpf_cgroup_storage_get_proto;
> > > +	case BPF_FUNC_cgroup_local_storage_delete:
> > > +		return &bpf_cgroup_storage_delete_proto;
> > > +#endif
> > > 	default:
> > > 		break;
> > > 	}
> > > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > > index 7b373a5e861f..e53c7fae6e22 100644
> > > --- a/kernel/bpf/syscall.c
> > > +++ b/kernel/bpf/syscall.c
> > > @@ -1016,7 +1016,8 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
> > > 	    map->map_type != BPF_MAP_TYPE_CGROUP_STORAGE &&
> > > 	    map->map_type != BPF_MAP_TYPE_SK_STORAGE &&
> > > 	    map->map_type != BPF_MAP_TYPE_INODE_STORAGE &&
> > > -	    map->map_type != BPF_MAP_TYPE_TASK_STORAGE)
> > > +	    map->map_type != BPF_MAP_TYPE_TASK_STORAGE &&
> > > +	    map->map_type != BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE)
> > > 		return -ENOTSUPP;
> > > 	if (map->spin_lock_off + sizeof(struct bpf_spin_lock) > map->value_size) {
> > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > > index 6f6d2d511c06..f36f6a3c0d50 100644
> > > --- a/kernel/bpf/verifier.c
> > > +++ b/kernel/bpf/verifier.c
> > > @@ -6360,6 +6360,11 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
> > > 		    func_id != BPF_FUNC_task_storage_delete)
> > > 			goto error;
> > > 		break;
> > > +	case BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE:
> > > +		if (func_id != BPF_FUNC_cgroup_local_storage_get &&
> > > +		    func_id != BPF_FUNC_cgroup_local_storage_delete)
> > > +			goto error;
> > > +		break;
> > > 	case BPF_MAP_TYPE_BLOOM_FILTER:
> > > 		if (func_id != BPF_FUNC_map_peek_elem &&
> > > 		    func_id != BPF_FUNC_map_push_elem)
> > > @@ -6472,6 +6477,11 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
> > > 		if (map->map_type != BPF_MAP_TYPE_TASK_STORAGE)
> > > 			goto error;
> > > 		break;
> > > +	case BPF_FUNC_cgroup_local_storage_get:
> > > +	case BPF_FUNC_cgroup_local_storage_delete:
> > > +		if (map->map_type != BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE)
> > > +			goto error;
> > > +		break;
> > > 	default:
> > > 		break;
> > > 	}
> > > @@ -12713,6 +12723,7 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
> > > 	case BPF_MAP_TYPE_INODE_STORAGE:
> > > 	case BPF_MAP_TYPE_SK_STORAGE:
> > > 	case BPF_MAP_TYPE_TASK_STORAGE:
> > > +	case BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE:
> > > 		break;
> > > 	default:
> > > 		verbose(env,
> > > @@ -14149,7 +14160,8 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
> > >
> > > 		if (insn->imm == BPF_FUNC_task_storage_get ||
> > > 		    insn->imm == BPF_FUNC_sk_storage_get ||
> > > -		    insn->imm == BPF_FUNC_inode_storage_get) {
> > > +		    insn->imm == BPF_FUNC_inode_storage_get ||
> > > +		    insn->imm == BPF_FUNC_cgroup_local_storage_get) {
> > > 			if (env->prog->aux->sleepable)
> > > 				insn_buf[0] = BPF_MOV64_IMM(BPF_REG_5, (__force __s32)GFP_KERNEL);
> > > 			else
> > > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> > > index 8ad2c267ff47..2fa2c950c7fb 100644
> > > --- a/kernel/cgroup/cgroup.c
> > > +++ b/kernel/cgroup/cgroup.c
> > > @@ -985,6 +985,10 @@ void put_css_set_locked(struct css_set *cset)
> > > 		put_css_set_locked(cset->dom_cset);
> > > 	}
> > >
> > > +#ifdef CONFIG_BPF_SYSCALL
> > > +	bpf_local_cgroup_storage_free(cset->dfl_cgrp);
> > > +#endif
> > > +

I am confused about this freeing site. It seems like this path is for
freeing the css_sets of task_structs, not for freeing the cgroup itself.
Wouldn't we want to free the local storage when we free the cgroup
itself, somewhere like css_free_rwork_fn()? Or did I completely miss
the point here?
> > > > kfree_rcu(cset, rcu_head); > > > } > > > > > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c > > > index 688552df95ca..179adaae4a9f 100644 > > > --- a/kernel/trace/bpf_trace.c > > > +++ b/kernel/trace/bpf_trace.c > > > @@ -1454,6 +1454,10 @@ bpf_tracing_func_proto(enum bpf_func_id func_id, > > > const struct bpf_prog *prog) > > > return &bpf_get_current_cgroup_id_proto; > > > case BPF_FUNC_get_current_ancestor_cgroup_id: > > > return &bpf_get_current_ancestor_cgroup_id_proto; > > > + case BPF_FUNC_cgroup_local_storage_get: > > > + return &bpf_cgroup_storage_get_proto; > > > + case BPF_FUNC_cgroup_local_storage_delete: > > > + return &bpf_cgroup_storage_delete_proto; > > > #endif > > > case BPF_FUNC_send_signal: > > > return &bpf_send_signal_proto; > > > diff --git a/scripts/bpf_doc.py b/scripts/bpf_doc.py > > > index c0e6690be82a..fdb0aff8cb5a 100755 > > > --- a/scripts/bpf_doc.py > > > +++ b/scripts/bpf_doc.py > > > @@ -685,6 +685,7 @@ class PrinterHelpers(Printer): > > > 'struct udp6_sock', > > > 'struct unix_sock', > > > 'struct task_struct', > > > + 'struct cgroup', > > > > > 'struct __sk_buff', > > > 'struct sk_msg_md', > > > @@ -742,6 +743,7 @@ class PrinterHelpers(Printer): > > > 'struct udp6_sock', > > > 'struct unix_sock', > > > 'struct task_struct', > > > + 'struct cgroup', > > > 'struct path', > > > 'struct btf_ptr', > > > 'struct inode', > > > diff --git a/tools/include/uapi/linux/bpf.h > > > b/tools/include/uapi/linux/bpf.h > > > index 17f61338f8f8..d918b4054297 100644 > > > --- a/tools/include/uapi/linux/bpf.h > > > +++ b/tools/include/uapi/linux/bpf.h > > > @@ -935,6 +935,7 @@ enum bpf_map_type { > > > BPF_MAP_TYPE_TASK_STORAGE, > > > BPF_MAP_TYPE_BLOOM_FILTER, > > > BPF_MAP_TYPE_USER_RINGBUF, > > > + BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, > > > }; > > > > > /* Note that tracing related programs such as > > > @@ -5435,6 +5436,42 @@ union bpf_attr { > > > * **-E2BIG** if user-space has tried to publish a sample which is > > > * 
larger than the size of the ring buffer, or which cannot fit > > > * within a struct bpf_dynptr. > > > + * > > > + * void *bpf_cgroup_local_storage_get(struct bpf_map *map, struct cgroup > > > *cgroup, void *value, u64 flags) > > > + * Description > > > + * Get a bpf_local_storage from the *cgroup*. > > > + * > > > + * Logically, it could be thought of as getting the value from > > > + * a *map* with *cgroup* as the **key**. From this > > > + * perspective, the usage is not much different from > > > + * **bpf_map_lookup_elem**\ (*map*, **&**\ *cgroup*) except this > > > + * helper enforces the key must be a cgroup struct and the map must also > > > + * be a **BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE**. > > > + * > > > + * Underneath, the value is stored locally at *cgroup* instead of > > > + * the *map*. The *map* is used as the bpf-local-storage > > > + * "type". The bpf-local-storage "type" (i.e. the *map*) is > > > + * searched against all bpf_local_storage residing at *cgroup*. > > > + * > > > + * An optional *flags* (**BPF_LOCAL_STORAGE_GET_F_CREATE**) can be > > > + * used such that a new bpf_local_storage will be > > > + * created if one does not exist. *value* can be used > > > + * together with **BPF_LOCAL_STORAGE_GET_F_CREATE** to specify > > > + * the initial value of a bpf_local_storage. If *value* is > > > + * **NULL**, the new bpf_local_storage will be zero initialized. > > > + * Return > > > + * A bpf_local_storage pointer is returned on success. > > > + * > > > + * **NULL** if not found or there was an error in adding > > > + * a new bpf_local_storage. > > > + * > > > + * long bpf_cgroup_local_storage_delete(struct bpf_map *map, struct > > > cgroup *cgroup) > > > + * Description > > > + * Delete a bpf_local_storage from a *cgroup*. > > > + * Return > > > + * 0 on success. > > > + * > > > + * **-ENOENT** if the bpf_local_storage cannot be found. > > > */ > > > #define ___BPF_FUNC_MAPPER(FN, ctx...) 
\ > > > FN(unspec, 0, ##ctx) \ > > > @@ -5647,6 +5684,8 @@ union bpf_attr { > > > FN(tcp_raw_check_syncookie_ipv6, 207, ##ctx) \ > > > FN(ktime_get_tai_ns, 208, ##ctx) \ > > > FN(user_ringbuf_drain, 209, ##ctx) \ > > > + FN(cgroup_local_storage_get, 210, ##ctx) \ > > > + FN(cgroup_local_storage_delete, 211, ##ctx) \ > > > /* */ > > > > > /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that > > > don't > > > -- > > > 2.30.2 > > ^ permalink raw reply [flat|nested] 38+ messages in thread
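[Editor's note] For readers skimming the thread, the intended BPF-program-side usage of the two helpers above would look roughly like the sketch below. This is a pseudocode-level illustration against the UAPI *as proposed in this series* (the helper and map-type names here follow the patch and may not match what was eventually merged); the map declaration style follows the usual BTF-defined local-storage convention, and the `task->cgroups->dfl_cgrp` access path is an assumption about how a tracing program would reach the current cgroup.

```c
/* Sketch only: assumes the proposed BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE
 * and bpf_cgroup_local_storage_get()/_delete() from this series. */
struct {
	__uint(type, BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__type(key, int);
	__type(value, long);
} cgrp_map SEC(".maps");

SEC("tp_btf/sys_enter")
int count_per_cgroup(void *ctx)
{
	struct task_struct *task = bpf_get_current_task_btf();
	long *counter;

	counter = bpf_cgroup_local_storage_get(&cgrp_map,
					       task->cgroups->dfl_cgrp, 0,
					       BPF_LOCAL_STORAGE_GET_F_CREATE);
	if (!counter)	/* unlike bpf_get_local_storage(), NULL is possible */
		return 0;

	__sync_fetch_and_add(counter, 1);
	return 0;
}
```

From user space, the same storage would be read with bpf_map_lookup_elem() using a cgroup fd as the key, per the commit message.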
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs
  2022-10-17 18:43           ` Stanislav Fomichev
@ 2022-10-17 18:47             ` Yosry Ahmed
  2022-10-17 19:07               ` Stanislav Fomichev
  2022-10-17 20:15               ` Yonghong Song
  2022-10-17 20:13             ` Yonghong Song
  1 sibling, 2 replies; 38+ messages in thread
From: Yosry Ahmed @ 2022-10-17 18:47 UTC (permalink / raw)
To: Stanislav Fomichev
Cc: Yonghong Song, bpf, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo

On Mon, Oct 17, 2022 at 11:43 AM Stanislav Fomichev <sdf@google.com> wrote:
>
> On Mon, Oct 17, 2022 at 11:26 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Mon, Oct 17, 2022 at 11:02 AM <sdf@google.com> wrote:
> > >
> > > On 10/13, Yonghong Song wrote:
> > > > Similar to sk/inode/task storage, implement similar cgroup local storage.
> > > >
> > > > There already exists a local storage implementation for cgroup-attached
> > > > bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper
> > > > bpf_get_local_storage(). But there are use cases such that non-cgroup
> > > > attached bpf progs wants to access cgroup local storage data. For example,
> > > > tc egress prog has access to sk and cgroup. It is possible to use
> > > > sk local storage to emulate cgroup local storage by storing data in socket.
> > > > But this is a waste as it could be lots of sockets belonging to a particular
> > > > cgroup. Alternatively, a separate map can be created with cgroup id as the key.
> > > > But this will introduce additional overhead to manipulate the new map.
> > > > A cgroup local storage, similar to existing sk/inode/task storage,
> > > > should help for this use case.
> > > >
> > > > The life-cycle of storage is managed with the life-cycle of the
> > > > cgroup struct. i.e. the storage is destroyed along with the owning cgroup
> > > > with a callback to the bpf_cgroup_storage_free when cgroup itself
> > > > is deleted.
> > > >
> > > > The userspace map operations can be done by using a cgroup fd as a key
> > > > passed to the lookup, update and delete operations.
> > >
> > > [..]
> > >
> > > > Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old cgroup local
> > > > storage support, the new map name BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE is used
> > > > for cgroup storage available to non-cgroup-attached bpf programs. The two
> > > > helpers are named as bpf_cgroup_local_storage_get() and
> > > > bpf_cgroup_local_storage_delete().
> > >
> > > Have you considered doing something similar to 7d9c3427894f ("bpf: Make
> > > cgroup storages shared between programs on the same cgroup") where
> > > the map changes its behavior depending on the key size (see key_size checks
> > > in cgroup_storage_map_alloc)? Looks like sizeof(int) for fd still
> > > can be used so we can, in theory, reuse the name..
> > >
> > > Pros:
> > > - no need for a new map name
> > >
> > > Cons:
> > > - existing BPF_MAP_TYPE_CGROUP_STORAGE is already messy; might be not a
> > >   good idea to add more stuff to it?
> > >
> > > But, for the very least, should we also extend
> > > Documentation/bpf/map_cgroup_storage.rst to cover the new map? We've
> > > tried to keep some of the important details in there..
> >
> > This might be a long shot, but is it possible to switch completely to
> > this new generic cgroup storage, and for programs that attach to
> > cgroups we can still do lookups/allocations during attachment like we
> > do today? IOW, maintain the current API for cgroup progs but switch it
> > to use this new map type instead.
> >
> > It feels like this map type is more generic and can be a superset of
> > the existing cgroup storage, but I feel like I am missing something.
>
> I feel like the biggest issue is that the existing
> bpf_get_local_storage helper is guaranteed to always return non-null
> and the verifier doesn't require the programs to do null checks on it;
> the new helper might return NULL, making all existing programs fail the
> verifier.

What I meant is, keep the old bpf_get_local_storage helper only for
cgroup-attached programs like we have today, and add a new generic
bpf_cgroup_local_storage_get() helper. For cgroup-attached programs,
make sure a cgroup storage entry is allocated and hooked to the helper
at program attach time, to keep today's behavior constant. For other
programs, bpf_cgroup_local_storage_get() will do the normal lookup and
allocate if necessary. Does this make any sense to you?

> There might be something else I don't remember at this point (besides
> that weird per-prog_type storage that we'd have to emulate as well)..

Yeah, there are things that will need to be emulated, but I feel like we
may end up with less confusing code (and less code in general).
> > > > > > > > Signed-off-by: Yonghong Song <yhs@fb.com> > > > > --- > > > > include/linux/bpf.h | 3 + > > > > include/linux/bpf_types.h | 1 + > > > > include/linux/cgroup-defs.h | 4 + > > > > include/uapi/linux/bpf.h | 39 +++++ > > > > kernel/bpf/Makefile | 2 +- > > > > kernel/bpf/bpf_cgroup_storage.c | 280 ++++++++++++++++++++++++++++++++ > > > > kernel/bpf/helpers.c | 6 + > > > > kernel/bpf/syscall.c | 3 +- > > > > kernel/bpf/verifier.c | 14 +- > > > > kernel/cgroup/cgroup.c | 4 + > > > > kernel/trace/bpf_trace.c | 4 + > > > > scripts/bpf_doc.py | 2 + > > > > tools/include/uapi/linux/bpf.h | 39 +++++ > > > > 13 files changed, 398 insertions(+), 3 deletions(-) > > > > create mode 100644 kernel/bpf/bpf_cgroup_storage.c > > > > > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h > > > > index 9e7d46d16032..1395a01c7f18 100644 > > > > --- a/include/linux/bpf.h > > > > +++ b/include/linux/bpf.h > > > > @@ -2045,6 +2045,7 @@ struct bpf_link *bpf_link_by_id(u32 id); > > > > > > > const struct bpf_func_proto *bpf_base_func_proto(enum bpf_func_id > > > > func_id); > > > > void bpf_task_storage_free(struct task_struct *task); > > > > +void bpf_local_cgroup_storage_free(struct cgroup *cgroup); > > > > bool bpf_prog_has_kfunc_call(const struct bpf_prog *prog); > > > > const struct btf_func_model * > > > > bpf_jit_find_kfunc_model(const struct bpf_prog *prog, > > > > @@ -2537,6 +2538,8 @@ extern const struct bpf_func_proto > > > > bpf_copy_from_user_task_proto; > > > > extern const struct bpf_func_proto bpf_set_retval_proto; > > > > extern const struct bpf_func_proto bpf_get_retval_proto; > > > > extern const struct bpf_func_proto bpf_user_ringbuf_drain_proto; > > > > +extern const struct bpf_func_proto bpf_cgroup_storage_get_proto; > > > > +extern const struct bpf_func_proto bpf_cgroup_storage_delete_proto; > > > > > > > const struct bpf_func_proto *tracing_prog_func_proto( > > > > enum bpf_func_id func_id, const struct bpf_prog *prog); > > > > diff --git 
a/include/linux/bpf_types.h b/include/linux/bpf_types.h > > > > index 2c6a4f2562a7..7a0362d7a0aa 100644 > > > > --- a/include/linux/bpf_types.h > > > > +++ b/include/linux/bpf_types.h > > > > @@ -90,6 +90,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_ARRAY, > > > > cgroup_array_map_ops) > > > > #ifdef CONFIG_CGROUP_BPF > > > > BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_STORAGE, cgroup_storage_map_ops) > > > > BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE, cgroup_storage_map_ops) > > > > +BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, > > > > cgroup_local_storage_map_ops) > > > > #endif > > > > BPF_MAP_TYPE(BPF_MAP_TYPE_HASH, htab_map_ops) > > > > BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_HASH, htab_percpu_map_ops) > > > > diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h > > > > index 4bcf56b3491c..c6f4590dda68 100644 > > > > --- a/include/linux/cgroup-defs.h > > > > +++ b/include/linux/cgroup-defs.h > > > > @@ -504,6 +504,10 @@ struct cgroup { > > > > /* Used to store internal freezer state */ > > > > struct cgroup_freezer_state freezer; > > > > > > > +#ifdef CONFIG_BPF_SYSCALL > > > > + struct bpf_local_storage __rcu *bpf_cgroup_storage; > > > > +#endif > > > > + > > > > /* ids of the ancestors at each level including self */ > > > > u64 ancestor_ids[]; > > > > }; > > > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > > > > index 17f61338f8f8..d918b4054297 100644 > > > > --- a/include/uapi/linux/bpf.h > > > > +++ b/include/uapi/linux/bpf.h > > > > @@ -935,6 +935,7 @@ enum bpf_map_type { > > > > BPF_MAP_TYPE_TASK_STORAGE, > > > > BPF_MAP_TYPE_BLOOM_FILTER, > > > > BPF_MAP_TYPE_USER_RINGBUF, > > > > + BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, > > > > }; > > > > > > > /* Note that tracing related programs such as > > > > @@ -5435,6 +5436,42 @@ union bpf_attr { > > > > * **-E2BIG** if user-space has tried to publish a sample which is > > > > * larger than the size of the ring buffer, or which cannot fit > > > > * within a struct bpf_dynptr. 
> > > > + * > > > > + * void *bpf_cgroup_local_storage_get(struct bpf_map *map, struct cgroup > > > > *cgroup, void *value, u64 flags) > > > > + * Description > > > > + * Get a bpf_local_storage from the *cgroup*. > > > > + * > > > > + * Logically, it could be thought of as getting the value from > > > > + * a *map* with *cgroup* as the **key**. From this > > > > + * perspective, the usage is not much different from > > > > + * **bpf_map_lookup_elem**\ (*map*, **&**\ *cgroup*) except this > > > > + * helper enforces the key must be a cgroup struct and the map must also > > > > + * be a **BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE**. > > > > + * > > > > + * Underneath, the value is stored locally at *cgroup* instead of > > > > + * the *map*. The *map* is used as the bpf-local-storage > > > > + * "type". The bpf-local-storage "type" (i.e. the *map*) is > > > > + * searched against all bpf_local_storage residing at *cgroup*. > > > > + * > > > > + * An optional *flags* (**BPF_LOCAL_STORAGE_GET_F_CREATE**) can be > > > > + * used such that a new bpf_local_storage will be > > > > + * created if one does not exist. *value* can be used > > > > + * together with **BPF_LOCAL_STORAGE_GET_F_CREATE** to specify > > > > + * the initial value of a bpf_local_storage. If *value* is > > > > + * **NULL**, the new bpf_local_storage will be zero initialized. > > > > + * Return > > > > + * A bpf_local_storage pointer is returned on success. > > > > + * > > > > + * **NULL** if not found or there was an error in adding > > > > + * a new bpf_local_storage. > > > > + * > > > > + * long bpf_cgroup_local_storage_delete(struct bpf_map *map, struct > > > > cgroup *cgroup) > > > > + * Description > > > > + * Delete a bpf_local_storage from a *cgroup*. > > > > + * Return > > > > + * 0 on success. > > > > + * > > > > + * **-ENOENT** if the bpf_local_storage cannot be found. > > > > */ > > > > #define ___BPF_FUNC_MAPPER(FN, ctx...) 
\ > > > > FN(unspec, 0, ##ctx) \ > > > > @@ -5647,6 +5684,8 @@ union bpf_attr { > > > > FN(tcp_raw_check_syncookie_ipv6, 207, ##ctx) \ > > > > FN(ktime_get_tai_ns, 208, ##ctx) \ > > > > FN(user_ringbuf_drain, 209, ##ctx) \ > > > > + FN(cgroup_local_storage_get, 210, ##ctx) \ > > > > + FN(cgroup_local_storage_delete, 211, ##ctx) \ > > > > /* */ > > > > > > > /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that > > > > don't > > > > diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile > > > > index 341c94f208f4..b02693f51978 100644 > > > > --- a/kernel/bpf/Makefile > > > > +++ b/kernel/bpf/Makefile > > > > @@ -25,7 +25,7 @@ ifeq ($(CONFIG_PERF_EVENTS),y) > > > > obj-$(CONFIG_BPF_SYSCALL) += stackmap.o > > > > endif > > > > ifeq ($(CONFIG_CGROUPS),y) > > > > -obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o > > > > +obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o bpf_cgroup_storage.o > > > > endif > > > > obj-$(CONFIG_CGROUP_BPF) += cgroup.o > > > > ifeq ($(CONFIG_INET),y) > > > > diff --git a/kernel/bpf/bpf_cgroup_storage.c > > > > b/kernel/bpf/bpf_cgroup_storage.c > > > > new file mode 100644 > > > > index 000000000000..9974784822da > > > > --- /dev/null > > > > +++ b/kernel/bpf/bpf_cgroup_storage.c > > > > @@ -0,0 +1,280 @@ > > > > +// SPDX-License-Identifier: GPL-2.0 > > > > +/* > > > > + * Copyright (c) 2022 Meta Platforms, Inc. and affiliates. 
> > > > + */ > > > > + > > > > +#include <linux/types.h> > > > > +#include <linux/bpf.h> > > > > +#include <linux/bpf_local_storage.h> > > > > +#include <uapi/linux/btf.h> > > > > +#include <linux/btf_ids.h> > > > > + > > > > +DEFINE_BPF_STORAGE_CACHE(cgroup_cache); > > > > + > > > > +static DEFINE_PER_CPU(int, bpf_cgroup_storage_busy); > > > > + > > > > +static void bpf_cgroup_storage_lock(void) > > > > +{ > > > > + migrate_disable(); > > > > + this_cpu_inc(bpf_cgroup_storage_busy); > > > > +} > > > > + > > > > +static void bpf_cgroup_storage_unlock(void) > > > > +{ > > > > + this_cpu_dec(bpf_cgroup_storage_busy); > > > > + migrate_enable(); > > > > +} > > > > + > > > > +static bool bpf_cgroup_storage_trylock(void) > > > > +{ > > > > + migrate_disable(); > > > > + if (unlikely(this_cpu_inc_return(bpf_cgroup_storage_busy) != 1)) { > > > > + this_cpu_dec(bpf_cgroup_storage_busy); > > > > + migrate_enable(); > > > > + return false; > > > > + } > > > > + return true; > > > > +} > > > > > > Task storage has lock/unlock/trylock; inode storage doesn't; why does > > > cgroup need it as well? > > > > > > > +static struct bpf_local_storage __rcu **cgroup_storage_ptr(void *owner) > > > > +{ > > > > + struct cgroup *cg = owner; > > > > + > > > > + return &cg->bpf_cgroup_storage; > > > > +} > > > > + > > > > +void bpf_local_cgroup_storage_free(struct cgroup *cgroup) > > > > +{ > > > > + struct bpf_local_storage *local_storage; > > > > + struct bpf_local_storage_elem *selem; > > > > + bool free_cgroup_storage = false; > > > > + struct hlist_node *n; > > > > + unsigned long flags; > > > > + > > > > + rcu_read_lock(); > > > > + local_storage = rcu_dereference(cgroup->bpf_cgroup_storage); > > > > + if (!local_storage) { > > > > + rcu_read_unlock(); > > > > + return; > > > > + } > > > > + > > > > + /* Neither the bpf_prog nor the bpf-map's syscall > > > > + * could be modifying the local_storage->list now. 
> > > > + * Thus, no elem can be added-to or deleted-from the > > > > + * local_storage->list by the bpf_prog or by the bpf-map's syscall. > > > > + * > > > > + * It is racing with bpf_local_storage_map_free() alone > > > > + * when unlinking elem from the local_storage->list and > > > > + * the map's bucket->list. > > > > + */ > > > > + bpf_cgroup_storage_lock(); > > > > + raw_spin_lock_irqsave(&local_storage->lock, flags); > > > > + hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) { > > > > + bpf_selem_unlink_map(selem); > > > > + free_cgroup_storage = > > > > + bpf_selem_unlink_storage_nolock(local_storage, selem, false, false); > > > > + } > > > > + raw_spin_unlock_irqrestore(&local_storage->lock, flags); > > > > + bpf_cgroup_storage_unlock(); > > > > + rcu_read_unlock(); > > > > + > > > > + /* free_cgroup_storage should always be true as long as > > > > + * local_storage->list was non-empty. > > > > + */ > > > > + if (free_cgroup_storage) > > > > + kfree_rcu(local_storage, rcu); > > > > +} > > > > > > > +static struct bpf_local_storage_data * > > > > +cgroup_storage_lookup(struct cgroup *cgroup, struct bpf_map *map, bool > > > > cacheit_lockit) > > > > +{ > > > > + struct bpf_local_storage *cgroup_storage; > > > > + struct bpf_local_storage_map *smap; > > > > + > > > > + cgroup_storage = rcu_dereference_check(cgroup->bpf_cgroup_storage, > > > > + bpf_rcu_lock_held()); > > > > + if (!cgroup_storage) > > > > + return NULL; > > > > + > > > > + smap = (struct bpf_local_storage_map *)map; > > > > + return bpf_local_storage_lookup(cgroup_storage, smap, cacheit_lockit); > > > > +} > > > > + > > > > +static void *bpf_cgroup_storage_lookup_elem(struct bpf_map *map, void > > > > *key) > > > > +{ > > > > + struct bpf_local_storage_data *sdata; > > > > + struct cgroup *cgroup; > > > > + int fd; > > > > + > > > > + fd = *(int *)key; > > > > + cgroup = cgroup_get_from_fd(fd); > > > > + if (IS_ERR(cgroup)) > > > > + return ERR_CAST(cgroup); > > > > + > > > > 
+ bpf_cgroup_storage_lock(); > > > > + sdata = cgroup_storage_lookup(cgroup, map, true); > > > > + bpf_cgroup_storage_unlock(); > > > > + cgroup_put(cgroup); > > > > + return sdata ? sdata->data : NULL; > > > > +} > > > > > > A lot of the above (free/lookup) seems to be copy-pasted from the task > > > storage; > > > any point in trying to generalize the common parts? > > > > > > > +static int bpf_cgroup_storage_update_elem(struct bpf_map *map, void *key, > > > > + void *value, u64 map_flags) > > > > +{ > > > > + struct bpf_local_storage_data *sdata; > > > > + struct cgroup *cgroup; > > > > + int err, fd; > > > > + > > > > + fd = *(int *)key; > > > > + cgroup = cgroup_get_from_fd(fd); > > > > + if (IS_ERR(cgroup)) > > > > + return PTR_ERR(cgroup); > > > > + > > > > + bpf_cgroup_storage_lock(); > > > > + sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map > > > > *)map, > > > > + value, map_flags, GFP_ATOMIC); > > > > + bpf_cgroup_storage_unlock(); > > > > + err = PTR_ERR_OR_ZERO(sdata); > > > > + cgroup_put(cgroup); > > > > + return err; > > > > +} > > > > + > > > > +static int cgroup_storage_delete(struct cgroup *cgroup, struct bpf_map > > > > *map) > > > > +{ > > > > + struct bpf_local_storage_data *sdata; > > > > + > > > > + sdata = cgroup_storage_lookup(cgroup, map, false); > > > > + if (!sdata) > > > > + return -ENOENT; > > > > + > > > > + bpf_selem_unlink(SELEM(sdata), true); > > > > + return 0; > > > > +} > > > > + > > > > +static int bpf_cgroup_storage_delete_elem(struct bpf_map *map, void *key) > > > > +{ > > > > + struct cgroup *cgroup; > > > > + int err, fd; > > > > + > > > > + fd = *(int *)key; > > > > + cgroup = cgroup_get_from_fd(fd); > > > > + if (IS_ERR(cgroup)) > > > > + return PTR_ERR(cgroup); > > > > + > > > > + bpf_cgroup_storage_lock(); > > > > + err = cgroup_storage_delete(cgroup, map); > > > > + bpf_cgroup_storage_unlock(); > > > > + if (err) > > > > + return err; > > > > + > > > > + cgroup_put(cgroup); > > > > + return 0; > 
> > > +} > > > > + > > > > +static int notsupp_get_next_key(struct bpf_map *map, void *key, void > > > > *next_key) > > > > +{ > > > > + return -ENOTSUPP; > > > > +} > > > > + > > > > +static struct bpf_map *cgroup_storage_map_alloc(union bpf_attr *attr) > > > > +{ > > > > + struct bpf_local_storage_map *smap; > > > > + > > > > + smap = bpf_local_storage_map_alloc(attr); > > > > + if (IS_ERR(smap)) > > > > + return ERR_CAST(smap); > > > > + > > > > + smap->cache_idx = bpf_local_storage_cache_idx_get(&cgroup_cache); > > > > + return &smap->map; > > > > +} > > > > + > > > > +static void cgroup_storage_map_free(struct bpf_map *map) > > > > +{ > > > > + struct bpf_local_storage_map *smap; > > > > + > > > > + smap = (struct bpf_local_storage_map *)map; > > > > + bpf_local_storage_cache_idx_free(&cgroup_cache, smap->cache_idx); > > > > + bpf_local_storage_map_free(smap, NULL); > > > > +} > > > > + > > > > +/* *gfp_flags* is a hidden argument provided by the verifier */ > > > > +BPF_CALL_5(bpf_cgroup_storage_get, struct bpf_map *, map, struct cgroup > > > > *, cgroup, > > > > + void *, value, u64, flags, gfp_t, gfp_flags) > > > > +{ > > > > + struct bpf_local_storage_data *sdata; > > > > + > > > > + WARN_ON_ONCE(!bpf_rcu_lock_held()); > > > > + if (flags & ~(BPF_LOCAL_STORAGE_GET_F_CREATE)) > > > > + return (unsigned long)NULL; > > > > + > > > > + if (!cgroup) > > > > + return (unsigned long)NULL; > > > > + > > > > + if (!bpf_cgroup_storage_trylock()) > > > > + return (unsigned long)NULL; > > > > + > > > > + sdata = cgroup_storage_lookup(cgroup, map, true); > > > > + if (sdata) > > > > + goto unlock; > > > > + > > > > + /* only allocate new storage, when the cgroup is refcounted */ > > > > + if (!percpu_ref_is_dying(&cgroup->self.refcnt) && > > > > + (flags & BPF_LOCAL_STORAGE_GET_F_CREATE)) > > > > + sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map > > > > *)map, > > > > + value, BPF_NOEXIST, gfp_flags); > > > > + > > > > +unlock: > > > > + 
bpf_cgroup_storage_unlock(); > > > > + return IS_ERR_OR_NULL(sdata) ? (unsigned long)NULL : (unsigned > > > > long)sdata->data; > > > > +} > > > > + > > > > +BPF_CALL_2(bpf_cgroup_storage_delete, struct bpf_map *, map, struct > > > > cgroup *, cgroup) > > > > +{ > > > > + int ret; > > > > + > > > > + WARN_ON_ONCE(!bpf_rcu_lock_held()); > > > > + if (!cgroup) > > > > + return -EINVAL; > > > > + > > > > + if (!bpf_cgroup_storage_trylock()) > > > > + return -EBUSY; > > > > + > > > > + ret = cgroup_storage_delete(cgroup, map); > > > > + bpf_cgroup_storage_unlock(); > > > > + return ret; > > > > +} > > > > + > > > > +BTF_ID_LIST_SINGLE(cgroup_storage_map_btf_ids, struct, > > > > bpf_local_storage_map) > > > > +const struct bpf_map_ops cgroup_local_storage_map_ops = { > > > > + .map_meta_equal = bpf_map_meta_equal, > > > > + .map_alloc_check = bpf_local_storage_map_alloc_check, > > > > + .map_alloc = cgroup_storage_map_alloc, > > > > + .map_free = cgroup_storage_map_free, > > > > + .map_get_next_key = notsupp_get_next_key, > > > > + .map_lookup_elem = bpf_cgroup_storage_lookup_elem, > > > > + .map_update_elem = bpf_cgroup_storage_update_elem, > > > > + .map_delete_elem = bpf_cgroup_storage_delete_elem, > > > > + .map_check_btf = bpf_local_storage_map_check_btf, > > > > + .map_btf_id = &cgroup_storage_map_btf_ids[0], > > > > + .map_owner_storage_ptr = cgroup_storage_ptr, > > > > +}; > > > > + > > > > +const struct bpf_func_proto bpf_cgroup_storage_get_proto = { > > > > + .func = bpf_cgroup_storage_get, > > > > + .gpl_only = false, > > > > + .ret_type = RET_PTR_TO_MAP_VALUE_OR_NULL, > > > > + .arg1_type = ARG_CONST_MAP_PTR, > > > > + .arg2_type = ARG_PTR_TO_BTF_ID, > > > > + .arg2_btf_id = &bpf_cgroup_btf_id[0], > > > > + .arg3_type = ARG_PTR_TO_MAP_VALUE_OR_NULL, > > > > + .arg4_type = ARG_ANYTHING, > > > > +}; > > > > + > > > > +const struct bpf_func_proto bpf_cgroup_storage_delete_proto = { > > > > + .func = bpf_cgroup_storage_delete, > > > > + .gpl_only = false, > > > 
> + .ret_type = RET_INTEGER, > > > > + .arg1_type = ARG_CONST_MAP_PTR, > > > > + .arg2_type = ARG_PTR_TO_BTF_ID, > > > > + .arg2_btf_id = &bpf_cgroup_btf_id[0], > > > > +}; > > > > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c > > > > index a6b04faed282..5c5bb08832ec 100644 > > > > --- a/kernel/bpf/helpers.c > > > > +++ b/kernel/bpf/helpers.c > > > > @@ -1663,6 +1663,12 @@ bpf_base_func_proto(enum bpf_func_id func_id) > > > > return &bpf_dynptr_write_proto; > > > > case BPF_FUNC_dynptr_data: > > > > return &bpf_dynptr_data_proto; > > > > +#ifdef CONFIG_CGROUPS > > > > + case BPF_FUNC_cgroup_local_storage_get: > > > > + return &bpf_cgroup_storage_get_proto; > > > > + case BPF_FUNC_cgroup_local_storage_delete: > > > > + return &bpf_cgroup_storage_delete_proto; > > > > +#endif > > > > default: > > > > break; > > > > } > > > > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c > > > > index 7b373a5e861f..e53c7fae6e22 100644 > > > > --- a/kernel/bpf/syscall.c > > > > +++ b/kernel/bpf/syscall.c > > > > @@ -1016,7 +1016,8 @@ static int map_check_btf(struct bpf_map *map, const > > > > struct btf *btf, > > > > map->map_type != BPF_MAP_TYPE_CGROUP_STORAGE && > > > > map->map_type != BPF_MAP_TYPE_SK_STORAGE && > > > > map->map_type != BPF_MAP_TYPE_INODE_STORAGE && > > > > - map->map_type != BPF_MAP_TYPE_TASK_STORAGE) > > > > + map->map_type != BPF_MAP_TYPE_TASK_STORAGE && > > > > + map->map_type != BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE) > > > > return -ENOTSUPP; > > > > if (map->spin_lock_off + sizeof(struct bpf_spin_lock) > > > > > map->value_size) { > > > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c > > > > index 6f6d2d511c06..f36f6a3c0d50 100644 > > > > --- a/kernel/bpf/verifier.c > > > > +++ b/kernel/bpf/verifier.c > > > > @@ -6360,6 +6360,11 @@ static int check_map_func_compatibility(struct > > > > bpf_verifier_env *env, > > > > func_id != BPF_FUNC_task_storage_delete) > > > > goto error; > > > > break; > > > > + case 
BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE: > > > > + if (func_id != BPF_FUNC_cgroup_local_storage_get && > > > > + func_id != BPF_FUNC_cgroup_local_storage_delete) > > > > + goto error; > > > > + break; > > > > case BPF_MAP_TYPE_BLOOM_FILTER: > > > > if (func_id != BPF_FUNC_map_peek_elem && > > > > func_id != BPF_FUNC_map_push_elem) > > > > @@ -6472,6 +6477,11 @@ static int check_map_func_compatibility(struct > > > > bpf_verifier_env *env, > > > > if (map->map_type != BPF_MAP_TYPE_TASK_STORAGE) > > > > goto error; > > > > break; > > > > + case BPF_FUNC_cgroup_local_storage_get: > > > > + case BPF_FUNC_cgroup_local_storage_delete: > > > > + if (map->map_type != BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE) > > > > + goto error; > > > > + break; > > > > default: > > > > break; > > > > } > > > > @@ -12713,6 +12723,7 @@ static int check_map_prog_compatibility(struct > > > > bpf_verifier_env *env, > > > > case BPF_MAP_TYPE_INODE_STORAGE: > > > > case BPF_MAP_TYPE_SK_STORAGE: > > > > case BPF_MAP_TYPE_TASK_STORAGE: > > > > + case BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE: > > > > break; > > > > default: > > > > verbose(env, > > > > @@ -14149,7 +14160,8 @@ static int do_misc_fixups(struct bpf_verifier_env > > > > *env) > > > > > > > if (insn->imm == BPF_FUNC_task_storage_get || > > > > insn->imm == BPF_FUNC_sk_storage_get || > > > > - insn->imm == BPF_FUNC_inode_storage_get) { > > > > + insn->imm == BPF_FUNC_inode_storage_get || > > > > + insn->imm == BPF_FUNC_cgroup_local_storage_get) { > > > > if (env->prog->aux->sleepable) > > > > insn_buf[0] = BPF_MOV64_IMM(BPF_REG_5, (__force __s32)GFP_KERNEL); > > > > else > > > > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c > > > > index 8ad2c267ff47..2fa2c950c7fb 100644 > > > > --- a/kernel/cgroup/cgroup.c > > > > +++ b/kernel/cgroup/cgroup.c > > > > @@ -985,6 +985,10 @@ void put_css_set_locked(struct css_set *cset) > > > > put_css_set_locked(cset->dom_cset); > > > > } > > > > > > > +#ifdef CONFIG_BPF_SYSCALL > > > > + 
bpf_local_cgroup_storage_free(cset->dfl_cgrp); > > > > +#endif > > > > + > > > > I am confused about this freeing site. It seems like this path is for > > freeing css_set's of task_structs, not for freeing the cgroup itself. > > Wouldn't we want to free the local storage when we free the cgroup > > itself? Somewhere like css_free_rwork_fn()? or did I completely miss > > the point here? > > > > > > kfree_rcu(cset, rcu_head); > > > > } > > > > > > > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c > > > > index 688552df95ca..179adaae4a9f 100644 > > > > --- a/kernel/trace/bpf_trace.c > > > > +++ b/kernel/trace/bpf_trace.c > > > > @@ -1454,6 +1454,10 @@ bpf_tracing_func_proto(enum bpf_func_id func_id, > > > > const struct bpf_prog *prog) > > > > return &bpf_get_current_cgroup_id_proto; > > > > case BPF_FUNC_get_current_ancestor_cgroup_id: > > > > return &bpf_get_current_ancestor_cgroup_id_proto; > > > > + case BPF_FUNC_cgroup_local_storage_get: > > > > + return &bpf_cgroup_storage_get_proto; > > > > + case BPF_FUNC_cgroup_local_storage_delete: > > > > + return &bpf_cgroup_storage_delete_proto; > > > > #endif > > > > case BPF_FUNC_send_signal: > > > > return &bpf_send_signal_proto; > > > > diff --git a/scripts/bpf_doc.py b/scripts/bpf_doc.py > > > > index c0e6690be82a..fdb0aff8cb5a 100755 > > > > --- a/scripts/bpf_doc.py > > > > +++ b/scripts/bpf_doc.py > > > > @@ -685,6 +685,7 @@ class PrinterHelpers(Printer): > > > > 'struct udp6_sock', > > > > 'struct unix_sock', > > > > 'struct task_struct', > > > > + 'struct cgroup', > > > > > > > 'struct __sk_buff', > > > > 'struct sk_msg_md', > > > > @@ -742,6 +743,7 @@ class PrinterHelpers(Printer): > > > > 'struct udp6_sock', > > > > 'struct unix_sock', > > > > 'struct task_struct', > > > > + 'struct cgroup', > > > > 'struct path', > > > > 'struct btf_ptr', > > > > 'struct inode', > > > > diff --git a/tools/include/uapi/linux/bpf.h > > > > b/tools/include/uapi/linux/bpf.h > > > > index 17f61338f8f8..d918b4054297 
100644 > > > > --- a/tools/include/uapi/linux/bpf.h > > > > +++ b/tools/include/uapi/linux/bpf.h > > > > @@ -935,6 +935,7 @@ enum bpf_map_type { > > > > BPF_MAP_TYPE_TASK_STORAGE, > > > > BPF_MAP_TYPE_BLOOM_FILTER, > > > > BPF_MAP_TYPE_USER_RINGBUF, > > > > + BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, > > > > }; > > > > > > > /* Note that tracing related programs such as > > > > @@ -5435,6 +5436,42 @@ union bpf_attr { > > > > * **-E2BIG** if user-space has tried to publish a sample which is > > > > * larger than the size of the ring buffer, or which cannot fit > > > > * within a struct bpf_dynptr. > > > > + * > > > > + * void *bpf_cgroup_local_storage_get(struct bpf_map *map, struct cgroup > > > > *cgroup, void *value, u64 flags) > > > > + * Description > > > > + * Get a bpf_local_storage from the *cgroup*. > > > > + * > > > > + * Logically, it could be thought of as getting the value from > > > > + * a *map* with *cgroup* as the **key**. From this > > > > + * perspective, the usage is not much different from > > > > + * **bpf_map_lookup_elem**\ (*map*, **&**\ *cgroup*) except this > > > > + * helper enforces the key must be a cgroup struct and the map must also > > > > + * be a **BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE**. > > > > + * > > > > + * Underneath, the value is stored locally at *cgroup* instead of > > > > + * the *map*. The *map* is used as the bpf-local-storage > > > > + * "type". The bpf-local-storage "type" (i.e. the *map*) is > > > > + * searched against all bpf_local_storage residing at *cgroup*. > > > > + * > > > > + * An optional *flags* (**BPF_LOCAL_STORAGE_GET_F_CREATE**) can be > > > > + * used such that a new bpf_local_storage will be > > > > + * created if one does not exist. *value* can be used > > > > + * together with **BPF_LOCAL_STORAGE_GET_F_CREATE** to specify > > > > + * the initial value of a bpf_local_storage. If *value* is > > > > + * **NULL**, the new bpf_local_storage will be zero initialized. 
> > > > + * Return > > > > + * A bpf_local_storage pointer is returned on success. > > > > + * > > > > + * **NULL** if not found or there was an error in adding > > > > + * a new bpf_local_storage. > > > > + * > > > > + * long bpf_cgroup_local_storage_delete(struct bpf_map *map, struct > > > > cgroup *cgroup) > > > > + * Description > > > > + * Delete a bpf_local_storage from a *cgroup*. > > > > + * Return > > > > + * 0 on success. > > > > + * > > > > + * **-ENOENT** if the bpf_local_storage cannot be found. > > > > */ > > > > #define ___BPF_FUNC_MAPPER(FN, ctx...) \ > > > > FN(unspec, 0, ##ctx) \ > > > > @@ -5647,6 +5684,8 @@ union bpf_attr { > > > > FN(tcp_raw_check_syncookie_ipv6, 207, ##ctx) \ > > > > FN(ktime_get_tai_ns, 208, ##ctx) \ > > > > FN(user_ringbuf_drain, 209, ##ctx) \ > > > > + FN(cgroup_local_storage_get, 210, ##ctx) \ > > > > + FN(cgroup_local_storage_delete, 211, ##ctx) \ > > > > /* */ > > > > > > > /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that > > > > don't > > > > -- > > > > 2.30.2 > > > ^ permalink raw reply [flat|nested] 38+ messages in thread
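The get semantics documented in the uapi comment above (lookup keyed by *cgroup*, optional **BPF_LOCAL_STORAGE_GET_F_CREATE**, zero-initialized or *value*-initialized creation, **NULL** when absent) can be sketched as a small userspace model. The `toy_*` names, the fixed 8-byte value size, and the flag constant below are illustrative stand-ins only, not kernel code:

```c
#include <stddef.h>
#include <string.h>

#define TOY_VALUE_SZ 8
#define TOY_GET_F_CREATE 1ULL  /* stand-in for BPF_LOCAL_STORAGE_GET_F_CREATE */

/* Toy owner: one storage slot per "cgroup". */
struct toy_cgroup {
    int has_storage;
    unsigned char data[TOY_VALUE_SZ];
};

/* Mirrors the documented behavior: return existing storage if present;
 * if absent, create it only when F_CREATE is set, initialized from
 * *value* when given and zeroed otherwise; NULL in all other cases. */
static void *toy_storage_get(struct toy_cgroup *cg, const void *value,
                             unsigned long long flags)
{
    if (flags & ~TOY_GET_F_CREATE)
        return NULL;                 /* unknown flags fail, like the helper */
    if (cg->has_storage)
        return cg->data;             /* found existing storage */
    if (!(flags & TOY_GET_F_CREATE))
        return NULL;                 /* not found, creation not requested */
    if (value)
        memcpy(cg->data, value, TOY_VALUE_SZ); /* caller-supplied init */
    else
        memset(cg->data, 0, TOY_VALUE_SZ);     /* zero-initialized */
    cg->has_storage = 1;
    return cg->data;
}
```

Because the helper may return **NULL**, BPF-side callers must NULL-check the result before dereferencing, unlike the old attach-time bpf_get_local_storage().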
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs 2022-10-17 18:47 ` Yosry Ahmed @ 2022-10-17 19:07 ` Stanislav Fomichev 2022-10-17 19:11 ` Yosry Ahmed 2022-10-17 20:15 ` Yonghong Song 1 sibling, 1 reply; 38+ messages in thread From: Stanislav Fomichev @ 2022-10-17 19:07 UTC (permalink / raw) To: Yosry Ahmed Cc: Yonghong Song, bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo On Mon, Oct 17, 2022 at 11:47 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > On Mon, Oct 17, 2022 at 11:43 AM Stanislav Fomichev <sdf@google.com> wrote: > > > > On Mon, Oct 17, 2022 at 11:26 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > On Mon, Oct 17, 2022 at 11:02 AM <sdf@google.com> wrote: > > > > > > > > On 10/13, Yonghong Song wrote: > > > > > Similar to sk/inode/task storage, implement similar cgroup local storage. > > > > > > > > > There already exists a local storage implementation for cgroup-attached > > > > > bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper > > > > > bpf_get_local_storage(). But there are use cases such that non-cgroup > > > > > attached bpf progs wants to access cgroup local storage data. For example, > > > > > tc egress prog has access to sk and cgroup. It is possible to use > > > > > sk local storage to emulate cgroup local storage by storing data in > > > > > socket. > > > > > But this is a waste as it could be lots of sockets belonging to a > > > > > particular > > > > > cgroup. Alternatively, a separate map can be created with cgroup id as > > > > > the key. > > > > > But this will introduce additional overhead to manipulate the new map. > > > > > A cgroup local storage, similar to existing sk/inode/task storage, > > > > > should help for this use case. > > > > > > > > > The life-cycle of storage is managed with the life-cycle of the > > > > > cgroup struct. i.e. 
> > > > > the storage is destroyed along with the owning cgroup
> > > > > with a callback to the bpf_cgroup_storage_free when cgroup itself
> > > > > is deleted.
> > > > >
> > > > > The userspace map operations can be done by using a cgroup fd as a key
> > > > > passed to the lookup, update and delete operations.
> > > >
> > > > [..]
> > > >
> > > > > Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old cgroup
> > > > > local storage support, the new map name BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE
> > > > > is used for cgroup storage available to non-cgroup-attached bpf programs.
> > > > > The two helpers are named as bpf_cgroup_local_storage_get() and
> > > > > bpf_cgroup_local_storage_delete().
> > > >
> > > > Have you considered doing something similar to 7d9c3427894f ("bpf: Make
> > > > cgroup storages shared between programs on the same cgroup") where
> > > > the map changes its behavior depending on the key size (see key_size checks
> > > > in cgroup_storage_map_alloc)? Looks like sizeof(int) for fd still
> > > > can be used so we can, in theory, reuse the name..
> > > >
> > > > Pros:
> > > > - no need for a new map name
> > > >
> > > > Cons:
> > > > - existing BPF_MAP_TYPE_CGROUP_STORAGE is already messy; might be not a
> > > >   good idea to add more stuff to it?
> > > >
> > > > But, for the very least, should we also extend
> > > > Documentation/bpf/map_cgroup_storage.rst to cover the new map? We've
> > > > tried to keep some of the important details in there..
> > >
> > > This might be a long shot, but is it possible to switch completely to
> > > this new generic cgroup storage, and for programs that attach to
> > > cgroups we can still do lookups/allocations during attachment like we
> > > do today? IOW, maintain the current API for cgroup progs but switch it
> > > to use this new map type instead.
> > >
> > > It feels like this map type is more generic and can be a superset of
> > > the existing cgroup storage, but I feel like I am missing something.
> >
> > I feel like the biggest issue is that the existing
> > bpf_get_local_storage helper is guaranteed to always return non-null
> > and the verifier doesn't require the programs to do null checks on it;
> > the new helper might return NULL making all existing programs fail the
> > verifier.
>
> What I meant is, keep the old bpf_get_local_storage helper only for
> cgroup-attached programs like we have today, and add a new generic
> bpf_cgroup_local_storage_get() helper.
>
> For cgroup-attached programs, make sure a cgroup storage entry is
> allocated and hooked to the helper on program attach time, to keep
> today's behavior constant.
>
> For other programs, the bpf_cgroup_local_storage_get() will do the
> normal lookup and allocate if necessary.
>
> Does this make any sense to you?

But then you also need to somehow mark these to make sure it's not
possible to delete them as long as the program is loaded/attached? Not
saying it's impossible, but it's a bit of a departure from the existing
common local storage framework used by inode/task; not sure whether we
want to pull all this complexity in there? But we can definitely try if
there is a wider agreement..

> > There might be something else I don't remember at this point (besides
> > that weird per-prog_type that we'd have to emulate as well)..
>
> Yeah there are things that will need to be emulated, but I feel like
> we may end up with less confusing code (and less code in general).
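The split discussed above — preallocate storage at attach time so the legacy-style getter can stay non-NULL, while the generic getter does a plain lookup that may miss — can be modeled in a few lines of userspace C. Every name here (`toy_attach`, `toy_get_attached`, `toy_get_generic`, `struct toy_cgroup`) is an illustrative stand-in for the idea being discussed, not a kernel API:

```c
#include <stddef.h>

struct toy_cgroup { void *storage; };

static char toy_slot[64];  /* stands in for an allocated storage element */

/* Attach-time path: guarantee storage exists before the program runs,
 * which is what lets the old bpf_get_local_storage-style getter skip
 * NULL checks in the verifier. */
static int toy_attach(struct toy_cgroup *cg)
{
    if (!cg->storage)
        cg->storage = toy_slot;
    return 0;
}

/* Old-style getter: only reachable after attach, hence never NULL. */
static void *toy_get_attached(struct toy_cgroup *cg)
{
    return cg->storage;
}

/* New-style generic getter: a plain lookup that may return NULL when
 * nothing was ever created for this owner. */
static void *toy_get_generic(struct toy_cgroup *cg)
{
    return cg->storage;
}
```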
> > > > > > > > > > > > Signed-off-by: Yonghong Song <yhs@fb.com> > > > > > --- > > > > > include/linux/bpf.h | 3 + > > > > > include/linux/bpf_types.h | 1 + > > > > > include/linux/cgroup-defs.h | 4 + > > > > > include/uapi/linux/bpf.h | 39 +++++ > > > > > kernel/bpf/Makefile | 2 +- > > > > > kernel/bpf/bpf_cgroup_storage.c | 280 ++++++++++++++++++++++++++++++++ > > > > > kernel/bpf/helpers.c | 6 + > > > > > kernel/bpf/syscall.c | 3 +- > > > > > kernel/bpf/verifier.c | 14 +- > > > > > kernel/cgroup/cgroup.c | 4 + > > > > > kernel/trace/bpf_trace.c | 4 + > > > > > scripts/bpf_doc.py | 2 + > > > > > tools/include/uapi/linux/bpf.h | 39 +++++ > > > > > 13 files changed, 398 insertions(+), 3 deletions(-) > > > > > create mode 100644 kernel/bpf/bpf_cgroup_storage.c > > > > > > > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h > > > > > index 9e7d46d16032..1395a01c7f18 100644 > > > > > --- a/include/linux/bpf.h > > > > > +++ b/include/linux/bpf.h > > > > > @@ -2045,6 +2045,7 @@ struct bpf_link *bpf_link_by_id(u32 id); > > > > > > > > > const struct bpf_func_proto *bpf_base_func_proto(enum bpf_func_id > > > > > func_id); > > > > > void bpf_task_storage_free(struct task_struct *task); > > > > > +void bpf_local_cgroup_storage_free(struct cgroup *cgroup); > > > > > bool bpf_prog_has_kfunc_call(const struct bpf_prog *prog); > > > > > const struct btf_func_model * > > > > > bpf_jit_find_kfunc_model(const struct bpf_prog *prog, > > > > > @@ -2537,6 +2538,8 @@ extern const struct bpf_func_proto > > > > > bpf_copy_from_user_task_proto; > > > > > extern const struct bpf_func_proto bpf_set_retval_proto; > > > > > extern const struct bpf_func_proto bpf_get_retval_proto; > > > > > extern const struct bpf_func_proto bpf_user_ringbuf_drain_proto; > > > > > +extern const struct bpf_func_proto bpf_cgroup_storage_get_proto; > > > > > +extern const struct bpf_func_proto bpf_cgroup_storage_delete_proto; > > > > > > > > > const struct bpf_func_proto *tracing_prog_func_proto( > > > 
> > enum bpf_func_id func_id, const struct bpf_prog *prog); > > > > > diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h > > > > > index 2c6a4f2562a7..7a0362d7a0aa 100644 > > > > > --- a/include/linux/bpf_types.h > > > > > +++ b/include/linux/bpf_types.h > > > > > @@ -90,6 +90,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_ARRAY, > > > > > cgroup_array_map_ops) > > > > > #ifdef CONFIG_CGROUP_BPF > > > > > BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_STORAGE, cgroup_storage_map_ops) > > > > > BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE, cgroup_storage_map_ops) > > > > > +BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, > > > > > cgroup_local_storage_map_ops) > > > > > #endif > > > > > BPF_MAP_TYPE(BPF_MAP_TYPE_HASH, htab_map_ops) > > > > > BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_HASH, htab_percpu_map_ops) > > > > > diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h > > > > > index 4bcf56b3491c..c6f4590dda68 100644 > > > > > --- a/include/linux/cgroup-defs.h > > > > > +++ b/include/linux/cgroup-defs.h > > > > > @@ -504,6 +504,10 @@ struct cgroup { > > > > > /* Used to store internal freezer state */ > > > > > struct cgroup_freezer_state freezer; > > > > > > > > > +#ifdef CONFIG_BPF_SYSCALL > > > > > + struct bpf_local_storage __rcu *bpf_cgroup_storage; > > > > > +#endif > > > > > + > > > > > /* ids of the ancestors at each level including self */ > > > > > u64 ancestor_ids[]; > > > > > }; > > > > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > > > > > index 17f61338f8f8..d918b4054297 100644 > > > > > --- a/include/uapi/linux/bpf.h > > > > > +++ b/include/uapi/linux/bpf.h > > > > > @@ -935,6 +935,7 @@ enum bpf_map_type { > > > > > BPF_MAP_TYPE_TASK_STORAGE, > > > > > BPF_MAP_TYPE_BLOOM_FILTER, > > > > > BPF_MAP_TYPE_USER_RINGBUF, > > > > > + BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, > > > > > }; > > > > > > > > > /* Note that tracing related programs such as > > > > > @@ -5435,6 +5436,42 @@ union bpf_attr { > > > > > * **-E2BIG** if user-space 
has tried to publish a sample which is > > > > > * larger than the size of the ring buffer, or which cannot fit > > > > > * within a struct bpf_dynptr. > > > > > + * > > > > > + * void *bpf_cgroup_local_storage_get(struct bpf_map *map, struct cgroup > > > > > *cgroup, void *value, u64 flags) > > > > > + * Description > > > > > + * Get a bpf_local_storage from the *cgroup*. > > > > > + * > > > > > + * Logically, it could be thought of as getting the value from > > > > > + * a *map* with *cgroup* as the **key**. From this > > > > > + * perspective, the usage is not much different from > > > > > + * **bpf_map_lookup_elem**\ (*map*, **&**\ *cgroup*) except this > > > > > + * helper enforces the key must be a cgroup struct and the map must also > > > > > + * be a **BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE**. > > > > > + * > > > > > + * Underneath, the value is stored locally at *cgroup* instead of > > > > > + * the *map*. The *map* is used as the bpf-local-storage > > > > > + * "type". The bpf-local-storage "type" (i.e. the *map*) is > > > > > + * searched against all bpf_local_storage residing at *cgroup*. > > > > > + * > > > > > + * An optional *flags* (**BPF_LOCAL_STORAGE_GET_F_CREATE**) can be > > > > > + * used such that a new bpf_local_storage will be > > > > > + * created if one does not exist. *value* can be used > > > > > + * together with **BPF_LOCAL_STORAGE_GET_F_CREATE** to specify > > > > > + * the initial value of a bpf_local_storage. If *value* is > > > > > + * **NULL**, the new bpf_local_storage will be zero initialized. > > > > > + * Return > > > > > + * A bpf_local_storage pointer is returned on success. > > > > > + * > > > > > + * **NULL** if not found or there was an error in adding > > > > > + * a new bpf_local_storage. > > > > > + * > > > > > + * long bpf_cgroup_local_storage_delete(struct bpf_map *map, struct > > > > > cgroup *cgroup) > > > > > + * Description > > > > > + * Delete a bpf_local_storage from a *cgroup*. 
> > > > > + * Return > > > > > + * 0 on success. > > > > > + * > > > > > + * **-ENOENT** if the bpf_local_storage cannot be found. > > > > > */ > > > > > #define ___BPF_FUNC_MAPPER(FN, ctx...) \ > > > > > FN(unspec, 0, ##ctx) \ > > > > > @@ -5647,6 +5684,8 @@ union bpf_attr { > > > > > FN(tcp_raw_check_syncookie_ipv6, 207, ##ctx) \ > > > > > FN(ktime_get_tai_ns, 208, ##ctx) \ > > > > > FN(user_ringbuf_drain, 209, ##ctx) \ > > > > > + FN(cgroup_local_storage_get, 210, ##ctx) \ > > > > > + FN(cgroup_local_storage_delete, 211, ##ctx) \ > > > > > /* */ > > > > > > > > > /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that > > > > > don't > > > > > diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile > > > > > index 341c94f208f4..b02693f51978 100644 > > > > > --- a/kernel/bpf/Makefile > > > > > +++ b/kernel/bpf/Makefile > > > > > @@ -25,7 +25,7 @@ ifeq ($(CONFIG_PERF_EVENTS),y) > > > > > obj-$(CONFIG_BPF_SYSCALL) += stackmap.o > > > > > endif > > > > > ifeq ($(CONFIG_CGROUPS),y) > > > > > -obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o > > > > > +obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o bpf_cgroup_storage.o > > > > > endif > > > > > obj-$(CONFIG_CGROUP_BPF) += cgroup.o > > > > > ifeq ($(CONFIG_INET),y) > > > > > diff --git a/kernel/bpf/bpf_cgroup_storage.c > > > > > b/kernel/bpf/bpf_cgroup_storage.c > > > > > new file mode 100644 > > > > > index 000000000000..9974784822da > > > > > --- /dev/null > > > > > +++ b/kernel/bpf/bpf_cgroup_storage.c > > > > > @@ -0,0 +1,280 @@ > > > > > +// SPDX-License-Identifier: GPL-2.0 > > > > > +/* > > > > > + * Copyright (c) 2022 Meta Platforms, Inc. and affiliates. 
> > > > > + */ > > > > > + > > > > > +#include <linux/types.h> > > > > > +#include <linux/bpf.h> > > > > > +#include <linux/bpf_local_storage.h> > > > > > +#include <uapi/linux/btf.h> > > > > > +#include <linux/btf_ids.h> > > > > > + > > > > > +DEFINE_BPF_STORAGE_CACHE(cgroup_cache); > > > > > + > > > > > +static DEFINE_PER_CPU(int, bpf_cgroup_storage_busy); > > > > > + > > > > > +static void bpf_cgroup_storage_lock(void) > > > > > +{ > > > > > + migrate_disable(); > > > > > + this_cpu_inc(bpf_cgroup_storage_busy); > > > > > +} > > > > > + > > > > > +static void bpf_cgroup_storage_unlock(void) > > > > > +{ > > > > > + this_cpu_dec(bpf_cgroup_storage_busy); > > > > > + migrate_enable(); > > > > > +} > > > > > + > > > > > +static bool bpf_cgroup_storage_trylock(void) > > > > > +{ > > > > > + migrate_disable(); > > > > > + if (unlikely(this_cpu_inc_return(bpf_cgroup_storage_busy) != 1)) { > > > > > + this_cpu_dec(bpf_cgroup_storage_busy); > > > > > + migrate_enable(); > > > > > + return false; > > > > > + } > > > > > + return true; > > > > > +} > > > > > > > > Task storage has lock/unlock/trylock; inode storage doesn't; why does > > > > cgroup need it as well? 
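On what the trylock above buys: bpf_cgroup_storage_busy acts as a per-CPU recursion guard, so a storage operation that re-enters on the same CPU (e.g. a tracing program firing while another storage operation already holds the guard) fails fast instead of recursing into local_storage->lock — presumably needed here, as with task storage, because these helpers are exposed to tracing programs. A single-CPU userspace sketch of that guard, with migrate_disable() and the this_cpu_* accessors replaced by a plain counter (names illustrative):

```c
#include <stdbool.h>

/* Single-CPU analogue of the per-CPU bpf_cgroup_storage_busy counter. */
static int toy_busy;

/* Succeeds only for the outermost caller on this "CPU"; a nested
 * attempt sees the counter already raised and backs off. */
static bool toy_storage_trylock(void)
{
    if (++toy_busy != 1) {   /* already inside a storage op */
        --toy_busy;
        return false;        /* caller returns -EBUSY / NULL instead */
    }
    return true;
}

static void toy_storage_unlock(void)
{
    --toy_busy;
}
```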
> > > > > > > > > +static struct bpf_local_storage __rcu **cgroup_storage_ptr(void *owner) > > > > > +{ > > > > > + struct cgroup *cg = owner; > > > > > + > > > > > + return &cg->bpf_cgroup_storage; > > > > > +} > > > > > + > > > > > +void bpf_local_cgroup_storage_free(struct cgroup *cgroup) > > > > > +{ > > > > > + struct bpf_local_storage *local_storage; > > > > > + struct bpf_local_storage_elem *selem; > > > > > + bool free_cgroup_storage = false; > > > > > + struct hlist_node *n; > > > > > + unsigned long flags; > > > > > + > > > > > + rcu_read_lock(); > > > > > + local_storage = rcu_dereference(cgroup->bpf_cgroup_storage); > > > > > + if (!local_storage) { > > > > > + rcu_read_unlock(); > > > > > + return; > > > > > + } > > > > > + > > > > > + /* Neither the bpf_prog nor the bpf-map's syscall > > > > > + * could be modifying the local_storage->list now. > > > > > + * Thus, no elem can be added-to or deleted-from the > > > > > + * local_storage->list by the bpf_prog or by the bpf-map's syscall. > > > > > + * > > > > > + * It is racing with bpf_local_storage_map_free() alone > > > > > + * when unlinking elem from the local_storage->list and > > > > > + * the map's bucket->list. > > > > > + */ > > > > > + bpf_cgroup_storage_lock(); > > > > > + raw_spin_lock_irqsave(&local_storage->lock, flags); > > > > > + hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) { > > > > > + bpf_selem_unlink_map(selem); > > > > > + free_cgroup_storage = > > > > > + bpf_selem_unlink_storage_nolock(local_storage, selem, false, false); > > > > > + } > > > > > + raw_spin_unlock_irqrestore(&local_storage->lock, flags); > > > > > + bpf_cgroup_storage_unlock(); > > > > > + rcu_read_unlock(); > > > > > + > > > > > + /* free_cgroup_storage should always be true as long as > > > > > + * local_storage->list was non-empty. 
> > > > > + */ > > > > > + if (free_cgroup_storage) > > > > > + kfree_rcu(local_storage, rcu); > > > > > +} > > > > > > > > > +static struct bpf_local_storage_data * > > > > > +cgroup_storage_lookup(struct cgroup *cgroup, struct bpf_map *map, bool > > > > > cacheit_lockit) > > > > > +{ > > > > > + struct bpf_local_storage *cgroup_storage; > > > > > + struct bpf_local_storage_map *smap; > > > > > + > > > > > + cgroup_storage = rcu_dereference_check(cgroup->bpf_cgroup_storage, > > > > > + bpf_rcu_lock_held()); > > > > > + if (!cgroup_storage) > > > > > + return NULL; > > > > > + > > > > > + smap = (struct bpf_local_storage_map *)map; > > > > > + return bpf_local_storage_lookup(cgroup_storage, smap, cacheit_lockit); > > > > > +} > > > > > + > > > > > +static void *bpf_cgroup_storage_lookup_elem(struct bpf_map *map, void > > > > > *key) > > > > > +{ > > > > > + struct bpf_local_storage_data *sdata; > > > > > + struct cgroup *cgroup; > > > > > + int fd; > > > > > + > > > > > + fd = *(int *)key; > > > > > + cgroup = cgroup_get_from_fd(fd); > > > > > + if (IS_ERR(cgroup)) > > > > > + return ERR_CAST(cgroup); > > > > > + > > > > > + bpf_cgroup_storage_lock(); > > > > > + sdata = cgroup_storage_lookup(cgroup, map, true); > > > > > + bpf_cgroup_storage_unlock(); > > > > > + cgroup_put(cgroup); > > > > > + return sdata ? sdata->data : NULL; > > > > > +} > > > > > > > > A lot of the above (free/lookup) seems to be copy-pasted from the task > > > > storage; > > > > any point in trying to generalize the common parts? 
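On the generalization question: the per-owner free/lookup wrappers differ mainly in how they reach the owner's bpf_local_storage pointer, and the patch's own map_owner_storage_ptr callback already abstracts exactly that. A minimal userspace sketch of the factoring being suggested — all `toy_*` names are illustrative stand-ins, not proposed kernel symbols:

```c
#include <stddef.h>

struct toy_storage { int dummy; };

/* Analogue of map_owner_storage_ptr in bpf_map_ops: given an opaque
 * owner, return the address of its storage pointer. */
struct toy_map_ops {
    struct toy_storage **(*owner_storage_ptr)(void *owner);
};

/* One shared lookup, instead of near-identical copies per owner type. */
static struct toy_storage *toy_generic_lookup(const struct toy_map_ops *ops,
                                              void *owner)
{
    struct toy_storage **slot = ops->owner_storage_ptr(owner);

    return slot ? *slot : NULL;
}

/* cgroup-flavored owner, mirroring cgroup_storage_ptr() above. */
struct toy_cgroup { struct toy_storage *bpf_cgroup_storage; };

static struct toy_storage **toy_cgroup_owner_ptr(void *owner)
{
    return &((struct toy_cgroup *)owner)->bpf_cgroup_storage;
}
```

task and inode storage could plug in their own `owner_storage_ptr` equivalents against the same shared core.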
> > > > > > > > > +static int bpf_cgroup_storage_update_elem(struct bpf_map *map, void *key, > > > > > + void *value, u64 map_flags) > > > > > +{ > > > > > + struct bpf_local_storage_data *sdata; > > > > > + struct cgroup *cgroup; > > > > > + int err, fd; > > > > > + > > > > > + fd = *(int *)key; > > > > > + cgroup = cgroup_get_from_fd(fd); > > > > > + if (IS_ERR(cgroup)) > > > > > + return PTR_ERR(cgroup); > > > > > + > > > > > + bpf_cgroup_storage_lock(); > > > > > + sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map > > > > > *)map, > > > > > + value, map_flags, GFP_ATOMIC); > > > > > + bpf_cgroup_storage_unlock(); > > > > > + err = PTR_ERR_OR_ZERO(sdata); > > > > > + cgroup_put(cgroup); > > > > > + return err; > > > > > +} > > > > > + > > > > > +static int cgroup_storage_delete(struct cgroup *cgroup, struct bpf_map > > > > > *map) > > > > > +{ > > > > > + struct bpf_local_storage_data *sdata; > > > > > + > > > > > + sdata = cgroup_storage_lookup(cgroup, map, false); > > > > > + if (!sdata) > > > > > + return -ENOENT; > > > > > + > > > > > + bpf_selem_unlink(SELEM(sdata), true); > > > > > + return 0; > > > > > +} > > > > > + > > > > > +static int bpf_cgroup_storage_delete_elem(struct bpf_map *map, void *key) > > > > > +{ > > > > > + struct cgroup *cgroup; > > > > > + int err, fd; > > > > > + > > > > > + fd = *(int *)key; > > > > > + cgroup = cgroup_get_from_fd(fd); > > > > > + if (IS_ERR(cgroup)) > > > > > + return PTR_ERR(cgroup); > > > > > + > > > > > + bpf_cgroup_storage_lock(); > > > > > + err = cgroup_storage_delete(cgroup, map); > > > > > + bpf_cgroup_storage_unlock(); > > > > > + if (err) > > > > > + return err; > > > > > + > > > > > + cgroup_put(cgroup); > > > > > + return 0; > > > > > +} > > > > > + > > > > > +static int notsupp_get_next_key(struct bpf_map *map, void *key, void > > > > > *next_key) > > > > > +{ > > > > > + return -ENOTSUPP; > > > > > +} > > > > > + > > > > > +static struct bpf_map *cgroup_storage_map_alloc(union 
bpf_attr *attr) > > > > > +{ > > > > > + struct bpf_local_storage_map *smap; > > > > > + > > > > > + smap = bpf_local_storage_map_alloc(attr); > > > > > + if (IS_ERR(smap)) > > > > > + return ERR_CAST(smap); > > > > > + > > > > > + smap->cache_idx = bpf_local_storage_cache_idx_get(&cgroup_cache); > > > > > + return &smap->map; > > > > > +} > > > > > + > > > > > +static void cgroup_storage_map_free(struct bpf_map *map) > > > > > +{ > > > > > + struct bpf_local_storage_map *smap; > > > > > + > > > > > + smap = (struct bpf_local_storage_map *)map; > > > > > + bpf_local_storage_cache_idx_free(&cgroup_cache, smap->cache_idx); > > > > > + bpf_local_storage_map_free(smap, NULL); > > > > > +} > > > > > + > > > > > +/* *gfp_flags* is a hidden argument provided by the verifier */ > > > > > +BPF_CALL_5(bpf_cgroup_storage_get, struct bpf_map *, map, struct cgroup > > > > > *, cgroup, > > > > > + void *, value, u64, flags, gfp_t, gfp_flags) > > > > > +{ > > > > > + struct bpf_local_storage_data *sdata; > > > > > + > > > > > + WARN_ON_ONCE(!bpf_rcu_lock_held()); > > > > > + if (flags & ~(BPF_LOCAL_STORAGE_GET_F_CREATE)) > > > > > + return (unsigned long)NULL; > > > > > + > > > > > + if (!cgroup) > > > > > + return (unsigned long)NULL; > > > > > + > > > > > + if (!bpf_cgroup_storage_trylock()) > > > > > + return (unsigned long)NULL; > > > > > + > > > > > + sdata = cgroup_storage_lookup(cgroup, map, true); > > > > > + if (sdata) > > > > > + goto unlock; > > > > > + > > > > > + /* only allocate new storage, when the cgroup is refcounted */ > > > > > + if (!percpu_ref_is_dying(&cgroup->self.refcnt) && > > > > > + (flags & BPF_LOCAL_STORAGE_GET_F_CREATE)) > > > > > + sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map > > > > > *)map, > > > > > + value, BPF_NOEXIST, gfp_flags); > > > > > + > > > > > +unlock: > > > > > + bpf_cgroup_storage_unlock(); > > > > > + return IS_ERR_OR_NULL(sdata) ? 
(unsigned long)NULL : (unsigned > > > > > long)sdata->data; > > > > > +} > > > > > + > > > > > +BPF_CALL_2(bpf_cgroup_storage_delete, struct bpf_map *, map, struct > > > > > cgroup *, cgroup) > > > > > +{ > > > > > + int ret; > > > > > + > > > > > + WARN_ON_ONCE(!bpf_rcu_lock_held()); > > > > > + if (!cgroup) > > > > > + return -EINVAL; > > > > > + > > > > > + if (!bpf_cgroup_storage_trylock()) > > > > > + return -EBUSY; > > > > > + > > > > > + ret = cgroup_storage_delete(cgroup, map); > > > > > + bpf_cgroup_storage_unlock(); > > > > > + return ret; > > > > > +} > > > > > + > > > > > +BTF_ID_LIST_SINGLE(cgroup_storage_map_btf_ids, struct, > > > > > bpf_local_storage_map) > > > > > +const struct bpf_map_ops cgroup_local_storage_map_ops = { > > > > > + .map_meta_equal = bpf_map_meta_equal, > > > > > + .map_alloc_check = bpf_local_storage_map_alloc_check, > > > > > + .map_alloc = cgroup_storage_map_alloc, > > > > > + .map_free = cgroup_storage_map_free, > > > > > + .map_get_next_key = notsupp_get_next_key, > > > > > + .map_lookup_elem = bpf_cgroup_storage_lookup_elem, > > > > > + .map_update_elem = bpf_cgroup_storage_update_elem, > > > > > + .map_delete_elem = bpf_cgroup_storage_delete_elem, > > > > > + .map_check_btf = bpf_local_storage_map_check_btf, > > > > > + .map_btf_id = &cgroup_storage_map_btf_ids[0], > > > > > + .map_owner_storage_ptr = cgroup_storage_ptr, > > > > > +}; > > > > > + > > > > > +const struct bpf_func_proto bpf_cgroup_storage_get_proto = { > > > > > + .func = bpf_cgroup_storage_get, > > > > > + .gpl_only = false, > > > > > + .ret_type = RET_PTR_TO_MAP_VALUE_OR_NULL, > > > > > + .arg1_type = ARG_CONST_MAP_PTR, > > > > > + .arg2_type = ARG_PTR_TO_BTF_ID, > > > > > + .arg2_btf_id = &bpf_cgroup_btf_id[0], > > > > > + .arg3_type = ARG_PTR_TO_MAP_VALUE_OR_NULL, > > > > > + .arg4_type = ARG_ANYTHING, > > > > > +}; > > > > > + > > > > > +const struct bpf_func_proto bpf_cgroup_storage_delete_proto = { > > > > > + .func = bpf_cgroup_storage_delete, > > > > 
> > > > > + .gpl_only = false,
> > > > > + .ret_type = RET_INTEGER,
> > > > > + .arg1_type = ARG_CONST_MAP_PTR,
> > > > > + .arg2_type = ARG_PTR_TO_BTF_ID,
> > > > > + .arg2_btf_id = &bpf_cgroup_btf_id[0],
> > > > > +};
> > > > > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> > > > > index a6b04faed282..5c5bb08832ec 100644
> > > > > --- a/kernel/bpf/helpers.c
> > > > > +++ b/kernel/bpf/helpers.c
> > > > > @@ -1663,6 +1663,12 @@ bpf_base_func_proto(enum bpf_func_id func_id)
> > > > > return &bpf_dynptr_write_proto;
> > > > > case BPF_FUNC_dynptr_data:
> > > > > return &bpf_dynptr_data_proto;
> > > > > +#ifdef CONFIG_CGROUPS
> > > > > + case BPF_FUNC_cgroup_local_storage_get:
> > > > > + return &bpf_cgroup_storage_get_proto;
> > > > > + case BPF_FUNC_cgroup_local_storage_delete:
> > > > > + return &bpf_cgroup_storage_delete_proto;
> > > > > +#endif
> > > > > default:
> > > > > break;
> > > > > }
> > > > > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > > > > index 7b373a5e861f..e53c7fae6e22 100644
> > > > > --- a/kernel/bpf/syscall.c
> > > > > +++ b/kernel/bpf/syscall.c
> > > > > @@ -1016,7 +1016,8 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
> > > > > map->map_type != BPF_MAP_TYPE_CGROUP_STORAGE &&
> > > > > map->map_type != BPF_MAP_TYPE_SK_STORAGE &&
> > > > > map->map_type != BPF_MAP_TYPE_INODE_STORAGE &&
> > > > > - map->map_type != BPF_MAP_TYPE_TASK_STORAGE)
> > > > > + map->map_type != BPF_MAP_TYPE_TASK_STORAGE &&
> > > > > + map->map_type != BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE)
> > > > > return -ENOTSUPP;
> > > > > if (map->spin_lock_off + sizeof(struct bpf_spin_lock) > map->value_size) {
> > > > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > > > > index 6f6d2d511c06..f36f6a3c0d50 100644
> > > > > --- a/kernel/bpf/verifier.c
> > > > > +++ b/kernel/bpf/verifier.c
> > > > > @@ -6360,6 +6360,11 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
> > > > > func_id != BPF_FUNC_task_storage_delete)
> > > > > goto error;
> > > > > break;
> > > > > + case BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE:
> > > > > + if (func_id != BPF_FUNC_cgroup_local_storage_get &&
> > > > > + func_id != BPF_FUNC_cgroup_local_storage_delete)
> > > > > + goto error;
> > > > > + break;
> > > > > case BPF_MAP_TYPE_BLOOM_FILTER:
> > > > > if (func_id != BPF_FUNC_map_peek_elem &&
> > > > > func_id != BPF_FUNC_map_push_elem)
> > > > > @@ -6472,6 +6477,11 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
> > > > > if (map->map_type != BPF_MAP_TYPE_TASK_STORAGE)
> > > > > goto error;
> > > > > break;
> > > > > + case BPF_FUNC_cgroup_local_storage_get:
> > > > > + case BPF_FUNC_cgroup_local_storage_delete:
> > > > > + if (map->map_type != BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE)
> > > > > + goto error;
> > > > > + break;
> > > > > default:
> > > > > break;
> > > > > }
> > > > > @@ -12713,6 +12723,7 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
> > > > > case BPF_MAP_TYPE_INODE_STORAGE:
> > > > > case BPF_MAP_TYPE_SK_STORAGE:
> > > > > case BPF_MAP_TYPE_TASK_STORAGE:
> > > > > + case BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE:
> > > > > break;
> > > > > default:
> > > > > verbose(env,
> > > > > @@ -14149,7 +14160,8 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
> > > > >
> > > > > if (insn->imm == BPF_FUNC_task_storage_get ||
> > > > > insn->imm == BPF_FUNC_sk_storage_get ||
> > > > > - insn->imm == BPF_FUNC_inode_storage_get) {
> > > > > + insn->imm == BPF_FUNC_inode_storage_get ||
> > > > > + insn->imm == BPF_FUNC_cgroup_local_storage_get) {
> > > > > if (env->prog->aux->sleepable)
> > > > > insn_buf[0] = BPF_MOV64_IMM(BPF_REG_5, (__force __s32)GFP_KERNEL);
> > > > > else
> > > > > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> > > > > index 8ad2c267ff47..2fa2c950c7fb 100644
> > > > > --- a/kernel/cgroup/cgroup.c
> > > > > +++ b/kernel/cgroup/cgroup.c
> > > > > @@ -985,6 +985,10 @@ void put_css_set_locked(struct css_set *cset)
> > > > > put_css_set_locked(cset->dom_cset);
> > > > > }
> > > > >
> > > > > +#ifdef CONFIG_BPF_SYSCALL
> > > > > + bpf_local_cgroup_storage_free(cset->dfl_cgrp);
> > > > > +#endif
> > > > > +
> > >
> > > I am confused about this freeing site. It seems like this path is for
> > > freeing css_set's of task_structs, not for freeing the cgroup itself.
> > > Wouldn't we want to free the local storage when we free the cgroup
> > > itself? Somewhere like css_free_rwork_fn()? or did I completely miss
> > > the point here?
> > >
> > > > > kfree_rcu(cset, rcu_head);
> > > > > }
> > > > >
> > > > > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> > > > > index 688552df95ca..179adaae4a9f 100644
> > > > > --- a/kernel/trace/bpf_trace.c
> > > > > +++ b/kernel/trace/bpf_trace.c
> > > > > @@ -1454,6 +1454,10 @@ bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> > > > > return &bpf_get_current_cgroup_id_proto;
> > > > > case BPF_FUNC_get_current_ancestor_cgroup_id:
> > > > > return &bpf_get_current_ancestor_cgroup_id_proto;
> > > > > + case BPF_FUNC_cgroup_local_storage_get:
> > > > > + return &bpf_cgroup_storage_get_proto;
> > > > > + case BPF_FUNC_cgroup_local_storage_delete:
> > > > > + return &bpf_cgroup_storage_delete_proto;
> > > > > #endif
> > > > > case BPF_FUNC_send_signal:
> > > > > return &bpf_send_signal_proto;
> > > > > diff --git a/scripts/bpf_doc.py b/scripts/bpf_doc.py
> > > > > index c0e6690be82a..fdb0aff8cb5a 100755
> > > > > --- a/scripts/bpf_doc.py
> > > > > +++ b/scripts/bpf_doc.py
> > > > > @@ -685,6 +685,7 @@ class PrinterHelpers(Printer):
> > > > > 'struct udp6_sock',
> > > > > 'struct unix_sock',
> > > > > 'struct task_struct',
> > > > > + 'struct cgroup',
> > > > >
> > > > > 'struct __sk_buff',
> > > > > 'struct sk_msg_md',
> > > > > @@ -742,6 +743,7 @@ class PrinterHelpers(Printer):
> > > > > 'struct udp6_sock',
> > > > > 'struct unix_sock',
> > > > > 'struct task_struct',
> > > > > + 'struct cgroup',
> > > > > 'struct path',
> > > > > 'struct btf_ptr',
> > > > > 'struct inode',
> > > > > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> > > > > index 17f61338f8f8..d918b4054297 100644
> > > > > --- a/tools/include/uapi/linux/bpf.h
> > > > > +++ b/tools/include/uapi/linux/bpf.h
> > > > > @@ -935,6 +935,7 @@ enum bpf_map_type {
> > > > > BPF_MAP_TYPE_TASK_STORAGE,
> > > > > BPF_MAP_TYPE_BLOOM_FILTER,
> > > > > BPF_MAP_TYPE_USER_RINGBUF,
> > > > > + BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE,
> > > > > };
> > > > >
> > > > > /* Note that tracing related programs such as
> > > > > @@ -5435,6 +5436,42 @@ union bpf_attr {
> > > > > * **-E2BIG** if user-space has tried to publish a sample which is
> > > > > * larger than the size of the ring buffer, or which cannot fit
> > > > > * within a struct bpf_dynptr.
> > > > > + *
> > > > > + * void *bpf_cgroup_local_storage_get(struct bpf_map *map, struct cgroup *cgroup, void *value, u64 flags)
> > > > > + * Description
> > > > > + * Get a bpf_local_storage from the *cgroup*.
> > > > > + *
> > > > > + * Logically, it could be thought of as getting the value from
> > > > > + * a *map* with *cgroup* as the **key**. From this
> > > > > + * perspective, the usage is not much different from
> > > > > + * **bpf_map_lookup_elem**\ (*map*, **&**\ *cgroup*) except this
> > > > > + * helper enforces the key must be a cgroup struct and the map must also
> > > > > + * be a **BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE**.
> > > > > + *
> > > > > + * Underneath, the value is stored locally at *cgroup* instead of
> > > > > + * the *map*. The *map* is used as the bpf-local-storage
> > > > > + * "type". The bpf-local-storage "type" (i.e. the *map*) is
> > > > > + * searched against all bpf_local_storage residing at *cgroup*.
> > > > > + *
> > > > > + * An optional *flags* (**BPF_LOCAL_STORAGE_GET_F_CREATE**) can be
> > > > > + * used such that a new bpf_local_storage will be
> > > > > + * created if one does not exist. *value* can be used
> > > > > + * together with **BPF_LOCAL_STORAGE_GET_F_CREATE** to specify
> > > > > + * the initial value of a bpf_local_storage. If *value* is
> > > > > + * **NULL**, the new bpf_local_storage will be zero initialized.
> > > > > + * Return
> > > > > + * A bpf_local_storage pointer is returned on success.
> > > > > + *
> > > > > + * **NULL** if not found or there was an error in adding
> > > > > + * a new bpf_local_storage.
> > > > > + *
> > > > > + * long bpf_cgroup_local_storage_delete(struct bpf_map *map, struct cgroup *cgroup)
> > > > > + * Description
> > > > > + * Delete a bpf_local_storage from a *cgroup*.
> > > > > + * Return
> > > > > + * 0 on success.
> > > > > + *
> > > > > + * **-ENOENT** if the bpf_local_storage cannot be found.
> > > > > */
> > > > > #define ___BPF_FUNC_MAPPER(FN, ctx...) \
> > > > > FN(unspec, 0, ##ctx) \
> > > > > @@ -5647,6 +5684,8 @@ union bpf_attr {
> > > > > FN(tcp_raw_check_syncookie_ipv6, 207, ##ctx) \
> > > > > FN(ktime_get_tai_ns, 208, ##ctx) \
> > > > > FN(user_ringbuf_drain, 209, ##ctx) \
> > > > > + FN(cgroup_local_storage_get, 210, ##ctx) \
> > > > > + FN(cgroup_local_storage_delete, 211, ##ctx) \
> > > > > /* */
> > > > >
> > > > > /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that don't
> > > > > --
> > > > > 2.30.2

^ permalink raw reply	[flat|nested] 38+ messages in thread
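Taken together, the uapi documentation quoted above suggests BPF-side usage along these lines. This is an illustrative sketch only, not code from the series: it assumes the (unmerged) map type and helpers from this patch set plus a vmlinux.h/libbpf build environment, and the `struct cgroup_counter` value type and tracepoint attach point are invented for the example:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* Hypothetical per-cgroup value type for this example. */
struct cgroup_counter {
	__u64 events;
};

struct {
	__uint(type, BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__type(key, int);
	__type(value, struct cgroup_counter);
} cgrp_map SEC(".maps");

SEC("tp_btf/sys_enter")
int BPF_PROG(count_per_cgroup)
{
	struct task_struct *task = bpf_get_current_task_btf();
	struct cgroup_counter *val;

	/* dfl_cgrp is the task's cgroup on the default hierarchy. */
	val = bpf_cgroup_local_storage_get(&cgrp_map, task->cgroups->dfl_cgrp,
					   0, BPF_LOCAL_STORAGE_GET_F_CREATE);
	if (!val)	/* unlike bpf_get_local_storage(), NULL is possible */
		return 0;
	__sync_fetch_and_add(&val->events, 1);
	return 0;
}

char _license[] SEC("license") = "GPL";
```

Note the mandatory NULL check: as discussed later in this thread, the new helper can return **NULL** (e.g. when allocation fails or the busy trylock is held), unlike the old `bpf_get_local_storage()` for cgroup-attached programs.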
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs
  2022-10-17 19:07 ` Stanislav Fomichev
@ 2022-10-17 19:11 ` Yosry Ahmed
  2022-10-17 19:26 ` Tejun Heo
  2022-10-17 21:07 ` Martin KaFai Lau
  0 siblings, 2 replies; 38+ messages in thread
From: Yosry Ahmed @ 2022-10-17 19:11 UTC (permalink / raw)
To: Stanislav Fomichev
Cc: Yonghong Song, bpf, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo

On Mon, Oct 17, 2022 at 12:07 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> On Mon, Oct 17, 2022 at 11:47 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Mon, Oct 17, 2022 at 11:43 AM Stanislav Fomichev <sdf@google.com> wrote:
> > >
> > > On Mon, Oct 17, 2022 at 11:26 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > >
> > > > On Mon, Oct 17, 2022 at 11:02 AM <sdf@google.com> wrote:
> > > > >
> > > > > On 10/13, Yonghong Song wrote:
> > > > > > Similar to sk/inode/task storage, implement similar cgroup local storage.
> > > > > >
> > > > > > There already exists a local storage implementation for cgroup-attached
> > > > > > bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper
> > > > > > bpf_get_local_storage(). But there are use cases such that non-cgroup
> > > > > > attached bpf progs wants to access cgroup local storage data. For example,
> > > > > > tc egress prog has access to sk and cgroup. It is possible to use
> > > > > > sk local storage to emulate cgroup local storage by storing data in socket.
> > > > > > But this is a waste as it could be lots of sockets belonging to a particular
> > > > > > cgroup. Alternatively, a separate map can be created with cgroup id as the key.
> > > > > > But this will introduce additional overhead to manipulate the new map.
> > > > > > A cgroup local storage, similar to existing sk/inode/task storage,
> > > > > > should help for this use case.
> > > > > >
> > > > > > The life-cycle of storage is managed with the life-cycle of the
> > > > > > cgroup struct. i.e. the storage is destroyed along with the owning cgroup
> > > > > > with a callback to the bpf_cgroup_storage_free when cgroup itself
> > > > > > is deleted.
> > > > > >
> > > > > > The userspace map operations can be done by using a cgroup fd as a key
> > > > > > passed to the lookup, update and delete operations.
> > > > >
> > > > > [..]
> > > > >
> > > > > > Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old cgroup local
> > > > > > storage support, the new map name BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE is used
> > > > > > for cgroup storage available to non-cgroup-attached bpf programs. The two
> > > > > > helpers are named as bpf_cgroup_local_storage_get() and
> > > > > > bpf_cgroup_local_storage_delete().
> > > > >
> > > > > Have you considered doing something similar to 7d9c3427894f ("bpf: Make
> > > > > cgroup storages shared between programs on the same cgroup") where
> > > > > the map changes its behavior depending on the key size (see key_size checks
> > > > > in cgroup_storage_map_alloc)? Looks like sizeof(int) for fd still
> > > > > can be used so we can, in theory, reuse the name..
> > > > >
> > > > > Pros:
> > > > > - no need for a new map name
> > > > >
> > > > > Cons:
> > > > > - existing BPF_MAP_TYPE_CGROUP_STORAGE is already messy; might be not a
> > > > > good idea to add more stuff to it?
> > > > >
> > > > > But, for the very least, should we also extend
> > > > > Documentation/bpf/map_cgroup_storage.rst to cover the new map? We've
> > > > > tried to keep some of the important details in there..
> > > >
> > > > This might be a long shot, but is it possible to switch completely to
> > > > this new generic cgroup storage, and for programs that attach to
> > > > cgroups we can still do lookups/allocations during attachment like we
> > > > do today? IOW, maintain the current API for cgroup progs but switch it
> > > > to use this new map type instead.
> > > >
> > > > It feels like this map type is more generic and can be a superset of
> > > > the existing cgroup storage, but I feel like I am missing something.
> > >
> > > I feel like the biggest issue is that the existing
> > > bpf_get_local_storage helper is guaranteed to always return non-null
> > > and the verifier doesn't require the programs to do null checks on it;
> > > the new helper might return NULL making all existing programs fail the
> > > verifier.
> >
> > What I meant is, keep the old bpf_get_local_storage helper only for
> > cgroup-attached programs like we have today, and add a new generic
> > bpf_cgroup_local_storage_get() helper.
> >
> > For cgroup-attached programs, make sure a cgroup storage entry is
> > allocated and hooked to the helper on program attach time, to keep
> > today's behavior constant.
> >
> > For other programs, the bpf_cgroup_local_storage_get() will do the
> > normal lookup and allocate if necessary.
> >
> > Does this make any sense to you?
>
> But then you also need to somehow mark these to make sure it's not
> possible to delete them as long as the program is loaded/attached? Not
> saying it's impossible, but it's a bit of a departure from the
> existing common local storage framework used by inode/task; not sure
> whether we want to pull all this complexity in there? But we can
> definitely try if there is a wider agreement..

I agree that it's not ideal, but it feels like we are comparing two
non-ideal options anyway, I am just throwing ideas around :)

> > > There might be something else I don't remember at this point (besides
> > > that weird per-prog_type that we'd have to emulate as well)..
> >
> > Yeah there are things that will need to be emulated, but I feel like
> > we may end up with less confusing code (and less code in general).
> >
> > > > > >
> > > > > > Signed-off-by: Yonghong Song <yhs@fb.com>
> > > > > > ---
> > > > > > include/linux/bpf.h | 3 +
> > > > > > include/linux/bpf_types.h | 1 +
> > > > > > include/linux/cgroup-defs.h | 4 +
> > > > > > include/uapi/linux/bpf.h | 39 +++++
> > > > > > kernel/bpf/Makefile | 2 +-
> > > > > > kernel/bpf/bpf_cgroup_storage.c | 280 ++++++++++++++++++++++++++++++++
> > > > > > kernel/bpf/helpers.c | 6 +
> > > > > > kernel/bpf/syscall.c | 3 +-
> > > > > > kernel/bpf/verifier.c | 14 +-
> > > > > > kernel/cgroup/cgroup.c | 4 +
> > > > > > kernel/trace/bpf_trace.c | 4 +
> > > > > > scripts/bpf_doc.py | 2 +
> > > > > > tools/include/uapi/linux/bpf.h | 39 +++++
> > > > > > 13 files changed, 398 insertions(+), 3 deletions(-)
> > > > > > create mode 100644 kernel/bpf/bpf_cgroup_storage.c
> > > > > >
> > > > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > > > > index 9e7d46d16032..1395a01c7f18 100644
> > > > > > --- a/include/linux/bpf.h
> > > > > > +++ b/include/linux/bpf.h
> > > > > > @@ -2045,6 +2045,7 @@ struct bpf_link *bpf_link_by_id(u32 id);
> > > > > >
> > > > > > const struct bpf_func_proto *bpf_base_func_proto(enum bpf_func_id func_id);
> > > > > > void bpf_task_storage_free(struct task_struct *task);
> > > > > > +void bpf_local_cgroup_storage_free(struct cgroup *cgroup);
> > > > > > bool bpf_prog_has_kfunc_call(const struct bpf_prog *prog);
> > > > > > const struct btf_func_model *
> > > > > > bpf_jit_find_kfunc_model(const struct bpf_prog *prog,
> > > > > > @@ -2537,6 +2538,8 @@ extern const struct bpf_func_proto bpf_copy_from_user_task_proto;
> > > > > > extern const struct bpf_func_proto bpf_set_retval_proto;
> > > > > > extern const struct bpf_func_proto bpf_get_retval_proto;
> > > > > > extern const struct bpf_func_proto bpf_user_ringbuf_drain_proto;
> > > > > > +extern const struct bpf_func_proto bpf_cgroup_storage_get_proto;
> > > > > > +extern const struct bpf_func_proto bpf_cgroup_storage_delete_proto;
> > > > > >
> > > > > > const struct bpf_func_proto *tracing_prog_func_proto(
> > > > > > enum bpf_func_id func_id, const struct bpf_prog *prog);
> > > > > > diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> > > > > > index 2c6a4f2562a7..7a0362d7a0aa 100644
> > > > > > --- a/include/linux/bpf_types.h
> > > > > > +++ b/include/linux/bpf_types.h
> > > > > > @@ -90,6 +90,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_ARRAY, cgroup_array_map_ops)
> > > > > > #ifdef CONFIG_CGROUP_BPF
> > > > > > BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_STORAGE, cgroup_storage_map_ops)
> > > > > > BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE, cgroup_storage_map_ops)
> > > > > > +BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, cgroup_local_storage_map_ops)
> > > > > > #endif
> > > > > > BPF_MAP_TYPE(BPF_MAP_TYPE_HASH, htab_map_ops)
> > > > > > BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_HASH, htab_percpu_map_ops)
> > > > > > diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> > > > > > index 4bcf56b3491c..c6f4590dda68 100644
> > > > > > --- a/include/linux/cgroup-defs.h
> > > > > > +++ b/include/linux/cgroup-defs.h
> > > > > > @@ -504,6 +504,10 @@ struct cgroup {
> > > > > > /* Used to store internal freezer state */
> > > > > > struct cgroup_freezer_state freezer;
> > > > > >
> > > > > > +#ifdef CONFIG_BPF_SYSCALL
> > > > > > + struct bpf_local_storage __rcu *bpf_cgroup_storage;
> > > > > > +#endif
> > > > > > +
> > > > > > /* ids of the ancestors at each level including self */
> > > > > > u64 ancestor_ids[];
> > > > > > };
> > > > > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > > > > index 17f61338f8f8..d918b4054297 100644
> > > > > > --- a/include/uapi/linux/bpf.h
> > > > > > +++ b/include/uapi/linux/bpf.h
> > > > > > @@ -935,6 +935,7 @@ enum bpf_map_type {
> > > > > > BPF_MAP_TYPE_TASK_STORAGE,
> > > > > > BPF_MAP_TYPE_BLOOM_FILTER,
> > > > > > BPF_MAP_TYPE_USER_RINGBUF,
> > > > > > + BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE,
> > > > > > };
> > > > > >
> > > > > > /* Note that tracing related programs such as
> > > > > > @@ -5435,6 +5436,42 @@ union bpf_attr {
> > > > > > * **-E2BIG** if user-space has tried to publish a sample which is
> > > > > > * larger than the size of the ring buffer, or which cannot fit
> > > > > > * within a struct bpf_dynptr.
> > > > > > + *
> > > > > > + * void *bpf_cgroup_local_storage_get(struct bpf_map *map, struct cgroup *cgroup, void *value, u64 flags)
> > > > > > + * Description
> > > > > > + * Get a bpf_local_storage from the *cgroup*.
> > > > > > + *
> > > > > > + * Logically, it could be thought of as getting the value from
> > > > > > + * a *map* with *cgroup* as the **key**. From this
> > > > > > + * perspective, the usage is not much different from
> > > > > > + * **bpf_map_lookup_elem**\ (*map*, **&**\ *cgroup*) except this
> > > > > > + * helper enforces the key must be a cgroup struct and the map must also
> > > > > > + * be a **BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE**.
> > > > > > + *
> > > > > > + * Underneath, the value is stored locally at *cgroup* instead of
> > > > > > + * the *map*. The *map* is used as the bpf-local-storage
> > > > > > + * "type". The bpf-local-storage "type" (i.e. the *map*) is
> > > > > > + * searched against all bpf_local_storage residing at *cgroup*.
> > > > > > + *
> > > > > > + * An optional *flags* (**BPF_LOCAL_STORAGE_GET_F_CREATE**) can be
> > > > > > + * used such that a new bpf_local_storage will be
> > > > > > + * created if one does not exist. *value* can be used
> > > > > > + * together with **BPF_LOCAL_STORAGE_GET_F_CREATE** to specify
> > > > > > + * the initial value of a bpf_local_storage. If *value* is
> > > > > > + * **NULL**, the new bpf_local_storage will be zero initialized.
> > > > > > + * Return
> > > > > > + * A bpf_local_storage pointer is returned on success.
> > > > > > + *
> > > > > > + * **NULL** if not found or there was an error in adding
> > > > > > + * a new bpf_local_storage.
> > > > > > + *
> > > > > > + * long bpf_cgroup_local_storage_delete(struct bpf_map *map, struct cgroup *cgroup)
> > > > > > + * Description
> > > > > > + * Delete a bpf_local_storage from a *cgroup*.
> > > > > > + * Return
> > > > > > + * 0 on success.
> > > > > > + *
> > > > > > + * **-ENOENT** if the bpf_local_storage cannot be found.
> > > > > > */
> > > > > > #define ___BPF_FUNC_MAPPER(FN, ctx...) \
> > > > > > FN(unspec, 0, ##ctx) \
> > > > > > @@ -5647,6 +5684,8 @@ union bpf_attr {
> > > > > > FN(tcp_raw_check_syncookie_ipv6, 207, ##ctx) \
> > > > > > FN(ktime_get_tai_ns, 208, ##ctx) \
> > > > > > FN(user_ringbuf_drain, 209, ##ctx) \
> > > > > > + FN(cgroup_local_storage_get, 210, ##ctx) \
> > > > > > + FN(cgroup_local_storage_delete, 211, ##ctx) \
> > > > > > /* */
> > > > > >
> > > > > > /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that don't
> > > > > > diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> > > > > > index 341c94f208f4..b02693f51978 100644
> > > > > > --- a/kernel/bpf/Makefile
> > > > > > +++ b/kernel/bpf/Makefile
> > > > > > @@ -25,7 +25,7 @@ ifeq ($(CONFIG_PERF_EVENTS),y)
> > > > > > obj-$(CONFIG_BPF_SYSCALL) += stackmap.o
> > > > > > endif
> > > > > > ifeq ($(CONFIG_CGROUPS),y)
> > > > > > -obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o
> > > > > > +obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o bpf_cgroup_storage.o
> > > > > > endif
> > > > > > obj-$(CONFIG_CGROUP_BPF) += cgroup.o
> > > > > > ifeq ($(CONFIG_INET),y)
> > > > > > diff --git a/kernel/bpf/bpf_cgroup_storage.c b/kernel/bpf/bpf_cgroup_storage.c
> > > > > > new file mode 100644
> > > > > > index 000000000000..9974784822da
> > > > > > --- /dev/null
> > > > > > +++ b/kernel/bpf/bpf_cgroup_storage.c
> > > > > > @@ -0,0 +1,280 @@
> > > > > > +// SPDX-License-Identifier: GPL-2.0
> > > > > > +/*
> > > > > > + * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
> > > > > > + */
> > > > > > +
> > > > > > +#include <linux/types.h>
> > > > > > +#include <linux/bpf.h>
> > > > > > +#include <linux/bpf_local_storage.h>
> > > > > > +#include <uapi/linux/btf.h>
> > > > > > +#include <linux/btf_ids.h>
> > > > > > +
> > > > > > +DEFINE_BPF_STORAGE_CACHE(cgroup_cache);
> > > > > > +
> > > > > > +static DEFINE_PER_CPU(int, bpf_cgroup_storage_busy);
> > > > > > +
> > > > > > +static void bpf_cgroup_storage_lock(void)
> > > > > > +{
> > > > > > + migrate_disable();
> > > > > > + this_cpu_inc(bpf_cgroup_storage_busy);
> > > > > > +}
> > > > > > +
> > > > > > +static void bpf_cgroup_storage_unlock(void)
> > > > > > +{
> > > > > > + this_cpu_dec(bpf_cgroup_storage_busy);
> > > > > > + migrate_enable();
> > > > > > +}
> > > > > > +
> > > > > > +static bool bpf_cgroup_storage_trylock(void)
> > > > > > +{
> > > > > > + migrate_disable();
> > > > > > + if (unlikely(this_cpu_inc_return(bpf_cgroup_storage_busy) != 1)) {
> > > > > > + this_cpu_dec(bpf_cgroup_storage_busy);
> > > > > > + migrate_enable();
> > > > > > + return false;
> > > > > > + }
> > > > > > + return true;
> > > > > > +}
> > > > >
> > > > > Task storage has lock/unlock/trylock; inode storage doesn't; why does
> > > > > cgroup need it as well?
> > > > >
> > > > > > +static struct bpf_local_storage __rcu **cgroup_storage_ptr(void *owner)
> > > > > > +{
> > > > > > + struct cgroup *cg = owner;
> > > > > > +
> > > > > > + return &cg->bpf_cgroup_storage;
> > > > > > +}
> > > > > > +
> > > > > > +void bpf_local_cgroup_storage_free(struct cgroup *cgroup)
> > > > > > +{
> > > > > > + struct bpf_local_storage *local_storage;
> > > > > > + struct bpf_local_storage_elem *selem;
> > > > > > + bool free_cgroup_storage = false;
> > > > > > + struct hlist_node *n;
> > > > > > + unsigned long flags;
> > > > > > +
> > > > > > + rcu_read_lock();
> > > > > > + local_storage = rcu_dereference(cgroup->bpf_cgroup_storage);
> > > > > > + if (!local_storage) {
> > > > > > + rcu_read_unlock();
> > > > > > + return;
> > > > > > + }
> > > > > > +
> > > > > > + /* Neither the bpf_prog nor the bpf-map's syscall
> > > > > > + * could be modifying the local_storage->list now.
> > > > > > + * Thus, no elem can be added-to or deleted-from the
> > > > > > + * local_storage->list by the bpf_prog or by the bpf-map's syscall.
> > > > > > + *
> > > > > > + * It is racing with bpf_local_storage_map_free() alone
> > > > > > + * when unlinking elem from the local_storage->list and
> > > > > > + * the map's bucket->list.
> > > > > > + */
> > > > > > + bpf_cgroup_storage_lock();
> > > > > > + raw_spin_lock_irqsave(&local_storage->lock, flags);
> > > > > > + hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) {
> > > > > > + bpf_selem_unlink_map(selem);
> > > > > > + free_cgroup_storage =
> > > > > > + bpf_selem_unlink_storage_nolock(local_storage, selem, false, false);
> > > > > > + }
> > > > > > + raw_spin_unlock_irqrestore(&local_storage->lock, flags);
> > > > > > + bpf_cgroup_storage_unlock();
> > > > > > + rcu_read_unlock();
> > > > > > +
> > > > > > + /* free_cgroup_storage should always be true as long as
> > > > > > + * local_storage->list was non-empty.
> > > > > > + */
> > > > > > + if (free_cgroup_storage)
> > > > > > + kfree_rcu(local_storage, rcu);
> > > > > > +}
> > > > > >
> > > > > > +static struct bpf_local_storage_data *
> > > > > > +cgroup_storage_lookup(struct cgroup *cgroup, struct bpf_map *map, bool cacheit_lockit)
> > > > > > +{
> > > > > > + struct bpf_local_storage *cgroup_storage;
> > > > > > + struct bpf_local_storage_map *smap;
> > > > > > +
> > > > > > + cgroup_storage = rcu_dereference_check(cgroup->bpf_cgroup_storage,
> > > > > > + bpf_rcu_lock_held());
> > > > > > + if (!cgroup_storage)
> > > > > > + return NULL;
> > > > > > +
> > > > > > + smap = (struct bpf_local_storage_map *)map;
> > > > > > + return bpf_local_storage_lookup(cgroup_storage, smap, cacheit_lockit);
> > > > > > +}
> > > > > > +
> > > > > > +static void *bpf_cgroup_storage_lookup_elem(struct bpf_map *map, void *key)
> > > > > > +{
> > > > > > + struct bpf_local_storage_data *sdata;
> > > > > > + struct cgroup *cgroup;
> > > > > > + int fd;
> > > > > > +
> > > > > > + fd = *(int *)key;
> > > > > > + cgroup = cgroup_get_from_fd(fd);
> > > > > > + if (IS_ERR(cgroup))
> > > > > > + return ERR_CAST(cgroup);
> > > > > > +
> > > > > > + bpf_cgroup_storage_lock();
> > > > > > + sdata = cgroup_storage_lookup(cgroup, map, true);
> > > > > > + bpf_cgroup_storage_unlock();
> > > > > > + cgroup_put(cgroup);
> > > > > > + return sdata ? sdata->data : NULL;
> > > > > > +}
> > > > >
> > > > > A lot of the above (free/lookup) seems to be copy-pasted from the task storage;
> > > > > any point in trying to generalize the common parts?
> > > > > > > > > > > +static int bpf_cgroup_storage_update_elem(struct bpf_map *map, void *key, > > > > > > + void *value, u64 map_flags) > > > > > > +{ > > > > > > + struct bpf_local_storage_data *sdata; > > > > > > + struct cgroup *cgroup; > > > > > > + int err, fd; > > > > > > + > > > > > > + fd = *(int *)key; > > > > > > + cgroup = cgroup_get_from_fd(fd); > > > > > > + if (IS_ERR(cgroup)) > > > > > > + return PTR_ERR(cgroup); > > > > > > + > > > > > > + bpf_cgroup_storage_lock(); > > > > > > + sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map > > > > > > *)map, > > > > > > + value, map_flags, GFP_ATOMIC); > > > > > > + bpf_cgroup_storage_unlock(); > > > > > > + err = PTR_ERR_OR_ZERO(sdata); > > > > > > + cgroup_put(cgroup); > > > > > > + return err; > > > > > > +} > > > > > > + > > > > > > +static int cgroup_storage_delete(struct cgroup *cgroup, struct bpf_map > > > > > > *map) > > > > > > +{ > > > > > > + struct bpf_local_storage_data *sdata; > > > > > > + > > > > > > + sdata = cgroup_storage_lookup(cgroup, map, false); > > > > > > + if (!sdata) > > > > > > + return -ENOENT; > > > > > > + > > > > > > + bpf_selem_unlink(SELEM(sdata), true); > > > > > > + return 0; > > > > > > +} > > > > > > + > > > > > > +static int bpf_cgroup_storage_delete_elem(struct bpf_map *map, void *key) > > > > > > +{ > > > > > > + struct cgroup *cgroup; > > > > > > + int err, fd; > > > > > > + > > > > > > + fd = *(int *)key; > > > > > > + cgroup = cgroup_get_from_fd(fd); > > > > > > + if (IS_ERR(cgroup)) > > > > > > + return PTR_ERR(cgroup); > > > > > > + > > > > > > + bpf_cgroup_storage_lock(); > > > > > > + err = cgroup_storage_delete(cgroup, map); > > > > > > + bpf_cgroup_storage_unlock(); > > > > > > + if (err) > > > > > > + return err; > > > > > > + > > > > > > + cgroup_put(cgroup); > > > > > > + return 0; > > > > > > +} > > > > > > + > > > > > > +static int notsupp_get_next_key(struct bpf_map *map, void *key, void > > > > > > *next_key) > > > > > > +{ > > > 
> > > + return -ENOTSUPP; > > > > > > +} > > > > > > + > > > > > > +static struct bpf_map *cgroup_storage_map_alloc(union bpf_attr *attr) > > > > > > +{ > > > > > > + struct bpf_local_storage_map *smap; > > > > > > + > > > > > > + smap = bpf_local_storage_map_alloc(attr); > > > > > > + if (IS_ERR(smap)) > > > > > > + return ERR_CAST(smap); > > > > > > + > > > > > > + smap->cache_idx = bpf_local_storage_cache_idx_get(&cgroup_cache); > > > > > > + return &smap->map; > > > > > > +} > > > > > > + > > > > > > +static void cgroup_storage_map_free(struct bpf_map *map) > > > > > > +{ > > > > > > + struct bpf_local_storage_map *smap; > > > > > > + > > > > > > + smap = (struct bpf_local_storage_map *)map; > > > > > > + bpf_local_storage_cache_idx_free(&cgroup_cache, smap->cache_idx); > > > > > > + bpf_local_storage_map_free(smap, NULL); > > > > > > +} > > > > > > + > > > > > > +/* *gfp_flags* is a hidden argument provided by the verifier */ > > > > > > +BPF_CALL_5(bpf_cgroup_storage_get, struct bpf_map *, map, struct cgroup > > > > > > *, cgroup, > > > > > > + void *, value, u64, flags, gfp_t, gfp_flags) > > > > > > +{ > > > > > > + struct bpf_local_storage_data *sdata; > > > > > > + > > > > > > + WARN_ON_ONCE(!bpf_rcu_lock_held()); > > > > > > + if (flags & ~(BPF_LOCAL_STORAGE_GET_F_CREATE)) > > > > > > + return (unsigned long)NULL; > > > > > > + > > > > > > + if (!cgroup) > > > > > > + return (unsigned long)NULL; > > > > > > + > > > > > > + if (!bpf_cgroup_storage_trylock()) > > > > > > + return (unsigned long)NULL; > > > > > > + > > > > > > + sdata = cgroup_storage_lookup(cgroup, map, true); > > > > > > + if (sdata) > > > > > > + goto unlock; > > > > > > + > > > > > > + /* only allocate new storage, when the cgroup is refcounted */ > > > > > > + if (!percpu_ref_is_dying(&cgroup->self.refcnt) && > > > > > > + (flags & BPF_LOCAL_STORAGE_GET_F_CREATE)) > > > > > > + sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map > > > > > > *)map, > > > > > > + value, 
BPF_NOEXIST, gfp_flags); > > > > > > + > > > > > > +unlock: > > > > > > + bpf_cgroup_storage_unlock(); > > > > > > + return IS_ERR_OR_NULL(sdata) ? (unsigned long)NULL : (unsigned > > > > > > long)sdata->data; > > > > > > +} > > > > > > + > > > > > > +BPF_CALL_2(bpf_cgroup_storage_delete, struct bpf_map *, map, struct > > > > > > cgroup *, cgroup) > > > > > > +{ > > > > > > + int ret; > > > > > > + > > > > > > + WARN_ON_ONCE(!bpf_rcu_lock_held()); > > > > > > + if (!cgroup) > > > > > > + return -EINVAL; > > > > > > + > > > > > > + if (!bpf_cgroup_storage_trylock()) > > > > > > + return -EBUSY; > > > > > > + > > > > > > + ret = cgroup_storage_delete(cgroup, map); > > > > > > + bpf_cgroup_storage_unlock(); > > > > > > + return ret; > > > > > > +} > > > > > > + > > > > > > +BTF_ID_LIST_SINGLE(cgroup_storage_map_btf_ids, struct, > > > > > > bpf_local_storage_map) > > > > > > +const struct bpf_map_ops cgroup_local_storage_map_ops = { > > > > > > + .map_meta_equal = bpf_map_meta_equal, > > > > > > + .map_alloc_check = bpf_local_storage_map_alloc_check, > > > > > > + .map_alloc = cgroup_storage_map_alloc, > > > > > > + .map_free = cgroup_storage_map_free, > > > > > > + .map_get_next_key = notsupp_get_next_key, > > > > > > + .map_lookup_elem = bpf_cgroup_storage_lookup_elem, > > > > > > + .map_update_elem = bpf_cgroup_storage_update_elem, > > > > > > + .map_delete_elem = bpf_cgroup_storage_delete_elem, > > > > > > + .map_check_btf = bpf_local_storage_map_check_btf, > > > > > > + .map_btf_id = &cgroup_storage_map_btf_ids[0], > > > > > > + .map_owner_storage_ptr = cgroup_storage_ptr, > > > > > > +}; > > > > > > + > > > > > > +const struct bpf_func_proto bpf_cgroup_storage_get_proto = { > > > > > > + .func = bpf_cgroup_storage_get, > > > > > > + .gpl_only = false, > > > > > > + .ret_type = RET_PTR_TO_MAP_VALUE_OR_NULL, > > > > > > + .arg1_type = ARG_CONST_MAP_PTR, > > > > > > + .arg2_type = ARG_PTR_TO_BTF_ID, > > > > > > + .arg2_btf_id = &bpf_cgroup_btf_id[0], > > > > > > + 
> > > > > > +	.arg3_type = ARG_PTR_TO_MAP_VALUE_OR_NULL,
> > > > > > +	.arg4_type = ARG_ANYTHING,
> > > > > > +};
> > > > > > +
> > > > > > +const struct bpf_func_proto bpf_cgroup_storage_delete_proto = {
> > > > > > +	.func = bpf_cgroup_storage_delete,
> > > > > > +	.gpl_only = false,
> > > > > > +	.ret_type = RET_INTEGER,
> > > > > > +	.arg1_type = ARG_CONST_MAP_PTR,
> > > > > > +	.arg2_type = ARG_PTR_TO_BTF_ID,
> > > > > > +	.arg2_btf_id = &bpf_cgroup_btf_id[0],
> > > > > > +};
> > > > > > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> > > > > > index a6b04faed282..5c5bb08832ec 100644
> > > > > > --- a/kernel/bpf/helpers.c
> > > > > > +++ b/kernel/bpf/helpers.c
> > > > > > @@ -1663,6 +1663,12 @@ bpf_base_func_proto(enum bpf_func_id func_id)
> > > > > >  		return &bpf_dynptr_write_proto;
> > > > > >  	case BPF_FUNC_dynptr_data:
> > > > > >  		return &bpf_dynptr_data_proto;
> > > > > > +#ifdef CONFIG_CGROUPS
> > > > > > +	case BPF_FUNC_cgroup_local_storage_get:
> > > > > > +		return &bpf_cgroup_storage_get_proto;
> > > > > > +	case BPF_FUNC_cgroup_local_storage_delete:
> > > > > > +		return &bpf_cgroup_storage_delete_proto;
> > > > > > +#endif
> > > > > >  	default:
> > > > > >  		break;
> > > > > >  	}
> > > > > > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > > > > > index 7b373a5e861f..e53c7fae6e22 100644
> > > > > > --- a/kernel/bpf/syscall.c
> > > > > > +++ b/kernel/bpf/syscall.c
> > > > > > @@ -1016,7 +1016,8 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
> > > > > >  		    map->map_type != BPF_MAP_TYPE_CGROUP_STORAGE &&
> > > > > >  		    map->map_type != BPF_MAP_TYPE_SK_STORAGE &&
> > > > > >  		    map->map_type != BPF_MAP_TYPE_INODE_STORAGE &&
> > > > > > -		    map->map_type != BPF_MAP_TYPE_TASK_STORAGE)
> > > > > > +		    map->map_type != BPF_MAP_TYPE_TASK_STORAGE &&
> > > > > > +		    map->map_type != BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE)
> > > > > >  			return -ENOTSUPP;
> > > > > >  		if (map->spin_lock_off + sizeof(struct bpf_spin_lock) > map->value_size) {
> > > > > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > > > > > index 6f6d2d511c06..f36f6a3c0d50 100644
> > > > > > --- a/kernel/bpf/verifier.c
> > > > > > +++ b/kernel/bpf/verifier.c
> > > > > > @@ -6360,6 +6360,11 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
> > > > > >  		    func_id != BPF_FUNC_task_storage_delete)
> > > > > >  			goto error;
> > > > > >  		break;
> > > > > > +	case BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE:
> > > > > > +		if (func_id != BPF_FUNC_cgroup_local_storage_get &&
> > > > > > +		    func_id != BPF_FUNC_cgroup_local_storage_delete)
> > > > > > +			goto error;
> > > > > > +		break;
> > > > > >  	case BPF_MAP_TYPE_BLOOM_FILTER:
> > > > > >  		if (func_id != BPF_FUNC_map_peek_elem &&
> > > > > >  		    func_id != BPF_FUNC_map_push_elem)
> > > > > > @@ -6472,6 +6477,11 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
> > > > > >  		if (map->map_type != BPF_MAP_TYPE_TASK_STORAGE)
> > > > > >  			goto error;
> > > > > >  		break;
> > > > > > +	case BPF_FUNC_cgroup_local_storage_get:
> > > > > > +	case BPF_FUNC_cgroup_local_storage_delete:
> > > > > > +		if (map->map_type != BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE)
> > > > > > +			goto error;
> > > > > > +		break;
> > > > > >  	default:
> > > > > >  		break;
> > > > > >  	}
> > > > > > @@ -12713,6 +12723,7 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
> > > > > >  	case BPF_MAP_TYPE_INODE_STORAGE:
> > > > > >  	case BPF_MAP_TYPE_SK_STORAGE:
> > > > > >  	case BPF_MAP_TYPE_TASK_STORAGE:
> > > > > > +	case BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE:
> > > > > >  		break;
> > > > > >  	default:
> > > > > >  		verbose(env,
> > > > > > @@ -14149,7 +14160,8 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
> > > > > >  		if (insn->imm == BPF_FUNC_task_storage_get ||
> > > > > >  		    insn->imm == BPF_FUNC_sk_storage_get ||
> > > > > > -		    insn->imm == BPF_FUNC_inode_storage_get) {
> > > > > > +		    insn->imm == BPF_FUNC_inode_storage_get ||
> > > > > > +		    insn->imm == BPF_FUNC_cgroup_local_storage_get) {
> > > > > >  			if (env->prog->aux->sleepable)
> > > > > >  				insn_buf[0] = BPF_MOV64_IMM(BPF_REG_5, (__force __s32)GFP_KERNEL);
> > > > > >  			else
> > > > > > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> > > > > > index 8ad2c267ff47..2fa2c950c7fb 100644
> > > > > > --- a/kernel/cgroup/cgroup.c
> > > > > > +++ b/kernel/cgroup/cgroup.c
> > > > > > @@ -985,6 +985,10 @@ void put_css_set_locked(struct css_set *cset)
> > > > > >  		put_css_set_locked(cset->dom_cset);
> > > > > >  	}
> > > > > > 
> > > > > > +#ifdef CONFIG_BPF_SYSCALL
> > > > > > +	bpf_local_cgroup_storage_free(cset->dfl_cgrp);
> > > > > > +#endif
> > > > > > +
> > > > 
> > > > I am confused about this freeing site. It seems like this path is for
> > > > freeing css_set's of task_structs, not for freeing the cgroup itself.
> > > > Wouldn't we want to free the local storage when we free the cgroup
> > > > itself? Somewhere like css_free_rwork_fn()? or did I completely miss
> > > > the point here?
> > > > > >  	kfree_rcu(cset, rcu_head);
> > > > > >  }
> > > > > > 
> > > > > > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> > > > > > index 688552df95ca..179adaae4a9f 100644
> > > > > > --- a/kernel/trace/bpf_trace.c
> > > > > > +++ b/kernel/trace/bpf_trace.c
> > > > > > @@ -1454,6 +1454,10 @@ bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> > > > > >  		return &bpf_get_current_cgroup_id_proto;
> > > > > >  	case BPF_FUNC_get_current_ancestor_cgroup_id:
> > > > > >  		return &bpf_get_current_ancestor_cgroup_id_proto;
> > > > > > +	case BPF_FUNC_cgroup_local_storage_get:
> > > > > > +		return &bpf_cgroup_storage_get_proto;
> > > > > > +	case BPF_FUNC_cgroup_local_storage_delete:
> > > > > > +		return &bpf_cgroup_storage_delete_proto;
> > > > > >  #endif
> > > > > >  	case BPF_FUNC_send_signal:
> > > > > >  		return &bpf_send_signal_proto;
> > > > > > diff --git a/scripts/bpf_doc.py b/scripts/bpf_doc.py
> > > > > > index c0e6690be82a..fdb0aff8cb5a 100755
> > > > > > --- a/scripts/bpf_doc.py
> > > > > > +++ b/scripts/bpf_doc.py
> > > > > > @@ -685,6 +685,7 @@ class PrinterHelpers(Printer):
> > > > > >              'struct udp6_sock',
> > > > > >              'struct unix_sock',
> > > > > >              'struct task_struct',
> > > > > > +            'struct cgroup',
> > > > > > 
> > > > > >              'struct __sk_buff',
> > > > > >              'struct sk_msg_md',
> > > > > > @@ -742,6 +743,7 @@ class PrinterHelpers(Printer):
> > > > > >              'struct udp6_sock',
> > > > > >              'struct unix_sock',
> > > > > >              'struct task_struct',
> > > > > > +            'struct cgroup',
> > > > > >              'struct path',
> > > > > >              'struct btf_ptr',
> > > > > >              'struct inode',
> > > > > > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> > > > > > index 17f61338f8f8..d918b4054297 100644
> > > > > > --- a/tools/include/uapi/linux/bpf.h
> > > > > > +++ b/tools/include/uapi/linux/bpf.h
> > > > > > @@ -935,6 +935,7 @@ enum bpf_map_type {
> > > > > >  	BPF_MAP_TYPE_TASK_STORAGE,
> > > > > >  	BPF_MAP_TYPE_BLOOM_FILTER,
> > > > > >  	BPF_MAP_TYPE_USER_RINGBUF,
> > > > > > +	BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE,
> > > > > >  };
> > > > > > 
> > > > > >  /* Note that tracing related programs such as
> > > > > > @@ -5435,6 +5436,42 @@ union bpf_attr {
> > > > > >   *		**-E2BIG** if user-space has tried to publish a sample which is
> > > > > >   *		larger than the size of the ring buffer, or which cannot fit
> > > > > >   *		within a struct bpf_dynptr.
> > > > > > + *
> > > > > > + * void *bpf_cgroup_local_storage_get(struct bpf_map *map, struct cgroup *cgroup, void *value, u64 flags)
> > > > > > + *	Description
> > > > > > + *		Get a bpf_local_storage from the *cgroup*.
> > > > > > + *
> > > > > > + *		Logically, it could be thought of as getting the value from
> > > > > > + *		a *map* with *cgroup* as the **key**. From this
> > > > > > + *		perspective, the usage is not much different from
> > > > > > + *		**bpf_map_lookup_elem**\ (*map*, **&**\ *cgroup*) except this
> > > > > > + *		helper enforces the key must be a cgroup struct and the map must also
> > > > > > + *		be a **BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE**.
> > > > > > + *
> > > > > > + *		Underneath, the value is stored locally at *cgroup* instead of
> > > > > > + *		the *map*. The *map* is used as the bpf-local-storage
> > > > > > + *		"type". The bpf-local-storage "type" (i.e. the *map*) is
> > > > > > + *		searched against all bpf_local_storage residing at *cgroup*.
> > > > > > + *
> > > > > > + *		An optional *flags* (**BPF_LOCAL_STORAGE_GET_F_CREATE**) can be
> > > > > > + *		used such that a new bpf_local_storage will be
> > > > > > + *		created if one does not exist. *value* can be used
> > > > > > + *		together with **BPF_LOCAL_STORAGE_GET_F_CREATE** to specify
> > > > > > + *		the initial value of a bpf_local_storage. If *value* is
> > > > > > + *		**NULL**, the new bpf_local_storage will be zero initialized.
> > > > > > + *	Return
> > > > > > + *		A bpf_local_storage pointer is returned on success.
> > > > > > + *
> > > > > > + *		**NULL** if not found or there was an error in adding
> > > > > > + *		a new bpf_local_storage.
> > > > > > + *
> > > > > > + * long bpf_cgroup_local_storage_delete(struct bpf_map *map, struct cgroup *cgroup)
> > > > > > + *	Description
> > > > > > + *		Delete a bpf_local_storage from a *cgroup*.
> > > > > > + *	Return
> > > > > > + *		0 on success.
> > > > > > + *
> > > > > > + *		**-ENOENT** if the bpf_local_storage cannot be found.
> > > > > >   */
> > > > > >  #define ___BPF_FUNC_MAPPER(FN, ctx...)			\
> > > > > >  	FN(unspec, 0, ##ctx)				\
> > > > > > @@ -5647,6 +5684,8 @@ union bpf_attr {
> > > > > >  	FN(tcp_raw_check_syncookie_ipv6, 207, ##ctx)	\
> > > > > >  	FN(ktime_get_tai_ns, 208, ##ctx)		\
> > > > > >  	FN(user_ringbuf_drain, 209, ##ctx)		\
> > > > > > +	FN(cgroup_local_storage_get, 210, ##ctx)	\
> > > > > > +	FN(cgroup_local_storage_delete, 211, ##ctx)	\
> > > > > >  /* */
> > > > > > 
> > > > > >  /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that don't
> > > > > > -- 
> > > > > > 2.30.2

^ permalink raw reply	[flat|nested] 38+ messages in thread
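[Editor's note: a minimal sketch of how a non-cgroup-attached program might use the new map type and helpers, based on the helper documentation quoted above. The map name, section name, and the way the cgroup pointer is obtained are invented for illustration and are not taken from the patch's selftests; unlike the old bpf_get_local_storage(), the returned pointer must be NULL-checked because the helper can fail.]

```c
/* Illustrative only: assumes the BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE map
 * type and the bpf_cgroup_local_storage_get() helper from this series.
 */
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char _license[] SEC("license") = "GPL";

struct {
	__uint(type, BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__type(key, int);
	__type(value, long);
} cgrp_counter SEC(".maps");

SEC("tp_btf/sys_enter")
int BPF_PROG(count_sys_enter)
{
	struct task_struct *task = bpf_get_current_task_btf();
	long *val;

	/* dfl_cgrp is the task's cgroup on the default hierarchy; the
	 * helper takes a struct cgroup * (a BTF-typed kernel pointer),
	 * not a cgroup id or fd. */
	val = bpf_cgroup_local_storage_get(&cgrp_counter,
					   task->cgroups->dfl_cgrp, 0,
					   BPF_LOCAL_STORAGE_GET_F_CREATE);
	if (!val)	/* can be NULL, e.g. when the cgroup is dying */
		return 0;

	__sync_fetch_and_add(val, 1);
	return 0;
}
```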
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs
  2022-10-17 19:11             ` Yosry Ahmed
@ 2022-10-17 19:26               ` Tejun Heo
  2022-10-17 21:07               ` Martin KaFai Lau
  1 sibling, 0 replies; 38+ messages in thread
From: Tejun Heo @ 2022-10-17 19:26 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Stanislav Fomichev, Yonghong Song, bpf, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, kernel-team, KP Singh,
	Martin KaFai Lau

Hello,

On Mon, Oct 17, 2022 at 12:11:55PM -0700, Yosry Ahmed wrote:
> I agree that it's not ideal, but it feels like we are comparing two
> non-ideal options anyway, I am just throwing ideas around :)

In the spirit of throwing ideas around, I wonder whether the better way to
go about it is keeping them separate with clear documentation, and figuring
out a way to deprecate the old one, as AFAICS the new one should be able to
do everything the old one was doing. Would it be an option to, say, make
the verifier warn users towards converting to the new one and eventually
remove the old one down the line?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 38+ messages in thread
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs
  2022-10-17 19:11             ` Yosry Ahmed
  2022-10-17 19:26             ` Tejun Heo
@ 2022-10-17 21:07             ` Martin KaFai Lau
  2022-10-17 21:23               ` Yosry Ahmed
  2022-10-17 22:16               ` sdf
  1 sibling, 2 replies; 38+ messages in thread
From: Martin KaFai Lau @ 2022-10-17 21:07 UTC (permalink / raw)
To: Yosry Ahmed, Yonghong Song, Stanislav Fomichev
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo

On 10/17/22 12:11 PM, Yosry Ahmed wrote:
> On Mon, Oct 17, 2022 at 12:07 PM Stanislav Fomichev <sdf@google.com> wrote:
>>
>> On Mon, Oct 17, 2022 at 11:47 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>>>
>>> On Mon, Oct 17, 2022 at 11:43 AM Stanislav Fomichev <sdf@google.com> wrote:
>>>>
>>>> On Mon, Oct 17, 2022 at 11:26 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>>>>>
>>>>> On Mon, Oct 17, 2022 at 11:02 AM <sdf@google.com> wrote:
>>>>>>
>>>>>> On 10/13, Yonghong Song wrote:
>>>>>>> Similar to sk/inode/task storage, implement similar cgroup local storage.
>>>>>>
>>>>>>> There already exists a local storage implementation for cgroup-attached
>>>>>>> bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper
>>>>>>> bpf_get_local_storage(). But there are use cases such that non-cgroup
>>>>>>> attached bpf progs wants to access cgroup local storage data. For example,
>>>>>>> tc egress prog has access to sk and cgroup. It is possible to use
>>>>>>> sk local storage to emulate cgroup local storage by storing data in socket.
>>>>>>> But this is a waste as it could be lots of sockets belonging to a particular
>>>>>>> cgroup. Alternatively, a separate map can be created with cgroup id as the key.
>>>>>>> But this will introduce additional overhead to manipulate the new map.
>>>>>>> A cgroup local storage, similar to existing sk/inode/task storage,
>>>>>>> should help for this use case.
>>>>>>
>>>>>>> The life-cycle of storage is managed with the life-cycle of the
>>>>>>> cgroup struct. i.e. the storage is destroyed along with the owning cgroup
>>>>>>> with a callback to the bpf_cgroup_storage_free when cgroup itself
>>>>>>> is deleted.
>>>>>>
>>>>>>> The userspace map operations can be done by using a cgroup fd as a key
>>>>>>> passed to the lookup, update and delete operations.
>>>>>>
>>>>>> [..]
>>>>>>
>>>>>>> Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old cgroup local
>>>>>>> storage support, the new map name BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE is used
>>>>>>> for cgroup storage available to non-cgroup-attached bpf programs. The two
>>>>>>> helpers are named as bpf_cgroup_local_storage_get() and
>>>>>>> bpf_cgroup_local_storage_delete().
>>>>>>
>>>>>> Have you considered doing something similar to 7d9c3427894f ("bpf: Make
>>>>>> cgroup storages shared between programs on the same cgroup") where
>>>>>> the map changes its behavior depending on the key size (see key_size checks
>>>>>> in cgroup_storage_map_alloc)? Looks like sizeof(int) for fd still
>>>>>> can be used so we can, in theory, reuse the name..
>>>>>>
>>>>>> Pros:
>>>>>> - no need for a new map name
>>>>>>
>>>>>> Cons:
>>>>>> - existing BPF_MAP_TYPE_CGROUP_STORAGE is already messy; might be not a
>>>>>> good idea to add more stuff to it?
>>>>>>
>>>>>> But, for the very least, should we also extend
>>>>>> Documentation/bpf/map_cgroup_storage.rst to cover the new map? We've
>>>>>> tried to keep some of the important details in there..
>>>>>
>>>>> This might be a long shot, but is it possible to switch completely to
>>>>> this new generic cgroup storage, and for programs that attach to
>>>>> cgroups we can still do lookups/allocations during attachment like we
>>>>> do today? IOW, maintain the current API for cgroup progs but switch it
>>>>> to use this new map type instead.
>>>>>
>>>>> It feels like this map type is more generic and can be a superset of
>>>>> the existing cgroup storage, but I feel like I am missing something.
>>>>
>>>> I feel like the biggest issue is that the existing
>>>> bpf_get_local_storage helper is guaranteed to always return non-null
>>>> and the verifier doesn't require the programs to do null checks on it;
>>>> the new helper might return NULL making all existing programs fail the
>>>> verifier.
>>>
>>> What I meant is, keep the old bpf_get_local_storage helper only for
>>> cgroup-attached programs like we have today, and add a new generic
>>> bpf_cgroup_local_storage_get() helper.
>>>
>>> For cgroup-attached programs, make sure a cgroup storage entry is
>>> allocated and hooked to the helper on program attach time, to keep
>>> today's behavior constant.
>>>
>>> For other programs, the bpf_cgroup_local_storage_get() will do the
>>> normal lookup and allocate if necessary.
>>>
>>> Does this make any sense to you?
>>
>> But then you also need to somehow mark these to make sure it's not
>> possible to delete them as long as the program is loaded/attached? Not
>> saying it's impossible, but it's a bit of a departure from the
>> existing common local storage framework used by inode/task; not sure
>> whether we want to pull all this complexity in there? But we can
>> definitely try if there is a wider agreement..
> 
> I agree that it's not ideal, but it feels like we are comparing two
> non-ideal options anyway, I am just throwing ideas around :)

I don't think it is a good idea to marry the new
BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE and the existing BPF_MAP_TYPE_CGROUP_STORAGE
in any way. The API is very different; a few differences have already been
mentioned here: delete is one, storage creation time is another, and the map
key is also different. Yes, maybe we can reuse the different key-size concept
from bpf_cgroup_storage_key in some way, but it still feels like too many
unnecessary quirks for the existing sk/inode/task storage users to remember.

imo, it is better to keep them separate and have a different map-type. Adding
a map flag or using map_extra would make it sound like an extension, which it
is not.

>>
>>>> There might be something else I don't remember at this point (besides
>>>> that weird per-prog_type that we'd have to emulate as well)..
>>>
>>> Yeah there are things that will need to be emulated, but I feel like
>>> we may end up with less confusing code (and less code in general).

^ permalink raw reply	[flat|nested] 38+ messages in thread
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs
  2022-10-17 21:07             ` Martin KaFai Lau
@ 2022-10-17 21:23               ` Yosry Ahmed
  2022-10-17 23:55                 ` Martin KaFai Lau
  2022-10-17 22:16               ` sdf
  1 sibling, 1 reply; 38+ messages in thread
From: Yosry Ahmed @ 2022-10-17 21:23 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: Yonghong Song, Stanislav Fomichev, bpf, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, kernel-team, KP Singh,
	Martin KaFai Lau, Tejun Heo

On Mon, Oct 17, 2022 at 2:07 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 10/17/22 12:11 PM, Yosry Ahmed wrote:
> > On Mon, Oct 17, 2022 at 12:07 PM Stanislav Fomichev <sdf@google.com> wrote:
> >>
> >> [..]
> >>
> >> But then you also need to somehow mark these to make sure it's not
> >> possible to delete them as long as the program is loaded/attached? Not
> >> saying it's impossible, but it's a bit of a departure from the
> >> existing common local storage framework used by inode/task; not sure
> >> whether we want to pull all this complexity in there? But we can
> >> definitely try if there is a wider agreement..
> >
> > I agree that it's not ideal, but it feels like we are comparing two
> > non-ideal options anyway, I am just throwing ideas around :)
>
> I don't think it is a good idea to marry the new
> BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE and the existing BPF_MAP_TYPE_CGROUP_STORAGE
> in any way. The API is very different. A few have already been mentioned here.
> Delete is one. Storage creation time is another one. The map key is also
> different. Yes, maybe we can reuse the different key size concept in
> bpf_cgroup_storage_key in some way but still feel too much unnecessary quirks
> for the existing sk/inode/task storage users to remember.
>
> imo, it is better to keep them separate and have a different map-type. Adding a
> map flag or using map extra will make it sounds like an extension which it is not.

I was actually proposing considering the existing cgroup storage as an
extension to the new cgroup local storage. Basically the new cgroup
local storage is a generic cgroup-indexed map, and for cgroup-attached
programs they get some nice extensions, such as preallocation (create
local storage on attachment) and fast lookups (stash a pointer to the
attached cgroup storage for direct access). There are, of course, some
quirks, but it felt to me like something that is easier to reason
about, and less code to maintain.

For the helpers, we can maintain the existing one and generalize it
(get the local storage for my cgroup), and add a new one that we pass
the cgroup into (as in this patch).

My idea is not to have a different flag or key size, but just
basically rework the existing cgroup storage as an extension to the
new one for cgroup-attached programs.

Anyway, like I said I was just throwing ideas around, you have a lot
more background here than me :)

> >>
> >>>> There might be something else I don't remember at this point (besides
> >>>> that weird per-prog_type that we'd have to emulate as well)..
> >>>
> >>> Yeah there are things that will need to be emulated, but I feel like
> >>> we may end up with less confusing code (and less code in general).

^ permalink raw reply	[flat|nested] 38+ messages in thread
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs 2022-10-17 21:23 ` Yosry Ahmed @ 2022-10-17 23:55 ` Martin KaFai Lau 2022-10-18 0:47 ` Yosry Ahmed 0 siblings, 1 reply; 38+ messages in thread From: Martin KaFai Lau @ 2022-10-17 23:55 UTC (permalink / raw) To: Yosry Ahmed Cc: Yonghong Song, Stanislav Fomichev, bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo On 10/17/22 2:23 PM, Yosry Ahmed wrote: > On Mon, Oct 17, 2022 at 2:07 PM Martin KaFai Lau <martin.lau@linux.dev> wrote: >> >> On 10/17/22 12:11 PM, Yosry Ahmed wrote: >>> On Mon, Oct 17, 2022 at 12:07 PM Stanislav Fomichev <sdf@google.com> wrote: >>>> >>>> On Mon, Oct 17, 2022 at 11:47 AM Yosry Ahmed <yosryahmed@google.com> wrote: >>>>> >>>>> On Mon, Oct 17, 2022 at 11:43 AM Stanislav Fomichev <sdf@google.com> wrote: >>>>>> >>>>>> On Mon, Oct 17, 2022 at 11:26 AM Yosry Ahmed <yosryahmed@google.com> wrote: >>>>>>> >>>>>>> On Mon, Oct 17, 2022 at 11:02 AM <sdf@google.com> wrote: >>>>>>>> >>>>>>>> On 10/13, Yonghong Song wrote: >>>>>>>>> Similar to sk/inode/task storage, implement similar cgroup local storage. >>>>>>>> >>>>>>>>> There already exists a local storage implementation for cgroup-attached >>>>>>>>> bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper >>>>>>>>> bpf_get_local_storage(). But there are use cases such that non-cgroup >>>>>>>>> attached bpf progs wants to access cgroup local storage data. For example, >>>>>>>>> tc egress prog has access to sk and cgroup. It is possible to use >>>>>>>>> sk local storage to emulate cgroup local storage by storing data in >>>>>>>>> socket. >>>>>>>>> But this is a waste as it could be lots of sockets belonging to a >>>>>>>>> particular >>>>>>>>> cgroup. Alternatively, a separate map can be created with cgroup id as >>>>>>>>> the key. >>>>>>>>> But this will introduce additional overhead to manipulate the new map. 
>>>>>>>>> A cgroup local storage, similar to existing sk/inode/task storage, >>>>>>>>> should help for this use case. >>>>>>>> >>>>>>>>> The life-cycle of storage is managed with the life-cycle of the >>>>>>>>> cgroup struct. i.e. the storage is destroyed along with the owning cgroup >>>>>>>>> with a callback to the bpf_cgroup_storage_free when cgroup itself >>>>>>>>> is deleted. >>>>>>>> >>>>>>>>> The userspace map operations can be done by using a cgroup fd as a key >>>>>>>>> passed to the lookup, update and delete operations. >>>>>>>> >>>>>>>> >>>>>>>> [..] >>>>>>>> >>>>>>>>> Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old cgroup >>>>>>>>> local >>>>>>>>> storage support, the new map name BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE is >>>>>>>>> used >>>>>>>>> for cgroup storage available to non-cgroup-attached bpf programs. The two >>>>>>>>> helpers are named as bpf_cgroup_local_storage_get() and >>>>>>>>> bpf_cgroup_local_storage_delete(). >>>>>>>> >>>>>>>> Have you considered doing something similar to 7d9c3427894f ("bpf: Make >>>>>>>> cgroup storages shared between programs on the same cgroup") where >>>>>>>> the map changes its behavior depending on the key size (see key_size checks >>>>>>>> in cgroup_storage_map_alloc)? Looks like sizeof(int) for fd still >>>>>>>> can be used so we can, in theory, reuse the name.. >>>>>>>> >>>>>>>> Pros: >>>>>>>> - no need for a new map name >>>>>>>> >>>>>>>> Cons: >>>>>>>> - existing BPF_MAP_TYPE_CGROUP_STORAGE is already messy; might be not a >>>>>>>> good idea to add more stuff to it? >>>>>>>> >>>>>>>> But, for the very least, should we also extend >>>>>>>> Documentation/bpf/map_cgroup_storage.rst to cover the new map? We've >>>>>>>> tried to keep some of the important details in there.. 
>>>>>>> >>>>>>> This might be a long shot, but is it possible to switch completely to >>>>>>> this new generic cgroup storage, and for programs that attach to >>>>>>> cgroups we can still do lookups/allocations during attachment like we >>>>>>> do today? IOW, maintain the current API for cgroup progs but switch it >>>>>>> to use this new map type instead. >>>>>>> >>>>>>> It feels like this map type is more generic and can be a superset of >>>>>>> the existing cgroup storage, but I feel like I am missing something. >>>>>> >>>>>> I feel like the biggest issue is that the existing >>>>>> bpf_get_local_storage helper is guaranteed to always return non-null >>>>>> and the verifier doesn't require the programs to do null checks on it; >>>>>> the new helper might return NULL making all existing programs fail the >>>>>> verifier. >>>>> >>>>> What I meant is, keep the old bpf_get_local_storage helper only for >>>>> cgroup-attached programs like we have today, and add a new generic >>>>> bpf_cgroup_local_storage_get() helper. >>>>> >>>>> For cgroup-attached programs, make sure a cgroup storage entry is >>>>> allocated and hooked to the helper on program attach time, to keep >>>>> today's behavior constant. >>>>> >>>>> For other programs, the bpf_cgroup_local_storage_get() will do the >>>>> normal lookup and allocate if necessary. >>>>> >>>>> Does this make any sense to you? >>>> >>>> But then you also need to somehow mark these to make sure it's not >>>> possible to delete them as long as the program is loaded/attached? Not >>>> saying it's impossible, but it's a bit of a departure from the >>>> existing common local storage framework used by inode/task; not sure >>>> whether we want to pull all this complexity in there? But we can >>>> definitely try if there is a wider agreement.. 
>>>
>>> I agree that it's not ideal, but it feels like we are comparing two
>>> non-ideal options anyway, I am just throwing ideas around :)
>>
>> I don't think it is a good idea to marry the new
>> BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE and the existing BPF_MAP_TYPE_CGROUP_STORAGE
>> in any way. The API is very different. A few differences have already been
>> mentioned here. Delete is one. Storage creation time is another. The map key
>> is also different. Yes, maybe we can reuse the different key size concept in
>> bpf_cgroup_storage_key in some way, but it still feels like too many
>> unnecessary quirks for the existing sk/inode/task storage users to remember.
>>
>> imo, it is better to keep them separate and have a different map-type. Adding
>> a map flag or using map_extra would make it sound like an extension, which it
>> is not.
>
> I was actually proposing considering the existing cgroup storage as an
> extension to the new cgroup local storage. Basically the new cgroup
> local storage is a generic cgroup-indexed map, and for cgroup-attached
> programs they get some nice extensions, such as preallocation (create
> local storage on attachment) and fast lookups (stash a pointer to the
> attached cgroup storage for direct access). There are, of course, some
> quirks, but it felt to me like something that is easier to reason
> about, and less code to maintain.

Like extending the new BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE implementation and
adding code to make it work like the existing BPF_MAP_TYPE_CGROUP_STORAGE, so
that the existing code can go away?

hmm..... A quick thought is that it is probably not worth it for the code
removal alone. If all use cases can be satisfied by
BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, retiring the existing one eventually may be
a cleaner answer than refactoring it.

Pre-allocation could be useful. User space can do it by using the
bpf_map_update_elem syscall with the new BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE
before attaching the program.

For fast-lookup/stash-pointer, yes, the current limitation that a bpf prog can
use only one BPF_MAP_TYPE_CGROUP_STORAGE makes this easier. However, afaik,
the existing bpf_get_local_storage() is also doing
current->bpf_ctx->prog_item->cgroup_storage. It is not clear to me which one
is faster though. It needs a micro benchmark to tell.

Also, there is quite a lot of code in local_storage.c, and I am not sure all
of it makes sense for the new BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE to support,
e.g. ".map_get_next_key = cgroup_storage_get_next_key". The new
BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE does not support iteration from user space
because it has bpf_iter, which supports iteration by a bpf prog that can get
directly to the kernel ptr (task/sk/...) instead of an fd.

In the future, we will add features to bpf_local_storage.c that will work for
all kernel objects whenever possible, e.g. adding map-in-map to the
sk/inode/task/cgroup local storage and storing a ring-buf map in the sk (eg)
storage. The inner map may not always make sense to create during
cgroup-attach time, and it would be another exception to make for the
alloc-during-cgroup-attach behavior.

>
> For the helpers, we can maintain the existing one and generalize it
> (get the local storage for my cgroup), and add a new one that we pass
> the cgroup into (as in this patch).
>
> My idea is not to have a different flag or key size, but just
> basically rework the existing cgroup storage as an extension to the
> new one for cgroup-attached programs.

^ permalink raw reply	[flat|nested] 38+ messages in thread
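The API contrast being debated above (the old helper is guaranteed non-NULL, the new one may fail) can be sketched on the BPF side roughly as follows. This is a hypothetical sketch based on the helper and map names proposed in this series, not code from the patch; it assumes the cgroup pointer is reachable via the current task, as in tracing programs:

```c
/*
 * Hypothetical sketch only. Assumes the names proposed in this series
 * (BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, bpf_cgroup_local_storage_get) and
 * that task->cgroups->dfl_cgrp is a valid way to reach the cgroup here.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__type(key, int);
	__type(value, long);
} cgrp_map SEC(".maps");

SEC("tp_btf/sys_enter")
int count_per_cgroup(void *ctx)
{
	struct task_struct *task = bpf_get_current_task_btf();
	long *val;

	/* May allocate on first access, but may also return NULL. */
	val = bpf_cgroup_local_storage_get(&cgrp_map,
					   task->cgroups->dfl_cgrp, 0,
					   BPF_LOCAL_STORAGE_GET_F_CREATE);
	if (!val)	/* mandatory check, unlike bpf_get_local_storage() */
		return 0;
	__sync_fetch_and_add(val, 1);
	return 0;
}

char _license[] SEC("license") = "GPL";
```

The NULL check is the crux of the incompatibility discussed above: the verifier rejects a program that dereferences the returned pointer without it, so existing bpf_get_local_storage() call sites cannot be silently switched to the new helper.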
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs 2022-10-17 23:55 ` Martin KaFai Lau @ 2022-10-18 0:47 ` Yosry Ahmed 0 siblings, 0 replies; 38+ messages in thread From: Yosry Ahmed @ 2022-10-18 0:47 UTC (permalink / raw) To: Martin KaFai Lau Cc: Yonghong Song, Stanislav Fomichev, bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo On Mon, Oct 17, 2022 at 4:55 PM Martin KaFai Lau <martin.lau@linux.dev> wrote: > > On 10/17/22 2:23 PM, Yosry Ahmed wrote: > > On Mon, Oct 17, 2022 at 2:07 PM Martin KaFai Lau <martin.lau@linux.dev> wrote: > >> > >> On 10/17/22 12:11 PM, Yosry Ahmed wrote: > >>> On Mon, Oct 17, 2022 at 12:07 PM Stanislav Fomichev <sdf@google.com> wrote: > >>>> > >>>> On Mon, Oct 17, 2022 at 11:47 AM Yosry Ahmed <yosryahmed@google.com> wrote: > >>>>> > >>>>> On Mon, Oct 17, 2022 at 11:43 AM Stanislav Fomichev <sdf@google.com> wrote: > >>>>>> > >>>>>> On Mon, Oct 17, 2022 at 11:26 AM Yosry Ahmed <yosryahmed@google.com> wrote: > >>>>>>> > >>>>>>> On Mon, Oct 17, 2022 at 11:02 AM <sdf@google.com> wrote: > >>>>>>>> > >>>>>>>> On 10/13, Yonghong Song wrote: > >>>>>>>>> Similar to sk/inode/task storage, implement similar cgroup local storage. > >>>>>>>> > >>>>>>>>> There already exists a local storage implementation for cgroup-attached > >>>>>>>>> bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper > >>>>>>>>> bpf_get_local_storage(). But there are use cases such that non-cgroup > >>>>>>>>> attached bpf progs wants to access cgroup local storage data. For example, > >>>>>>>>> tc egress prog has access to sk and cgroup. It is possible to use > >>>>>>>>> sk local storage to emulate cgroup local storage by storing data in > >>>>>>>>> socket. > >>>>>>>>> But this is a waste as it could be lots of sockets belonging to a > >>>>>>>>> particular > >>>>>>>>> cgroup. Alternatively, a separate map can be created with cgroup id as > >>>>>>>>> the key. 
> >>>>>>>>> But this will introduce additional overhead to manipulate the new map. > >>>>>>>>> A cgroup local storage, similar to existing sk/inode/task storage, > >>>>>>>>> should help for this use case. > >>>>>>>> > >>>>>>>>> The life-cycle of storage is managed with the life-cycle of the > >>>>>>>>> cgroup struct. i.e. the storage is destroyed along with the owning cgroup > >>>>>>>>> with a callback to the bpf_cgroup_storage_free when cgroup itself > >>>>>>>>> is deleted. > >>>>>>>> > >>>>>>>>> The userspace map operations can be done by using a cgroup fd as a key > >>>>>>>>> passed to the lookup, update and delete operations. > >>>>>>>> > >>>>>>>> > >>>>>>>> [..] > >>>>>>>> > >>>>>>>>> Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old cgroup > >>>>>>>>> local > >>>>>>>>> storage support, the new map name BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE is > >>>>>>>>> used > >>>>>>>>> for cgroup storage available to non-cgroup-attached bpf programs. The two > >>>>>>>>> helpers are named as bpf_cgroup_local_storage_get() and > >>>>>>>>> bpf_cgroup_local_storage_delete(). > >>>>>>>> > >>>>>>>> Have you considered doing something similar to 7d9c3427894f ("bpf: Make > >>>>>>>> cgroup storages shared between programs on the same cgroup") where > >>>>>>>> the map changes its behavior depending on the key size (see key_size checks > >>>>>>>> in cgroup_storage_map_alloc)? Looks like sizeof(int) for fd still > >>>>>>>> can be used so we can, in theory, reuse the name.. > >>>>>>>> > >>>>>>>> Pros: > >>>>>>>> - no need for a new map name > >>>>>>>> > >>>>>>>> Cons: > >>>>>>>> - existing BPF_MAP_TYPE_CGROUP_STORAGE is already messy; might be not a > >>>>>>>> good idea to add more stuff to it? > >>>>>>>> > >>>>>>>> But, for the very least, should we also extend > >>>>>>>> Documentation/bpf/map_cgroup_storage.rst to cover the new map? We've > >>>>>>>> tried to keep some of the important details in there.. 
> >>>>>>> > >>>>>>> This might be a long shot, but is it possible to switch completely to > >>>>>>> this new generic cgroup storage, and for programs that attach to > >>>>>>> cgroups we can still do lookups/allocations during attachment like we > >>>>>>> do today? IOW, maintain the current API for cgroup progs but switch it > >>>>>>> to use this new map type instead. > >>>>>>> > >>>>>>> It feels like this map type is more generic and can be a superset of > >>>>>>> the existing cgroup storage, but I feel like I am missing something. > >>>>>> > >>>>>> I feel like the biggest issue is that the existing > >>>>>> bpf_get_local_storage helper is guaranteed to always return non-null > >>>>>> and the verifier doesn't require the programs to do null checks on it; > >>>>>> the new helper might return NULL making all existing programs fail the > >>>>>> verifier. > >>>>> > >>>>> What I meant is, keep the old bpf_get_local_storage helper only for > >>>>> cgroup-attached programs like we have today, and add a new generic > >>>>> bpf_cgroup_local_storage_get() helper. > >>>>> > >>>>> For cgroup-attached programs, make sure a cgroup storage entry is > >>>>> allocated and hooked to the helper on program attach time, to keep > >>>>> today's behavior constant. > >>>>> > >>>>> For other programs, the bpf_cgroup_local_storage_get() will do the > >>>>> normal lookup and allocate if necessary. > >>>>> > >>>>> Does this make any sense to you? > >>>> > >>>> But then you also need to somehow mark these to make sure it's not > >>>> possible to delete them as long as the program is loaded/attached? Not > >>>> saying it's impossible, but it's a bit of a departure from the > >>>> existing common local storage framework used by inode/task; not sure > >>>> whether we want to pull all this complexity in there? But we can > >>>> definitely try if there is a wider agreement.. 
> >>> > >>> I agree that it's not ideal, but it feels like we are comparing two > >>> non-ideal options anyway, I am just throwing ideas around :) > >> > >> I don't think it is a good idea to marry the new > >> BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE and the existing BPF_MAP_TYPE_CGROUP_STORAGE > >> in any way. The API is very different. A few have already been mentioned here. > >> Delete is one. Storage creation time is another one. The map key is also > >> different. Yes, maybe we can reuse the different key size concept in > >> bpf_cgroup_storage_key in some way but still feel too much unnecessary quirks > >> for the existing sk/inode/task storage users to remember. > >> > >> imo, it is better to keep them separate and have a different map-type. Adding a > >> map flag or using map extra will make it sounds like an extension which it is not. > > > > I was actually proposing considering the existing cgroup storage as an > > extension to the new cgroup local storage. Basically the new cgroup > > local storage is a generic cgroup-indexed map, and for cgroup-attached > > programs they get some nice extensions, such as preallocation (create > > local storage on attachment) and fast lookups (stash a pointer to the > > attached cgroup storage for direct access). There are, of course, some > > quirks, but it felt to me like something that is easier to reason > > about, and less code to maintain > Like extending the new BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE implementation and add > codes to make it work like the existing BPF_MAP_TYPE_CGROUP_STORAGE such that > those existing code can go away? > > hmm..... A quick thought is it probably does not worth it for the code removal > purpose alone. If all use cases can be satisfied by the > BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, retiring the existing one eventually may be a > cleaner answer instead of re-factoring it. > > Pre-allocation could be useful. 
The user space can do it by using
> bpf_map_update_elem syscall with the new BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE
> before attaching the program.
>
> For fast-lookup/stash-pointer, yes, the current limitation that a bpf prog can
> use only one BPF_MAP_TYPE_CGROUP_STORAGE makes this easier. However, afaik,
> the existing bpf_get_local_storage() is also doing
> current->bpf_ctx->prog_item->cgroup_storage. It is not clear to me which one
> is faster though. It needs a micro benchmark to tell.
>
> Also, there is quite a lot of code in local_storage.c, and I am not sure all
> of it makes sense for the new BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE to support,
> e.g. ".map_get_next_key = cgroup_storage_get_next_key". The new
> BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE does not support iteration from user space
> because it has bpf_iter, which supports iteration by a bpf prog that can get
> directly to the kernel ptr (task/sk/...) instead of an fd.
>
> In the future, we will add features to bpf_local_storage.c that will work for
> all kernel objects whenever possible, e.g. adding map-in-map to the
> sk/inode/task/cgroup local storage and storing a ring-buf map in the sk (eg)
> storage. The inner map may not always make sense to create during
> cgroup-attach time, and it would be another exception to make for the
> alloc-during-cgroup-attach behavior.
>
> >
> > For the helpers, we can maintain the existing one and generalize it
> > (get the local storage for my cgroup), and add a new one that we pass
> > the cgroup into (as in this patch).
> >
> > My idea is not to have a different flag or key size, but just
> > basically rework the existing cgroup storage as an extension to the
> > new one for cgroup-attached programs.

I see what you mean, thanks for clarifying your thoughts. I think retiring the
old cgroup storage at some point, if all its use cases become covered by the
new cgroup local storage, is the cleaner answer. Meanwhile, we will need clear
docs in the code and for users to draw a clear distinction between the two map
types.

^ permalink raw reply	[flat|nested] 38+ messages in thread
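The pre-allocation idea discussed above (create the storage from user space before any program is attached) can be sketched with libbpf's syscall wrappers. This is a hypothetical sketch assuming the semantics proposed in this series, where the new map type takes a cgroup fd as its user-space key; error handling is trimmed and the path is illustrative:

```c
/*
 * Hypothetical sketch: pre-allocate the cgroup local storage entry for
 * one cgroup before attaching the program, so the first in-kernel
 * lookup never has to allocate.
 */
#include <fcntl.h>
#include <unistd.h>
#include <bpf/bpf.h>

int preallocate_cgroup_storage(int map_fd, const char *cgroup_path)
{
	long init_val = 0;
	int err;
	int cgroup_fd = open(cgroup_path, O_RDONLY);

	if (cgroup_fd < 0)
		return -1;
	/* Creates the storage for this cgroup if it does not exist yet. */
	err = bpf_map_update_elem(map_fd, &cgroup_fd, &init_val, BPF_ANY);
	close(cgroup_fd);
	return err;
}
```

This keeps the preallocation policy entirely in user space, which is what makes it possible without adding attach-time hooks to the common bpf_local_storage framework.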
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs 2022-10-17 21:07 ` Martin KaFai Lau 2022-10-17 21:23 ` Yosry Ahmed @ 2022-10-17 22:16 ` sdf 2022-10-18 0:52 ` Martin KaFai Lau 1 sibling, 1 reply; 38+ messages in thread From: sdf @ 2022-10-17 22:16 UTC (permalink / raw) To: Martin KaFai Lau Cc: Yosry Ahmed, Yonghong Song, bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo On 10/17, Martin KaFai Lau wrote: > On 10/17/22 12:11 PM, Yosry Ahmed wrote: > > On Mon, Oct 17, 2022 at 12:07 PM Stanislav Fomichev <sdf@google.com> > wrote: > > > > > > On Mon, Oct 17, 2022 at 11:47 AM Yosry Ahmed <yosryahmed@google.com> > wrote: > > > > > > > > On Mon, Oct 17, 2022 at 11:43 AM Stanislav Fomichev > <sdf@google.com> wrote: > > > > > > > > > > On Mon, Oct 17, 2022 at 11:26 AM Yosry Ahmed > <yosryahmed@google.com> wrote: > > > > > > > > > > > > On Mon, Oct 17, 2022 at 11:02 AM <sdf@google.com> wrote: > > > > > > > > > > > > > > On 10/13, Yonghong Song wrote: > > > > > > > > Similar to sk/inode/task storage, implement similar cgroup > local storage. > > > > > > > > > > > > > > > There already exists a local storage implementation for > cgroup-attached > > > > > > > > bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and > helper > > > > > > > > bpf_get_local_storage(). But there are use cases such that > non-cgroup > > > > > > > > attached bpf progs wants to access cgroup local storage > data. For example, > > > > > > > > tc egress prog has access to sk and cgroup. It is possible > to use > > > > > > > > sk local storage to emulate cgroup local storage by storing > data in > > > > > > > > socket. > > > > > > > > But this is a waste as it could be lots of sockets > belonging to a > > > > > > > > particular > > > > > > > > cgroup. Alternatively, a separate map can be created with > cgroup id as > > > > > > > > the key. 
> > > > > > > > But this will introduce additional overhead to manipulate > the new map. > > > > > > > > A cgroup local storage, similar to existing sk/inode/task > storage, > > > > > > > > should help for this use case. > > > > > > > > > > > > > > > The life-cycle of storage is managed with the life-cycle of > the > > > > > > > > cgroup struct. i.e. the storage is destroyed along with > the owning cgroup > > > > > > > > with a callback to the bpf_cgroup_storage_free when cgroup > itself > > > > > > > > is deleted. > > > > > > > > > > > > > > > The userspace map operations can be done by using a cgroup > fd as a key > > > > > > > > passed to the lookup, update and delete operations. > > > > > > > > > > > > > > > > > > > > > [..] > > > > > > > > > > > > > > > Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used > for old cgroup > > > > > > > > local > > > > > > > > storage support, the new map name > BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE is > > > > > > > > used > > > > > > > > for cgroup storage available to non-cgroup-attached bpf > programs. The two > > > > > > > > helpers are named as bpf_cgroup_local_storage_get() and > > > > > > > > bpf_cgroup_local_storage_delete(). > > > > > > > > > > > > > > Have you considered doing something similar to 7d9c3427894f > ("bpf: Make > > > > > > > cgroup storages shared between programs on the same cgroup") > where > > > > > > > the map changes its behavior depending on the key size (see > key_size checks > > > > > > > in cgroup_storage_map_alloc)? Looks like sizeof(int) for fd > still > > > > > > > can be used so we can, in theory, reuse the name.. > > > > > > > > > > > > > > Pros: > > > > > > > - no need for a new map name > > > > > > > > > > > > > > Cons: > > > > > > > - existing BPF_MAP_TYPE_CGROUP_STORAGE is already messy; > might be not a > > > > > > > good idea to add more stuff to it? 
> > > > > > > > > > > > > > But, for the very least, should we also extend > > > > > > > Documentation/bpf/map_cgroup_storage.rst to cover the new > map? We've > > > > > > > tried to keep some of the important details in there.. > > > > > > > > > > > > This might be a long shot, but is it possible to switch > completely to > > > > > > this new generic cgroup storage, and for programs that attach to > > > > > > cgroups we can still do lookups/allocations during attachment > like we > > > > > > do today? IOW, maintain the current API for cgroup progs but > switch it > > > > > > to use this new map type instead. > > > > > > > > > > > > It feels like this map type is more generic and can be a > superset of > > > > > > the existing cgroup storage, but I feel like I am missing > something. > > > > > > > > > > I feel like the biggest issue is that the existing > > > > > bpf_get_local_storage helper is guaranteed to always return > non-null > > > > > and the verifier doesn't require the programs to do null checks > on it; > > > > > the new helper might return NULL making all existing programs > fail the > > > > > verifier. > > > > > > > > What I meant is, keep the old bpf_get_local_storage helper only for > > > > cgroup-attached programs like we have today, and add a new generic > > > > bpf_cgroup_local_storage_get() helper. > > > > > > > > For cgroup-attached programs, make sure a cgroup storage entry is > > > > allocated and hooked to the helper on program attach time, to keep > > > > today's behavior constant. > > > > > > > > For other programs, the bpf_cgroup_local_storage_get() will do the > > > > normal lookup and allocate if necessary. > > > > > > > > Does this make any sense to you? > > > > > > But then you also need to somehow mark these to make sure it's not > > > possible to delete them as long as the program is loaded/attached? 
Not > > > saying it's impossible, but it's a bit of a departure from the > > > existing common local storage framework used by inode/task; not sure > > > whether we want to pull all this complexity in there? But we can > > > definitely try if there is a wider agreement.. > > > > I agree that it's not ideal, but it feels like we are comparing two > > non-ideal options anyway, I am just throwing ideas around :) > I don't think it is a good idea to marry the new > BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE and the existing > BPF_MAP_TYPE_CGROUP_STORAGE in any way. The API is very different. A few > have already been mentioned here. Delete is one. Storage creation time > is > another one. The map key is also different. Yes, maybe we can reuse the > different key size concept in bpf_cgroup_storage_key in some way but still > feel too much unnecessary quirks for the existing sk/inode/task storage > users to remember. > imo, it is better to keep them separate and have a different map-type. > Adding a map flag or using map extra will make it sounds like an extension > which it is not. This part is the most confusing to me: BPF_MAP_TYPE_CGROUP_STORAGE bpf_get_local_storage BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE bpf_cgroup_local_storage_get The new helpers should probably drop 'local' name to match the task/inode ([0])? And we're left with: BPF_MAP_TYPE_CGROUP_STORAGE bpf_get_local_storage BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE bpf_cgroup_storage_get You read CGROUP_STORAGE via get_local_storage and you read CGROUP_LOCAL_STORAGE via cgroup_storage_get :-/ That's why I'm slightly tilting towards reusing the name. At least we can add a big DEPRECATED message for bpf_get_local_storage and that seems to be it? All those extra key sizes can also be deprecated, but I'm honestly not sure if anybody is using them. But having a separate map also seems fine, as long as we have a patch to update the existing header documentation. 
(and mention in Documentation/bpf/map_cgroup_storage.rst that there is a replacement?) Current bpf_get_local_storage description is too vague; let's at least mention that it works only with BPF_MAP_TYPE_CGROUP_STORAGE. 0: https://lore.kernel.org/bpf/6ce7d490-f015-531f-3dbb-b6f7717f0590@meta.com/T/#mb2107250caa19a8d9ec3549a52f4a9698be99e33 > > > > > > > > There might be something else I don't remember at this point > (besides > > > > > that weird per-prog_type that we'd have to emulate as well).. > > > > > > > > Yeah there are things that will need to be emulated, but I feel like > > > > we may end up with less confusing code (and less code in general). ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs 2022-10-17 22:16 ` sdf @ 2022-10-18 0:52 ` Martin KaFai Lau 2022-10-18 5:59 ` Yonghong Song 0 siblings, 1 reply; 38+ messages in thread From: Martin KaFai Lau @ 2022-10-18 0:52 UTC (permalink / raw) To: sdf Cc: Yosry Ahmed, Yonghong Song, bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo On 10/17/22 3:16 PM, sdf@google.com wrote: > On 10/17, Martin KaFai Lau wrote: >> On 10/17/22 12:11 PM, Yosry Ahmed wrote: >> > On Mon, Oct 17, 2022 at 12:07 PM Stanislav Fomichev <sdf@google.com> wrote: >> > > >> > > On Mon, Oct 17, 2022 at 11:47 AM Yosry Ahmed <yosryahmed@google.com> wrote: >> > > > >> > > > On Mon, Oct 17, 2022 at 11:43 AM Stanislav Fomichev <sdf@google.com> wrote: >> > > > > >> > > > > On Mon, Oct 17, 2022 at 11:26 AM Yosry Ahmed <yosryahmed@google.com> >> wrote: >> > > > > > >> > > > > > On Mon, Oct 17, 2022 at 11:02 AM <sdf@google.com> wrote: >> > > > > > > >> > > > > > > On 10/13, Yonghong Song wrote: >> > > > > > > > Similar to sk/inode/task storage, implement similar cgroup local >> storage. >> > > > > > > >> > > > > > > > There already exists a local storage implementation for >> cgroup-attached >> > > > > > > > bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper >> > > > > > > > bpf_get_local_storage(). But there are use cases such that >> non-cgroup >> > > > > > > > attached bpf progs wants to access cgroup local storage data. >> For example, >> > > > > > > > tc egress prog has access to sk and cgroup. It is possible to use >> > > > > > > > sk local storage to emulate cgroup local storage by storing data in >> > > > > > > > socket. >> > > > > > > > But this is a waste as it could be lots of sockets belonging to a >> > > > > > > > particular >> > > > > > > > cgroup. Alternatively, a separate map can be created with cgroup >> id as >> > > > > > > > the key. 
>> > > > > > > > But this will introduce additional overhead to manipulate the >> new map. >> > > > > > > > A cgroup local storage, similar to existing sk/inode/task storage, >> > > > > > > > should help for this use case. >> > > > > > > >> > > > > > > > The life-cycle of storage is managed with the life-cycle of the >> > > > > > > > cgroup struct. i.e. the storage is destroyed along with the >> owning cgroup >> > > > > > > > with a callback to the bpf_cgroup_storage_free when cgroup itself >> > > > > > > > is deleted. >> > > > > > > >> > > > > > > > The userspace map operations can be done by using a cgroup fd as >> a key >> > > > > > > > passed to the lookup, update and delete operations. >> > > > > > > >> > > > > > > >> > > > > > > [..] >> > > > > > > >> > > > > > > > Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old >> cgroup >> > > > > > > > local >> > > > > > > > storage support, the new map name >> BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE is >> > > > > > > > used >> > > > > > > > for cgroup storage available to non-cgroup-attached bpf >> programs. The two >> > > > > > > > helpers are named as bpf_cgroup_local_storage_get() and >> > > > > > > > bpf_cgroup_local_storage_delete(). >> > > > > > > >> > > > > > > Have you considered doing something similar to 7d9c3427894f ("bpf: >> Make >> > > > > > > cgroup storages shared between programs on the same cgroup") where >> > > > > > > the map changes its behavior depending on the key size (see >> key_size checks >> > > > > > > in cgroup_storage_map_alloc)? Looks like sizeof(int) for fd still >> > > > > > > can be used so we can, in theory, reuse the name.. >> > > > > > > >> > > > > > > Pros: >> > > > > > > - no need for a new map name >> > > > > > > >> > > > > > > Cons: >> > > > > > > - existing BPF_MAP_TYPE_CGROUP_STORAGE is already messy; might be >> not a >> > > > > > > good idea to add more stuff to it? 
>> > > > > > > >> > > > > > > But, for the very least, should we also extend >> > > > > > > Documentation/bpf/map_cgroup_storage.rst to cover the new map? We've >> > > > > > > tried to keep some of the important details in there.. >> > > > > > >> > > > > > This might be a long shot, but is it possible to switch completely to >> > > > > > this new generic cgroup storage, and for programs that attach to >> > > > > > cgroups we can still do lookups/allocations during attachment like we >> > > > > > do today? IOW, maintain the current API for cgroup progs but switch it >> > > > > > to use this new map type instead. >> > > > > > >> > > > > > It feels like this map type is more generic and can be a superset of >> > > > > > the existing cgroup storage, but I feel like I am missing something. >> > > > > >> > > > > I feel like the biggest issue is that the existing >> > > > > bpf_get_local_storage helper is guaranteed to always return non-null >> > > > > and the verifier doesn't require the programs to do null checks on it; >> > > > > the new helper might return NULL making all existing programs fail the >> > > > > verifier. >> > > > >> > > > What I meant is, keep the old bpf_get_local_storage helper only for >> > > > cgroup-attached programs like we have today, and add a new generic >> > > > bpf_cgroup_local_storage_get() helper. >> > > > >> > > > For cgroup-attached programs, make sure a cgroup storage entry is >> > > > allocated and hooked to the helper on program attach time, to keep >> > > > today's behavior constant. >> > > > >> > > > For other programs, the bpf_cgroup_local_storage_get() will do the >> > > > normal lookup and allocate if necessary. >> > > > >> > > > Does this make any sense to you? >> > > >> > > But then you also need to somehow mark these to make sure it's not >> > > possible to delete them as long as the program is loaded/attached? 
Not >> > > saying it's impossible, but it's a bit of a departure from the >> > > existing common local storage framework used by inode/task; not sure >> > > whether we want to pull all this complexity in there? But we can >> > > definitely try if there is a wider agreement.. >> > >> > I agree that it's not ideal, but it feels like we are comparing two >> > non-ideal options anyway, I am just throwing ideas around :) > >> I don't think it is a good idea to marry the new >> BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE and the existing >> BPF_MAP_TYPE_CGROUP_STORAGE in any way. The API is very different. A few >> have already been mentioned here. Delete is one. Storage creation time is >> another one. The map key is also different. Yes, maybe we can reuse the >> different key size concept in bpf_cgroup_storage_key in some way but still >> feel too much unnecessary quirks for the existing sk/inode/task storage >> users to remember. > >> imo, it is better to keep them separate and have a different map-type. >> Adding a map flag or using map extra will make it sounds like an extension >> which it is not. > > This part is the most confusing to me: > > BPF_MAP_TYPE_CGROUP_STORAGE bpf_get_local_storage > BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE bpf_cgroup_local_storage_get > > The new helpers should probably drop 'local' name to match the task/inode ([0])? > And we're left with: > > BPF_MAP_TYPE_CGROUP_STORAGE bpf_get_local_storage > BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE bpf_cgroup_storage_get > > You read CGROUP_STORAGE via get_local_storage and > you read CGROUP_LOCAL_STORAGE via cgroup_storage_get :-/ Yep, agree that it is not ideal :( > > That's why I'm slightly tilting towards reusing the name. At least we can > add a big DEPRECATED message for bpf_get_local_storage and that seems to be > it? All those extra key sizes can also be deprecated, but I'm honestly > not sure if anybody is using them. Reusing 'key_size == sizeof(int)' to mean new map type...hmm... 
I have been thinking about it after your suggestion in another reply, since it
could keep using the BPF_MAP_TYPE_CGROUP_STORAGE name. I wish
BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE had been given to bpf_get_local_storage()
instead, because it is a better name to describe what it is doing. hmm....
However, this feels like it works as a map_flags or map_extra but in a more
hidden way. I am worried it will actually be more confusing and also have
usage surprises, given the many behavior differences this thread has already
mentioned. It will be hard for the user to reason about those API differences
just from a different key_size.

Maybe go back and revisit the naming a little bit. How about giving a new and
likely more correct 'BPF_MAP_TYPE_CGRP_LOCAL_STORAGE' name to the existing
bpf_get_local_storage() use, and then adding '#define
BPF_MAP_TYPE_CGROUP_STORAGE BPF_MAP_TYPE_CGRP_LOCAL_STORAGE /* deprecated by
BPF_MAP_TYPE_CGRP_STORAGE */' in the uapi? The new cgroup storage would use
the shorter name "cgrp", like BPF_MAP_TYPE_CGRP_STORAGE and
bpf_cgrp_storage_get().

>
> But having a separate map also seems fine, as long as we have a patch to
> update the existing header documentation. (and mention in
> Documentation/bpf/map_cgroup_storage.rst that there is a replacement?)
> Current bpf_get_local_storage description is too vague; let's at least
> mention that it works only with BPF_MAP_TYPE_CGROUP_STORAGE.
>
> 0:
> https://lore.kernel.org/bpf/6ce7d490-f015-531f-3dbb-b6f7717f0590@meta.com/T/#mb2107250caa19a8d9ec3549a52f4a9698be99e33
>
>> > >
>> > > > > There might be something else I don't remember at this point (besides
>> > > > > that weird per-prog_type that we'd have to emulate as well)..
>> > > >
>> > > > Yeah there are things that will need to be emulated, but I feel like
>> > > > we may end up with less confusing code (and less code in general).

^ permalink raw reply	[flat|nested] 38+ messages in thread
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs 2022-10-18 0:52 ` Martin KaFai Lau @ 2022-10-18 5:59 ` Yonghong Song 2022-10-18 17:08 ` sdf 0 siblings, 1 reply; 38+ messages in thread From: Yonghong Song @ 2022-10-18 5:59 UTC (permalink / raw) To: Martin KaFai Lau, sdf Cc: Yosry Ahmed, Yonghong Song, bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo On 10/17/22 5:52 PM, Martin KaFai Lau wrote: > On 10/17/22 3:16 PM, sdf@google.com wrote: >> On 10/17, Martin KaFai Lau wrote: >>> On 10/17/22 12:11 PM, Yosry Ahmed wrote: >>> > On Mon, Oct 17, 2022 at 12:07 PM Stanislav Fomichev >>> <sdf@google.com> wrote: >>> > > >>> > > On Mon, Oct 17, 2022 at 11:47 AM Yosry Ahmed >>> <yosryahmed@google.com> wrote: >>> > > > >>> > > > On Mon, Oct 17, 2022 at 11:43 AM Stanislav Fomichev >>> <sdf@google.com> wrote: >>> > > > > >>> > > > > On Mon, Oct 17, 2022 at 11:26 AM Yosry Ahmed >>> <yosryahmed@google.com> wrote: >>> > > > > > >>> > > > > > On Mon, Oct 17, 2022 at 11:02 AM <sdf@google.com> wrote: >>> > > > > > > >>> > > > > > > On 10/13, Yonghong Song wrote: >>> > > > > > > > Similar to sk/inode/task storage, implement similar >>> cgroup local storage. >>> > > > > > > >>> > > > > > > > There already exists a local storage implementation for >>> cgroup-attached >>> > > > > > > > bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE >>> and helper >>> > > > > > > > bpf_get_local_storage(). But there are use cases such >>> that non-cgroup >>> > > > > > > > attached bpf progs wants to access cgroup local storage >>> data. For example, >>> > > > > > > > tc egress prog has access to sk and cgroup. It is >>> possible to use >>> > > > > > > > sk local storage to emulate cgroup local storage by >>> storing data in >>> > > > > > > > socket. >>> > > > > > > > But this is a waste as it could be lots of sockets >>> belonging to a >>> > > > > > > > particular >>> > > > > > > > cgroup. 
Alternatively, a separate map can be created >>> with cgroup id as >>> > > > > > > > the key. >>> > > > > > > > But this will introduce additional overhead to >>> manipulate the new map. >>> > > > > > > > A cgroup local storage, similar to existing >>> sk/inode/task storage, >>> > > > > > > > should help for this use case. >>> > > > > > > >>> > > > > > > > The life-cycle of storage is managed with the >>> life-cycle of the >>> > > > > > > > cgroup struct. i.e. the storage is destroyed along >>> with the owning cgroup >>> > > > > > > > with a callback to the bpf_cgroup_storage_free when >>> cgroup itself >>> > > > > > > > is deleted. >>> > > > > > > >>> > > > > > > > The userspace map operations can be done by using a >>> cgroup fd as a key >>> > > > > > > > passed to the lookup, update and delete operations. >>> > > > > > > >>> > > > > > > >>> > > > > > > [..] >>> > > > > > > >>> > > > > > > > Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been >>> used for old cgroup >>> > > > > > > > local >>> > > > > > > > storage support, the new map name >>> BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE is >>> > > > > > > > used >>> > > > > > > > for cgroup storage available to non-cgroup-attached bpf >>> programs. The two >>> > > > > > > > helpers are named as bpf_cgroup_local_storage_get() and >>> > > > > > > > bpf_cgroup_local_storage_delete(). >>> > > > > > > >>> > > > > > > Have you considered doing something similar to >>> 7d9c3427894f ("bpf: Make >>> > > > > > > cgroup storages shared between programs on the same >>> cgroup") where >>> > > > > > > the map changes its behavior depending on the key size >>> (see key_size checks >>> > > > > > > in cgroup_storage_map_alloc)? Looks like sizeof(int) for >>> fd still >>> > > > > > > can be used so we can, in theory, reuse the name.. 
>>> > > > > > > >>> > > > > > > Pros: >>> > > > > > > - no need for a new map name >>> > > > > > > >>> > > > > > > Cons: >>> > > > > > > - existing BPF_MAP_TYPE_CGROUP_STORAGE is already messy; >>> might be not a >>> > > > > > > good idea to add more stuff to it? >>> > > > > > > >>> > > > > > > But, for the very least, should we also extend >>> > > > > > > Documentation/bpf/map_cgroup_storage.rst to cover the new >>> map? We've >>> > > > > > > tried to keep some of the important details in there.. >>> > > > > > >>> > > > > > This might be a long shot, but is it possible to switch >>> completely to >>> > > > > > this new generic cgroup storage, and for programs that >>> attach to >>> > > > > > cgroups we can still do lookups/allocations during >>> attachment like we >>> > > > > > do today? IOW, maintain the current API for cgroup progs >>> but switch it >>> > > > > > to use this new map type instead. >>> > > > > > >>> > > > > > It feels like this map type is more generic and can be a >>> superset of >>> > > > > > the existing cgroup storage, but I feel like I am missing >>> something. >>> > > > > >>> > > > > I feel like the biggest issue is that the existing >>> > > > > bpf_get_local_storage helper is guaranteed to always return >>> non-null >>> > > > > and the verifier doesn't require the programs to do null >>> checks on it; >>> > > > > the new helper might return NULL making all existing programs >>> fail the >>> > > > > verifier. >>> > > > >>> > > > What I meant is, keep the old bpf_get_local_storage helper only >>> for >>> > > > cgroup-attached programs like we have today, and add a new generic >>> > > > bpf_cgroup_local_storage_get() helper. >>> > > > >>> > > > For cgroup-attached programs, make sure a cgroup storage entry is >>> > > > allocated and hooked to the helper on program attach time, to keep >>> > > > today's behavior constant. 
>>> > > > >>> > > > For other programs, the bpf_cgroup_local_storage_get() will do the >>> > > > normal lookup and allocate if necessary. >>> > > > >>> > > > Does this make any sense to you? >>> > > >>> > > But then you also need to somehow mark these to make sure it's not >>> > > possible to delete them as long as the program is >>> loaded/attached? Not >>> > > saying it's impossible, but it's a bit of a departure from the >>> > > existing common local storage framework used by inode/task; not sure >>> > > whether we want to pull all this complexity in there? But we can >>> > > definitely try if there is a wider agreement.. >>> > >>> > I agree that it's not ideal, but it feels like we are comparing two >>> > non-ideal options anyway, I am just throwing ideas around :) >> >>> I don't think it is a good idea to marry the new >>> BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE and the existing >>> BPF_MAP_TYPE_CGROUP_STORAGE in any way. The API is very different. >>> A few >>> have already been mentioned here. Delete is one. Storage creation >>> time is >>> another one. The map key is also different. Yes, maybe we can reuse >>> the >>> different key size concept in bpf_cgroup_storage_key in some way but >>> still >>> feel too much unnecessary quirks for the existing sk/inode/task storage >>> users to remember. >> >>> imo, it is better to keep them separate and have a different map-type. >>> Adding a map flag or using map extra will make it sounds like an >>> extension >>> which it is not. >> >> This part is the most confusing to me: >> >> BPF_MAP_TYPE_CGROUP_STORAGE bpf_get_local_storage >> BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE bpf_cgroup_local_storage_get >> >> The new helpers should probably drop 'local' name to match the >> task/inode ([0])? 
>> And we're left with: >> >> BPF_MAP_TYPE_CGROUP_STORAGE bpf_get_local_storage >> BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE bpf_cgroup_storage_get >> >> You read CGROUP_STORAGE via get_local_storage and >> you read CGROUP_LOCAL_STORAGE via cgroup_storage_get :-/ > > Yep, agree that it is not ideal :( I guess I need to add more documentation to explain the difference of old and new map regardless of the final names. > >> >> That's why I'm slightly tilting towards reusing the name. At least we can >> add a big DEPRECATED message for bpf_get_local_storage and that seems >> to be >> it? All those extra key sizes can also be deprecated, but I'm honestly >> not sure if anybody is using them. > > Reusing 'key_size == sizeof(int)' to mean new map type...hmm... I have > been thinking about it after your suggestion in another reply since it > can use the BPF_MAP_TYPE_CGROUP_STORAGE name. I wish the > BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE was given to the > bpf_get_local_storage() instead because it is a better name to describe > what it is doing. > > hmm.... However, this feels working like a map_flags or map_extra but in > a more hidden way. I am worry it will actually be more confusing and > also having usage surprises when there are quite many behavior > differences that this thread has already mentioned. That will be hard > for the user to reason those API differences just because of using a > different key_size. > > May be going back to revisit the naming a little bit. How about giving > a new and likely more correct 'BPF_MAP_TYPE_CGRP_LOCAL_STORAGE' name for > the existing bpf_get_local_storage() use. Then > > '#define BPF_MAP_TYPE_CGROUP_STORAGE BPF_MAP_TYPE_CGRP_LOCAL_STORAGE /* > depreciated by BPF_MAP_TYPE_CGRP_STORAGE */' in the uapi. > > The new cgroup storage uses a shorter name "cgrp", like > BPF_MAP_TYPE_CGRP_STORAGE and bpf_cgrp_storage_get()? This might work and the naming convention will be similar to existing sk/inode/task storage. 
Another alternative is to name the map as BPF_MAP_TYPE_CGROUP_STORAGE2 to indicate it is a different version of cgroup_storage map and the documentation should explain the difference clearly. This should avoid the possible confusion between BPF_MAP_TYPE_CGROUP_STORAGE and BPF_MAP_TYPE_CGRP_STORAGE. > >> >> But having a separate map also seems fine, as long as we have a patch to >> update the existing header documentation. (and mention in >> Documentation/bpf/map_cgroup_storage.rst that there is a replacement?) >> Current bpf_get_local_storage description is too vague; let's at least >> mention that it works only with BPF_MAP_TYPE_CGROUP_STORAGE. >> >> 0: >> https://lore.kernel.org/bpf/6ce7d490-f015-531f-3dbb-b6f7717f0590@meta.com/T/#mb2107250caa19a8d9ec3549a52f4a9698be99e33 >> >>> > > >>> > > > > There might be something else I don't remember at this point >>> (besides >>> > > > > that weird per-prog_type that we'd have to emulate as well).. >>> > > > >>> > > > Yeah there are things that will need to be emulated, but I feel >>> like >>> > > > we may end up with less confusing code (and less code in general). >> >> > ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs 2022-10-18 5:59 ` Yonghong Song @ 2022-10-18 17:08 ` sdf 2022-10-18 17:17 ` Alexei Starovoitov 0 siblings, 1 reply; 38+ messages in thread From: sdf @ 2022-10-18 17:08 UTC (permalink / raw) To: Yonghong Song Cc: Martin KaFai Lau, Yosry Ahmed, Yonghong Song, bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo On 10/17, Yonghong Song wrote: > On 10/17/22 5:52 PM, Martin KaFai Lau wrote: > > On 10/17/22 3:16 PM, sdf@google.com wrote: > > > On 10/17, Martin KaFai Lau wrote: > > > > On 10/17/22 12:11 PM, Yosry Ahmed wrote: > > > > > On Mon, Oct 17, 2022 at 12:07 PM Stanislav Fomichev > > > > <sdf@google.com> wrote: > > > > > > > > > > > > On Mon, Oct 17, 2022 at 11:47 AM Yosry Ahmed > > > > <yosryahmed@google.com> wrote: > > > > > > > > > > > > > > On Mon, Oct 17, 2022 at 11:43 AM Stanislav Fomichev > > > > <sdf@google.com> wrote: > > > > > > > > > > > > > > > > On Mon, Oct 17, 2022 at 11:26 AM Yosry Ahmed > > > > <yosryahmed@google.com> wrote: > > > > > > > > > > > > > > > > > > On Mon, Oct 17, 2022 at 11:02 AM <sdf@google.com> wrote: > > > > > > > > > > > > > > > > > > > > On 10/13, Yonghong Song wrote: > > > > > > > > > > > Similar to sk/inode/task storage, implement > > > > similar cgroup local storage. > > > > > > > > > > > > > > > > > > > > > There already exists a local storage > > > > implementation for cgroup-attached > > > > > > > > > > > bpf programs. See map type > > > > BPF_MAP_TYPE_CGROUP_STORAGE and helper > > > > > > > > > > > bpf_get_local_storage(). But there are use cases > > > > such that non-cgroup > > > > > > > > > > > attached bpf progs wants to access cgroup local > > > > storage data. For example, > > > > > > > > > > > tc egress prog has access to sk and cgroup. 
It is > > > > possible to use > > > > > > > > > > > sk local storage to emulate cgroup local storage > > > > by storing data in > > > > > > > > > > > socket. > > > > > > > > > > > But this is a waste as it could be lots of sockets > > > > belonging to a > > > > > > > > > > > particular > > > > > > > > > > > cgroup. Alternatively, a separate map can be > > > > created with cgroup id as > > > > > > > > > > > the key. > > > > > > > > > > > But this will introduce additional overhead to > > > > manipulate the new map. > > > > > > > > > > > A cgroup local storage, similar to existing > > > > sk/inode/task storage, > > > > > > > > > > > should help for this use case. > > > > > > > > > > > > > > > > > > > > > The life-cycle of storage is managed with the > > > > life-cycle of the > > > > > > > > > > > cgroup struct. i.e. the storage is destroyed > > > > along with the owning cgroup > > > > > > > > > > > with a callback to the bpf_cgroup_storage_free > > > > when cgroup itself > > > > > > > > > > > is deleted. > > > > > > > > > > > > > > > > > > > > > The userspace map operations can be done by using > > > > a cgroup fd as a key > > > > > > > > > > > passed to the lookup, update and delete operations. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > [..] > > > > > > > > > > > > > > > > > > > > > Since map name BPF_MAP_TYPE_CGROUP_STORAGE has > > > > been used for old cgroup > > > > > > > > > > > local > > > > > > > > > > > storage support, the new map name > > > > BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE is > > > > > > > > > > > used > > > > > > > > > > > for cgroup storage available to > > > > non-cgroup-attached bpf programs. The two > > > > > > > > > > > helpers are named as bpf_cgroup_local_storage_get() > and > > > > > > > > > > > bpf_cgroup_local_storage_delete(). 
> > > > > > > > > > > > > > > > > > > > Have you considered doing something similar to > > > > 7d9c3427894f ("bpf: Make > > > > > > > > > > cgroup storages shared between programs on the same > > > > cgroup") where > > > > > > > > > > the map changes its behavior depending on the key > > > > size (see key_size checks > > > > > > > > > > in cgroup_storage_map_alloc)? Looks like sizeof(int) > > > > for fd still > > > > > > > > > > can be used so we can, in theory, reuse the name.. > > > > > > > > > > > > > > > > > > > > Pros: > > > > > > > > > > - no need for a new map name > > > > > > > > > > > > > > > > > > > > Cons: > > > > > > > > > > - existing BPF_MAP_TYPE_CGROUP_STORAGE is already > > > > messy; might be not a > > > > > > > > > >    good idea to add more stuff to it? > > > > > > > > > > > > > > > > > > > > But, for the very least, should we also extend > > > > > > > > > > Documentation/bpf/map_cgroup_storage.rst to cover > > > > the new map? We've > > > > > > > > > > tried to keep some of the important details in there.. > > > > > > > > > > > > > > > > > > This might be a long shot, but is it possible to > > > > switch completely to > > > > > > > > > this new generic cgroup storage, and for programs that > > > > attach to > > > > > > > > > cgroups we can still do lookups/allocations during > > > > attachment like we > > > > > > > > > do today? IOW, maintain the current API for cgroup > > > > progs but switch it > > > > > > > > > to use this new map type instead. > > > > > > > > > > > > > > > > > > It feels like this map type is more generic and can be > > > > a superset of > > > > > > > > > the existing cgroup storage, but I feel like I am > > > > missing something. 
> > > > > > > > I feel like the biggest issue is that the existing > > > > > > > > bpf_get_local_storage helper is guaranteed to always > > > > return non-null > > > > > > > > and the verifier doesn't require the programs to do null > > > > checks on it; > > > > > > > > the new helper might return NULL making all existing > > > > programs fail the > > > > > > > > verifier. > > > > > > > > > > > > > > What I meant is, keep the old bpf_get_local_storage helper > > > > only for > > > > > > > cgroup-attached programs like we have today, and add a new > generic > > > > > > > bpf_cgroup_local_storage_get() helper. > > > > > > > > > > > > > > For cgroup-attached programs, make sure a cgroup storage > entry is > > > > > > > allocated and hooked to the helper on program attach time, to > keep > > > > > > > today's behavior constant. > > > > > > > > > > > > > > For other programs, the bpf_cgroup_local_storage_get() will > do the > > > > > > > normal lookup and allocate if necessary. > > > > > > > > > > > > > > Does this make any sense to you? > > > > > > > > > > > > But then you also need to somehow mark these to make sure it's > not > > > > > > possible to delete them as long as the program is > > > > loaded/attached? Not > > > > > > saying it's impossible, but it's a bit of a departure from the > > > > > > existing common local storage framework used by inode/task; not > sure > > > > > > whether we want to pull all this complexity in there? But we can > > > > > > definitely try if there is a wider agreement.. > > > > > > > > > > I agree that it's not ideal, but it feels like we are comparing > two > > > > > non-ideal options anyway, I am just throwing ideas around :) > > > > > > > I don't think it is a good idea to marry the new > > > > BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE and the existing > > > > BPF_MAP_TYPE_CGROUP_STORAGE in any way. The API is very > > > > different. 
A few > > > > have already been mentioned here. Delete is one. Storage > > > > creation time is > > > > another one. The map key is also different. Yes, maybe we can > > > > reuse the > > > > different key size concept in bpf_cgroup_storage_key in some way > > > > but still > > > > feel too much unnecessary quirks for the existing sk/inode/task > storage > > > > users to remember. > > > > > > > imo, it is better to keep them separate and have a different > map-type. > > > > Adding a map flag or using map extra will make it sounds like an > > > > extension > > > > which it is not. > > > > > > This part is the most confusing to me: > > > > > > BPF_MAP_TYPE_CGROUP_STORAGE       bpf_get_local_storage > > > BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE bpf_cgroup_local_storage_get > > > > > > The new helpers should probably drop 'local' name to match the > > > task/inode ([0])? > > > And we're left with: > > > > > > BPF_MAP_TYPE_CGROUP_STORAGE       bpf_get_local_storage > > > BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE bpf_cgroup_storage_get > > > > > > You read CGROUP_STORAGE via get_local_storage and > > > you read CGROUP_LOCAL_STORAGE via cgroup_storage_get :-/ > > > > Yep, agree that it is not ideal :( > I guess I need to add more documentation to explain the difference > of old and new map regardless of the final names. > > > > > > > > That's why I'm slightly tilting towards reusing the name. At least we > can > > > add a big DEPRECATED message for bpf_get_local_storage and that > > > seems to be > > > it? All those extra key sizes can also be deprecated, but I'm honestly > > > not sure if anybody is using them. > > > > Reusing 'key_size == sizeof(int)' to mean new map type...hmm... I have > > been thinking about it after your suggestion in another reply since it > > can use the BPF_MAP_TYPE_CGROUP_STORAGE name. I wish the > > BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE was given to the > > bpf_get_local_storage() instead because it is a better name to describe > > what it is doing. 
> > > > hmm.... However, this feels working like a map_flags or map_extra but in > > a more hidden way. I am worry it will actually be more confusing and > > also having usage surprises when there are quite many behavior > > differences that this thread has already mentioned. That will be hard > > for the user to reason those API differences just because of using a > > different key_size. > > > > May be going back to revisit the naming a little bit. How about giving > > a new and likely more correct 'BPF_MAP_TYPE_CGRP_LOCAL_STORAGE' name for > > the existing bpf_get_local_storage() use. Then > > > > '#define BPF_MAP_TYPE_CGROUP_STORAGE BPF_MAP_TYPE_CGRP_LOCAL_STORAGE /* > > depreciated by BPF_MAP_TYPE_CGRP_STORAGE */' in the uapi. > > > > The new cgroup storage uses a shorter name "cgrp", like > > BPF_MAP_TYPE_CGRP_STORAGE and bpf_cgrp_storage_get()? > This might work and the naming convention will be similar to > existing sk/inode/task storage. +1, CGRP_STORAGE sounds good! > Another alternative is to name the map as > BPF_MAP_TYPE_CGROUP_STORAGE2 > to indicate it is a different version of cgroup_storage map > and the documentation should explain the difference clearly. > This should avoid the possible confusion between > BPF_MAP_TYPE_CGROUP_STORAGE and BPF_MAP_TYPE_CGRP_STORAGE. > > > > > > > > But having a separate map also seems fine, as long as we have a patch > to > > > update the existing header documentation. (and mention in > > > Documentation/bpf/map_cgroup_storage.rst that there is a replacement?) > > > Current bpf_get_local_storage description is too vague; let's at least > > > mention that it works only with BPF_MAP_TYPE_CGROUP_STORAGE. 
> > > > > > 0: > https://lore.kernel.org/bpf/6ce7d490-f015-531f-3dbb-b6f7717f0590@meta.com/T/#mb2107250caa19a8d9ec3549a52f4a9698be99e33 > > > > > > > > > > > > > > > > > There might be something else I don't remember at this > > > > point (besides > > > > > > > > that weird per-prog_type that we'd have to emulate as > well).. > > > > > > > > > > > > > > Yeah there are things that will need to be emulated, but I > > > > feel like > > > > > > > we may end up with less confusing code (and less code in > general). > > > > > > > > ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs 2022-10-18 17:08 ` sdf @ 2022-10-18 17:17 ` Alexei Starovoitov 2022-10-18 18:08 ` Martin KaFai Lau 2022-10-18 23:12 ` Andrii Nakryiko 0 siblings, 2 replies; 38+ messages in thread From: Alexei Starovoitov @ 2022-10-18 17:17 UTC (permalink / raw) To: Stanislav Fomichev Cc: Yonghong Song, Martin KaFai Lau, Yosry Ahmed, Yonghong Song, bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, Kernel Team, KP Singh, Martin KaFai Lau, Tejun Heo On Tue, Oct 18, 2022 at 10:08 AM <sdf@google.com> wrote: > > > > > > '#define BPF_MAP_TYPE_CGROUP_STORAGE BPF_MAP_TYPE_CGRP_LOCAL_STORAGE /* > > > depreciated by BPF_MAP_TYPE_CGRP_STORAGE */' in the uapi. > > > > > > The new cgroup storage uses a shorter name "cgrp", like > > > BPF_MAP_TYPE_CGRP_STORAGE and bpf_cgrp_storage_get()? > > > This might work and the naming convention will be similar to > > existing sk/inode/task storage. > > +1, CGRP_STORAGE sounds good! +1 from me as well. Something like this ? diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 17f61338f8f8..13dcb2418847 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -922,7 +922,8 @@ enum bpf_map_type { BPF_MAP_TYPE_CPUMAP, BPF_MAP_TYPE_XSKMAP, BPF_MAP_TYPE_SOCKHASH, - BPF_MAP_TYPE_CGROUP_STORAGE, + BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED, + BPF_MAP_TYPE_CGROUP_STORAGE = BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED, BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE, BPF_MAP_TYPE_QUEUE, @@ -935,6 +936,7 @@ enum bpf_map_type { BPF_MAP_TYPE_TASK_STORAGE, BPF_MAP_TYPE_BLOOM_FILTER, BPF_MAP_TYPE_USER_RINGBUF, + BPF_MAP_TYPE_CGRP_STORAGE, }; What are we going to do with BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE ? Probably should come up with a replacement as well? ^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs 2022-10-18 17:17 ` Alexei Starovoitov @ 2022-10-18 18:08 ` Martin KaFai Lau 2022-10-18 18:11 ` Yosry Ahmed 2022-10-18 23:12 ` Andrii Nakryiko 1 sibling, 1 reply; 38+ messages in thread From: Martin KaFai Lau @ 2022-10-18 18:08 UTC (permalink / raw) To: Alexei Starovoitov Cc: Yonghong Song, Yosry Ahmed, Yonghong Song, bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, Kernel Team, KP Singh, Martin KaFai Lau, Tejun Heo, Stanislav Fomichev On 10/18/22 10:17 AM, Alexei Starovoitov wrote: > On Tue, Oct 18, 2022 at 10:08 AM <sdf@google.com> wrote: >>>> >>>> '#define BPF_MAP_TYPE_CGROUP_STORAGE BPF_MAP_TYPE_CGRP_LOCAL_STORAGE /* >>>> depreciated by BPF_MAP_TYPE_CGRP_STORAGE */' in the uapi. >>>> >>>> The new cgroup storage uses a shorter name "cgrp", like >>>> BPF_MAP_TYPE_CGRP_STORAGE and bpf_cgrp_storage_get()? >> >>> This might work and the naming convention will be similar to >>> existing sk/inode/task storage. >> >> +1, CGRP_STORAGE sounds good! > > +1 from me as well. > > Something like this ? > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > index 17f61338f8f8..13dcb2418847 100644 > --- a/include/uapi/linux/bpf.h > +++ b/include/uapi/linux/bpf.h > @@ -922,7 +922,8 @@ enum bpf_map_type { > BPF_MAP_TYPE_CPUMAP, > BPF_MAP_TYPE_XSKMAP, > BPF_MAP_TYPE_SOCKHASH, > - BPF_MAP_TYPE_CGROUP_STORAGE, > + BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED, > + BPF_MAP_TYPE_CGROUP_STORAGE = BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED, +1 > BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, > BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE, > BPF_MAP_TYPE_QUEUE, > @@ -935,6 +936,7 @@ enum bpf_map_type { > BPF_MAP_TYPE_TASK_STORAGE, > BPF_MAP_TYPE_BLOOM_FILTER, > BPF_MAP_TYPE_USER_RINGBUF, > + BPF_MAP_TYPE_CGRP_STORAGE, > }; > > What are we going to do with BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE ? > Probably should come up with a replacement as well? Yeah, need to come up with a percpu answer for it. 
The percpu usage has never come up on the sk storage and also the later task/inode storage. Or the user is just getting by with an array like map's value. Maybe the bpf prog can call bpf_mem_alloc() to alloc the percpu memory in the future and then store it as the kptr in the BPF_MAP_TYPE_CGRP_STORAGE? ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs 2022-10-18 18:08 ` Martin KaFai Lau @ 2022-10-18 18:11 ` Yosry Ahmed 2022-10-18 18:26 ` Yonghong Song 0 siblings, 1 reply; 38+ messages in thread From: Yosry Ahmed @ 2022-10-18 18:11 UTC (permalink / raw) To: Martin KaFai Lau Cc: Alexei Starovoitov, Yonghong Song, Yonghong Song, bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, Kernel Team, KP Singh, Martin KaFai Lau, Tejun Heo, Stanislav Fomichev On Tue, Oct 18, 2022 at 11:08 AM Martin KaFai Lau <martin.lau@linux.dev> wrote: > > On 10/18/22 10:17 AM, Alexei Starovoitov wrote: > > On Tue, Oct 18, 2022 at 10:08 AM <sdf@google.com> wrote: > >>>> > >>>> '#define BPF_MAP_TYPE_CGROUP_STORAGE BPF_MAP_TYPE_CGRP_LOCAL_STORAGE /* > >>>> depreciated by BPF_MAP_TYPE_CGRP_STORAGE */' in the uapi. > >>>> > >>>> The new cgroup storage uses a shorter name "cgrp", like > >>>> BPF_MAP_TYPE_CGRP_STORAGE and bpf_cgrp_storage_get()? > >> > >>> This might work and the naming convention will be similar to > >>> existing sk/inode/task storage. > >> > >> +1, CGRP_STORAGE sounds good! > > > > +1 from me as well. > > > > Something like this ? 
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > > index 17f61338f8f8..13dcb2418847 100644 > > --- a/include/uapi/linux/bpf.h > > +++ b/include/uapi/linux/bpf.h > > @@ -922,7 +922,8 @@ enum bpf_map_type { > > BPF_MAP_TYPE_CPUMAP, > > BPF_MAP_TYPE_XSKMAP, > > BPF_MAP_TYPE_SOCKHASH, > > - BPF_MAP_TYPE_CGROUP_STORAGE, > > + BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED, > > + BPF_MAP_TYPE_CGROUP_STORAGE = BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED, > > +1 > > > BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, > > BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE, > > BPF_MAP_TYPE_QUEUE, > > @@ -935,6 +936,7 @@ enum bpf_map_type { > > BPF_MAP_TYPE_TASK_STORAGE, > > BPF_MAP_TYPE_BLOOM_FILTER, > > BPF_MAP_TYPE_USER_RINGBUF, > > + BPF_MAP_TYPE_CGRP_STORAGE, > > }; > > > > What are we going to do with BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE ? > > Probably should come up with a replacement as well? > > Yeah, need to come up with a percpu answer for it. The percpu usage has never > come up on the sk storage and also the later task/inode storage. or the user is > just getting by with an array like map's value. > > May be the bpf prog can call bpf_mem_alloc() to alloc the percpu memory in the > future and then store it as the kptr in the BPF_MAP_TYPE_CGRP_STORAGE? A percpu cgroup storage would be very beneficial for cgroup statistics collection, things like the selftest in tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c currently uses a percpu hashmap indexed by cgroup id, so using a percpu cgroup storage instead would be a nice upgrade. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs 2022-10-18 18:11 ` Yosry Ahmed @ 2022-10-18 18:26 ` Yonghong Song 0 siblings, 0 replies; 38+ messages in thread From: Yonghong Song @ 2022-10-18 18:26 UTC (permalink / raw) To: Yosry Ahmed, Martin KaFai Lau Cc: Alexei Starovoitov, Yonghong Song, bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, Kernel Team, KP Singh, Martin KaFai Lau, Tejun Heo, Stanislav Fomichev On 10/18/22 11:11 AM, Yosry Ahmed wrote: > On Tue, Oct 18, 2022 at 11:08 AM Martin KaFai Lau <martin.lau@linux.dev> wrote: >> >> On 10/18/22 10:17 AM, Alexei Starovoitov wrote: >>> On Tue, Oct 18, 2022 at 10:08 AM <sdf@google.com> wrote: >>>>>> >>>>>> '#define BPF_MAP_TYPE_CGROUP_STORAGE BPF_MAP_TYPE_CGRP_LOCAL_STORAGE /* >>>>>> depreciated by BPF_MAP_TYPE_CGRP_STORAGE */' in the uapi. >>>>>> >>>>>> The new cgroup storage uses a shorter name "cgrp", like >>>>>> BPF_MAP_TYPE_CGRP_STORAGE and bpf_cgrp_storage_get()? >>>> >>>>> This might work and the naming convention will be similar to >>>>> existing sk/inode/task storage. >>>> >>>> +1, CGRP_STORAGE sounds good! >>> >>> +1 from me as well. >>> >>> Something like this ? >>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h >>> index 17f61338f8f8..13dcb2418847 100644 >>> --- a/include/uapi/linux/bpf.h >>> +++ b/include/uapi/linux/bpf.h >>> @@ -922,7 +922,8 @@ enum bpf_map_type { >>> BPF_MAP_TYPE_CPUMAP, >>> BPF_MAP_TYPE_XSKMAP, >>> BPF_MAP_TYPE_SOCKHASH, >>> - BPF_MAP_TYPE_CGROUP_STORAGE, >>> + BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED, >>> + BPF_MAP_TYPE_CGROUP_STORAGE = BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED, >> >> +1 >> >>> BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, >>> BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE, >>> BPF_MAP_TYPE_QUEUE, >>> @@ -935,6 +936,7 @@ enum bpf_map_type { >>> BPF_MAP_TYPE_TASK_STORAGE, >>> BPF_MAP_TYPE_BLOOM_FILTER, >>> BPF_MAP_TYPE_USER_RINGBUF, >>> + BPF_MAP_TYPE_CGRP_STORAGE, >>> }; Sounds good to me. 
Will do this in the next revision. >>> >>> What are we going to do with BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE ? >>> Probably should come up with a replacement as well? >> >> Yeah, need to come up with a percpu answer for it. The percpu usage has never >> come up on the sk storage and also the later task/inode storage. or the user is >> just getting by with an array like map's value. >> >> May be the bpf prog can call bpf_mem_alloc() to alloc the percpu memory in the >> future and then store it as the kptr in the BPF_MAP_TYPE_CGRP_STORAGE? > > A percpu cgroup storage would be very beneficial for cgroup statistics > collection, things like the selftest in > tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c > currently uses a percpu hashmap indexed by cgroup id, so using a > percpu cgroup storage instead would be a nice upgrade. Indeed, agree. For cgroup storage, we could have a per-cpu version for the new mechanism so it can replace the old one as well. Will look into this after non per-cpu version is done. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs 2022-10-18 17:17 ` Alexei Starovoitov 2022-10-18 18:08 ` Martin KaFai Lau @ 2022-10-18 23:12 ` Andrii Nakryiko 1 sibling, 0 replies; 38+ messages in thread From: Andrii Nakryiko @ 2022-10-18 23:12 UTC (permalink / raw) To: Alexei Starovoitov Cc: Stanislav Fomichev, Yonghong Song, Martin KaFai Lau, Yosry Ahmed, Yonghong Song, bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, Kernel Team, KP Singh, Martin KaFai Lau, Tejun Heo On Tue, Oct 18, 2022 at 10:18 AM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Tue, Oct 18, 2022 at 10:08 AM <sdf@google.com> wrote: > > > > > > > > '#define BPF_MAP_TYPE_CGROUP_STORAGE BPF_MAP_TYPE_CGRP_LOCAL_STORAGE /* > > > > depreciated by BPF_MAP_TYPE_CGRP_STORAGE */' in the uapi. > > > > > > > > The new cgroup storage uses a shorter name "cgrp", like > > > > BPF_MAP_TYPE_CGRP_STORAGE and bpf_cgrp_storage_get()? > > > > > This might work and the naming convention will be similar to > > > existing sk/inode/task storage. > > > > +1, CGRP_STORAGE sounds good! > > +1 from me as well. it's totally bikeshedding zone :) but isn't CG_STORAGE just as recognizable but easier to mentally read as well? Like SK for socket, instead of SCKT > > Something like this ? 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > index 17f61338f8f8..13dcb2418847 100644 > --- a/include/uapi/linux/bpf.h > +++ b/include/uapi/linux/bpf.h > @@ -922,7 +922,8 @@ enum bpf_map_type { > BPF_MAP_TYPE_CPUMAP, > BPF_MAP_TYPE_XSKMAP, > BPF_MAP_TYPE_SOCKHASH, > - BPF_MAP_TYPE_CGROUP_STORAGE, > + BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED, > + BPF_MAP_TYPE_CGROUP_STORAGE = BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED, > BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, > BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE, > BPF_MAP_TYPE_QUEUE, > @@ -935,6 +936,7 @@ enum bpf_map_type { > BPF_MAP_TYPE_TASK_STORAGE, > BPF_MAP_TYPE_BLOOM_FILTER, > BPF_MAP_TYPE_USER_RINGBUF, > + BPF_MAP_TYPE_CGRP_STORAGE, > }; > > What are we going to do with BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE ? > Probably should come up with a replacement as well? ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs 2022-10-17 18:47 ` Yosry Ahmed 2022-10-17 19:07 ` Stanislav Fomichev @ 2022-10-17 20:15 ` Yonghong Song 2022-10-17 20:18 ` Yosry Ahmed 1 sibling, 1 reply; 38+ messages in thread From: Yonghong Song @ 2022-10-17 20:15 UTC (permalink / raw) To: Yosry Ahmed, Stanislav Fomichev Cc: Yonghong Song, bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo On 10/17/22 11:47 AM, Yosry Ahmed wrote: > On Mon, Oct 17, 2022 at 11:43 AM Stanislav Fomichev <sdf@google.com> wrote: >> >> On Mon, Oct 17, 2022 at 11:26 AM Yosry Ahmed <yosryahmed@google.com> wrote: >>> >>> On Mon, Oct 17, 2022 at 11:02 AM <sdf@google.com> wrote: >>>> >>>> On 10/13, Yonghong Song wrote: >>>>> Similar to sk/inode/task storage, implement similar cgroup local storage. >>>> >>>>> There already exists a local storage implementation for cgroup-attached >>>>> bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper >>>>> bpf_get_local_storage(). But there are use cases such that non-cgroup >>>>> attached bpf progs wants to access cgroup local storage data. For example, >>>>> tc egress prog has access to sk and cgroup. It is possible to use >>>>> sk local storage to emulate cgroup local storage by storing data in >>>>> socket. >>>>> But this is a waste as it could be lots of sockets belonging to a >>>>> particular >>>>> cgroup. Alternatively, a separate map can be created with cgroup id as >>>>> the key. >>>>> But this will introduce additional overhead to manipulate the new map. >>>>> A cgroup local storage, similar to existing sk/inode/task storage, >>>>> should help for this use case. >>>> >>>>> The life-cycle of storage is managed with the life-cycle of the >>>>> cgroup struct. i.e. the storage is destroyed along with the owning cgroup >>>>> with a callback to the bpf_cgroup_storage_free when cgroup itself >>>>> is deleted. 
>>>> >>>>> The userspace map operations can be done by using a cgroup fd as a key >>>>> passed to the lookup, update and delete operations. >>>> >>>> >>>> [..] >>>> >>>>> Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old cgroup >>>>> local >>>>> storage support, the new map name BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE is >>>>> used >>>>> for cgroup storage available to non-cgroup-attached bpf programs. The two >>>>> helpers are named as bpf_cgroup_local_storage_get() and >>>>> bpf_cgroup_local_storage_delete(). >>>> >>>> Have you considered doing something similar to 7d9c3427894f ("bpf: Make >>>> cgroup storages shared between programs on the same cgroup") where >>>> the map changes its behavior depending on the key size (see key_size checks >>>> in cgroup_storage_map_alloc)? Looks like sizeof(int) for fd still >>>> can be used so we can, in theory, reuse the name.. >>>> >>>> Pros: >>>> - no need for a new map name >>>> >>>> Cons: >>>> - existing BPF_MAP_TYPE_CGROUP_STORAGE is already messy; might be not a >>>> good idea to add more stuff to it? >>>> >>>> But, for the very least, should we also extend >>>> Documentation/bpf/map_cgroup_storage.rst to cover the new map? We've >>>> tried to keep some of the important details in there.. >>> >>> This might be a long shot, but is it possible to switch completely to >>> this new generic cgroup storage, and for programs that attach to >>> cgroups we can still do lookups/allocations during attachment like we >>> do today? IOW, maintain the current API for cgroup progs but switch it >>> to use this new map type instead. >>> >>> It feels like this map type is more generic and can be a superset of >>> the existing cgroup storage, but I feel like I am missing something. 
>> >> I feel like the biggest issue is that the existing >> bpf_get_local_storage helper is guaranteed to always return non-null >> and the verifier doesn't require the programs to do null checks on it; >> the new helper might return NULL making all existing programs fail the >> verifier. > > What I meant is, keep the old bpf_get_local_storage helper only for > cgroup-attached programs like we have today, and add a new generic > bpf_cgroup_local_storage_get() helper. > > For cgroup-attached programs, make sure a cgroup storage entry is > allocated and hooked to the helper on program attach time, to keep > today's behavior constant. > > For other programs, the bpf_cgroup_local_storage_get() will do the > normal lookup and allocate if necessary. > > Does this make any sense to you? Right. This is what I plan to do. The map will add a flag to distinguish the old and new behavior. > >> >> There might be something else I don't remember at this point (besides >> that weird per-prog_type that we'd have to emulate as well).. > > Yeah there are things that will need to be emulated, but I feel like > we may end up with less confusing code (and less code in general). > >> >>>> >>>>> Signed-off-by: Yonghong Song <yhs@fb.com> >>>>> --- >>>>> include/linux/bpf.h | 3 + >>>>> include/linux/bpf_types.h | 1 + >>>>> include/linux/cgroup-defs.h | 4 + >>>>> include/uapi/linux/bpf.h | 39 +++++ >>>>> kernel/bpf/Makefile | 2 +- >>>>> kernel/bpf/bpf_cgroup_storage.c | 280 ++++++++++++++++++++++++++++++++ >>>>> kernel/bpf/helpers.c | 6 + >>>>> kernel/bpf/syscall.c | 3 +- >>>>> kernel/bpf/verifier.c | 14 +- >>>>> kernel/cgroup/cgroup.c | 4 + >>>>> kernel/trace/bpf_trace.c | 4 + >>>>> scripts/bpf_doc.py | 2 + >>>>> tools/include/uapi/linux/bpf.h | 39 +++++ >>>>> 13 files changed, 398 insertions(+), 3 deletions(-) >>>>> create mode 100644 kernel/bpf/bpf_cgroup_storage.c [...] ^ permalink raw reply [flat|nested] 38+ messages in thread
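[Editorial note: the verifier concern Stanislav raises above — the old helper's non-NULL guarantee versus the new helper's possible NULL return — can be modeled in plain C. This is a hedged sketch of the two calling contracts only; the function names and shapes are hypothetical and do not reflect real kernel or BPF helper signatures:]

```c
#include <assert.h>
#include <stddef.h>

struct storage_sketch { long counter; };

/* Old-style contract: storage is preallocated at attach time, so the
 * caller may dereference the result without a NULL check (and the
 * verifier never demanded one). */
static struct storage_sketch preallocated;
static struct storage_sketch *get_local_storage_old(void)
{
	return &preallocated;	/* never NULL by construction */
}

/* New-style contract: storage is created lazily, allocation may fail,
 * so every caller must check for NULL before dereferencing. */
static struct storage_sketch *cgroup_local_storage_get(int alloc_ok)
{
	static struct storage_sketch lazy;
	return alloc_ok ? &lazy : NULL;
}

static long bump(int alloc_ok)
{
	struct storage_sketch *s = cgroup_local_storage_get(alloc_ok);

	if (!s)			/* mandatory NULL check with the new helper */
		return -1;
	return ++s->counter;
}
```

Silently switching existing programs to the second contract would make them fail verification, which is why the thread converges on keeping both helpers.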
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs 2022-10-17 20:15 ` Yonghong Song @ 2022-10-17 20:18 ` Yosry Ahmed 0 siblings, 0 replies; 38+ messages in thread From: Yosry Ahmed @ 2022-10-17 20:18 UTC (permalink / raw) To: Yonghong Song Cc: Stanislav Fomichev, Yonghong Song, bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo On Mon, Oct 17, 2022 at 1:15 PM Yonghong Song <yhs@meta.com> wrote: > > > > On 10/17/22 11:47 AM, Yosry Ahmed wrote: > > On Mon, Oct 17, 2022 at 11:43 AM Stanislav Fomichev <sdf@google.com> wrote: > >> > >> On Mon, Oct 17, 2022 at 11:26 AM Yosry Ahmed <yosryahmed@google.com> wrote: > >>> > >>> On Mon, Oct 17, 2022 at 11:02 AM <sdf@google.com> wrote: > >>>> > >>>> On 10/13, Yonghong Song wrote: > >>>>> Similar to sk/inode/task storage, implement similar cgroup local storage. > >>>> > >>>>> There already exists a local storage implementation for cgroup-attached > >>>>> bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper > >>>>> bpf_get_local_storage(). But there are use cases such that non-cgroup > >>>>> attached bpf progs wants to access cgroup local storage data. For example, > >>>>> tc egress prog has access to sk and cgroup. It is possible to use > >>>>> sk local storage to emulate cgroup local storage by storing data in > >>>>> socket. > >>>>> But this is a waste as it could be lots of sockets belonging to a > >>>>> particular > >>>>> cgroup. Alternatively, a separate map can be created with cgroup id as > >>>>> the key. > >>>>> But this will introduce additional overhead to manipulate the new map. > >>>>> A cgroup local storage, similar to existing sk/inode/task storage, > >>>>> should help for this use case. > >>>> > >>>>> The life-cycle of storage is managed with the life-cycle of the > >>>>> cgroup struct. i.e. 
the storage is destroyed along with the owning cgroup > >>>>> with a callback to the bpf_cgroup_storage_free when cgroup itself > >>>>> is deleted. > >>>> > >>>>> The userspace map operations can be done by using a cgroup fd as a key > >>>>> passed to the lookup, update and delete operations. > >>>> > >>>> > >>>> [..] > >>>> > >>>>> Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old cgroup > >>>>> local > >>>>> storage support, the new map name BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE is > >>>>> used > >>>>> for cgroup storage available to non-cgroup-attached bpf programs. The two > >>>>> helpers are named as bpf_cgroup_local_storage_get() and > >>>>> bpf_cgroup_local_storage_delete(). > >>>> > >>>> Have you considered doing something similar to 7d9c3427894f ("bpf: Make > >>>> cgroup storages shared between programs on the same cgroup") where > >>>> the map changes its behavior depending on the key size (see key_size checks > >>>> in cgroup_storage_map_alloc)? Looks like sizeof(int) for fd still > >>>> can be used so we can, in theory, reuse the name.. > >>>> > >>>> Pros: > >>>> - no need for a new map name > >>>> > >>>> Cons: > >>>> - existing BPF_MAP_TYPE_CGROUP_STORAGE is already messy; might be not a > >>>> good idea to add more stuff to it? > >>>> > >>>> But, for the very least, should we also extend > >>>> Documentation/bpf/map_cgroup_storage.rst to cover the new map? We've > >>>> tried to keep some of the important details in there.. > >>> > >>> This might be a long shot, but is it possible to switch completely to > >>> this new generic cgroup storage, and for programs that attach to > >>> cgroups we can still do lookups/allocations during attachment like we > >>> do today? IOW, maintain the current API for cgroup progs but switch it > >>> to use this new map type instead. > >>> > >>> It feels like this map type is more generic and can be a superset of > >>> the existing cgroup storage, but I feel like I am missing something. 
> >> > >> I feel like the biggest issue is that the existing > >> bpf_get_local_storage helper is guaranteed to always return non-null > >> and the verifier doesn't require the programs to do null checks on it; > >> the new helper might return NULL making all existing programs fail the > >> verifier. > > > > What I meant is, keep the old bpf_get_local_storage helper only for > > cgroup-attached programs like we have today, and add a new generic > > bpf_cgroup_local_storage_get() helper. > > > > For cgroup-attached programs, make sure a cgroup storage entry is > > allocated and hooked to the helper on program attach time, to keep > > today's behavior constant. > > > > For other programs, the bpf_cgroup_local_storage_get() will do the > > normal lookup and allocate if necessary. > > > > Does this make any sense to you? > > Right. This is what I plan to do. The map will add a flag to > distinguish the old and new behavior. > This might not make any sense, but is this doable without a flag? Basically extend the new map type so that it has some special behaviors for cgroup attached programs (allocate memory on program attach, bpf_get_local_storage() automatically gets entry for the attached cgroup, etc). > > > >> > >> There might be something else I don't remember at this point (besides > >> that weird per-prog_type that we'd have to emulate as well).. > > > > Yeah there are things that will need to be emulated, but I feel like > > we may end up with less confusing code (and less code in general). 
> > > >> > >>>> > >>>>> Signed-off-by: Yonghong Song <yhs@fb.com> > >>>>> --- > >>>>> include/linux/bpf.h | 3 + > >>>>> include/linux/bpf_types.h | 1 + > >>>>> include/linux/cgroup-defs.h | 4 + > >>>>> include/uapi/linux/bpf.h | 39 +++++ > >>>>> kernel/bpf/Makefile | 2 +- > >>>>> kernel/bpf/bpf_cgroup_storage.c | 280 ++++++++++++++++++++++++++++++++ > >>>>> kernel/bpf/helpers.c | 6 + > >>>>> kernel/bpf/syscall.c | 3 +- > >>>>> kernel/bpf/verifier.c | 14 +- > >>>>> kernel/cgroup/cgroup.c | 4 + > >>>>> kernel/trace/bpf_trace.c | 4 + > >>>>> scripts/bpf_doc.py | 2 + > >>>>> tools/include/uapi/linux/bpf.h | 39 +++++ > >>>>> 13 files changed, 398 insertions(+), 3 deletions(-) > >>>>> create mode 100644 kernel/bpf/bpf_cgroup_storage.c > [...] ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs 2022-10-17 18:43 ` Stanislav Fomichev 2022-10-17 18:47 ` Yosry Ahmed @ 2022-10-17 20:13 ` Yonghong Song 1 sibling, 0 replies; 38+ messages in thread From: Yonghong Song @ 2022-10-17 20:13 UTC (permalink / raw) To: Stanislav Fomichev, Yosry Ahmed Cc: Yonghong Song, bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo On 10/17/22 11:43 AM, Stanislav Fomichev wrote: > On Mon, Oct 17, 2022 at 11:26 AM Yosry Ahmed <yosryahmed@google.com> wrote: >> >> On Mon, Oct 17, 2022 at 11:02 AM <sdf@google.com> wrote: >>> >>> On 10/13, Yonghong Song wrote: >>>> Similar to sk/inode/task storage, implement similar cgroup local storage. >>> >>>> There already exists a local storage implementation for cgroup-attached >>>> bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper >>>> bpf_get_local_storage(). But there are use cases such that non-cgroup >>>> attached bpf progs wants to access cgroup local storage data. For example, >>>> tc egress prog has access to sk and cgroup. It is possible to use >>>> sk local storage to emulate cgroup local storage by storing data in >>>> socket. >>>> But this is a waste as it could be lots of sockets belonging to a >>>> particular >>>> cgroup. Alternatively, a separate map can be created with cgroup id as >>>> the key. >>>> But this will introduce additional overhead to manipulate the new map. >>>> A cgroup local storage, similar to existing sk/inode/task storage, >>>> should help for this use case. >>> >>>> The life-cycle of storage is managed with the life-cycle of the >>>> cgroup struct. i.e. the storage is destroyed along with the owning cgroup >>>> with a callback to the bpf_cgroup_storage_free when cgroup itself >>>> is deleted. >>> >>>> The userspace map operations can be done by using a cgroup fd as a key >>>> passed to the lookup, update and delete operations. 
>>> >>> >>> [..] >>> >>>> Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old cgroup >>>> local >>>> storage support, the new map name BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE is >>>> used >>>> for cgroup storage available to non-cgroup-attached bpf programs. The two >>>> helpers are named as bpf_cgroup_local_storage_get() and >>>> bpf_cgroup_local_storage_delete(). >>> >>> Have you considered doing something similar to 7d9c3427894f ("bpf: Make >>> cgroup storages shared between programs on the same cgroup") where >>> the map changes its behavior depending on the key size (see key_size checks >>> in cgroup_storage_map_alloc)? Looks like sizeof(int) for fd still >>> can be used so we can, in theory, reuse the name.. >>> >>> Pros: >>> - no need for a new map name >>> >>> Cons: >>> - existing BPF_MAP_TYPE_CGROUP_STORAGE is already messy; might be not a >>> good idea to add more stuff to it? >>> >>> But, for the very least, should we also extend >>> Documentation/bpf/map_cgroup_storage.rst to cover the new map? We've >>> tried to keep some of the important details in there.. >> >> This might be a long shot, but is it possible to switch completely to >> this new generic cgroup storage, and for programs that attach to >> cgroups we can still do lookups/allocations during attachment like we >> do today? IOW, maintain the current API for cgroup progs but switch it >> to use this new map type instead. >> >> It feels like this map type is more generic and can be a superset of >> the existing cgroup storage, but I feel like I am missing something. > > I feel like the biggest issue is that the existing > bpf_get_local_storage helper is guaranteed to always return non-null > and the verifier doesn't require the programs to do null checks on it; > the new helper might return NULL making all existing programs fail the > verifier. Ya, this is indeed the case. Another difference is the new helper is able to access data from different cgroups. 
The old helper, by contrast, can only access data from the *current* cgroup. > > There might be something else I don't remember at this point (besides > that weird per-prog_type that we'd have to emulate as well).. > >>> >>>> Signed-off-by: Yonghong Song <yhs@fb.com> >>>> --- >>>> include/linux/bpf.h | 3 + >>>> include/linux/bpf_types.h | 1 + >>>> include/linux/cgroup-defs.h | 4 + >>>> include/uapi/linux/bpf.h | 39 +++++ >>>> kernel/bpf/Makefile | 2 +- >>>> kernel/bpf/bpf_cgroup_storage.c | 280 ++++++++++++++++++++++++++++++++ >>>> kernel/bpf/helpers.c | 6 + >>>> kernel/bpf/syscall.c | 3 +- >>>> kernel/bpf/verifier.c | 14 +- >>>> kernel/cgroup/cgroup.c | 4 + >>>> kernel/trace/bpf_trace.c | 4 + >>>> scripts/bpf_doc.py | 2 + >>>> tools/include/uapi/linux/bpf.h | 39 +++++ >>>> 13 files changed, 398 insertions(+), 3 deletions(-) >>>> create mode 100644 kernel/bpf/bpf_cgroup_storage.c [...] ^ permalink raw reply [flat|nested] 38+ messages in thread
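[Editorial note: the access-scope difference Yonghong points out — implicit *current* cgroup versus an explicit cgroup argument — can be modeled in a few lines of plain C. Names and shapes are hypothetical; this illustrates only the calling convention, not real helper signatures:]

```c
#include <assert.h>
#include <stddef.h>

struct cgroup_sketch { long data; };

static struct cgroup_sketch current_cg = { 1 };
static struct cgroup_sketch other_cg = { 2 };

/* Old helper: implicitly bound to the *current* cgroup of the running
 * program -- there is no way to name a different cgroup. */
static long *get_local_storage_old(void)
{
	return &current_cg.data;
}

/* New helper: takes an explicit cgroup pointer, so e.g. a tracing
 * program can reach the storage of any cgroup it holds a trusted
 * pointer to, and must handle a NULL argument. */
static long *cgroup_local_storage_get(struct cgroup_sketch *cg)
{
	return cg ? &cg->data : NULL;
}
```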
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs 2022-10-17 18:25 ` Yosry Ahmed 2022-10-17 18:43 ` Stanislav Fomichev @ 2022-10-17 20:10 ` Yonghong Song 2022-10-17 20:14 ` Yosry Ahmed 1 sibling, 1 reply; 38+ messages in thread From: Yonghong Song @ 2022-10-17 20:10 UTC (permalink / raw) To: Yosry Ahmed, sdf Cc: Yonghong Song, bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo On 10/17/22 11:25 AM, Yosry Ahmed wrote: > On Mon, Oct 17, 2022 at 11:02 AM <sdf@google.com> wrote: >> >> On 10/13, Yonghong Song wrote: >>> Similar to sk/inode/task storage, implement similar cgroup local storage. >> >>> There already exists a local storage implementation for cgroup-attached >>> bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper >>> bpf_get_local_storage(). But there are use cases such that non-cgroup >>> attached bpf progs wants to access cgroup local storage data. For example, >>> tc egress prog has access to sk and cgroup. It is possible to use >>> sk local storage to emulate cgroup local storage by storing data in >>> socket. >>> But this is a waste as it could be lots of sockets belonging to a >>> particular >>> cgroup. Alternatively, a separate map can be created with cgroup id as >>> the key. >>> But this will introduce additional overhead to manipulate the new map. >>> A cgroup local storage, similar to existing sk/inode/task storage, >>> should help for this use case. >> >>> The life-cycle of storage is managed with the life-cycle of the >>> cgroup struct. i.e. the storage is destroyed along with the owning cgroup >>> with a callback to the bpf_cgroup_storage_free when cgroup itself >>> is deleted. >> >>> The userspace map operations can be done by using a cgroup fd as a key >>> passed to the lookup, update and delete operations. >> >> >> [..] 
>> >>> Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old cgroup >>> local >>> storage support, the new map name BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE is >>> used >>> for cgroup storage available to non-cgroup-attached bpf programs. The two >>> helpers are named as bpf_cgroup_local_storage_get() and >>> bpf_cgroup_local_storage_delete(). >> >> Have you considered doing something similar to 7d9c3427894f ("bpf: Make >> cgroup storages shared between programs on the same cgroup") where >> the map changes its behavior depending on the key size (see key_size checks >> in cgroup_storage_map_alloc)? Looks like sizeof(int) for fd still >> can be used so we can, in theory, reuse the name.. >> >> Pros: >> - no need for a new map name >> >> Cons: >> - existing BPF_MAP_TYPE_CGROUP_STORAGE is already messy; might be not a >> good idea to add more stuff to it? >> >> But, for the very least, should we also extend >> Documentation/bpf/map_cgroup_storage.rst to cover the new map? We've >> tried to keep some of the important details in there.. > > This might be a long shot, but is it possible to switch completely to > this new generic cgroup storage, and for programs that attach to > cgroups we can still do lookups/allocations during attachment like we > do today? IOW, maintain the current API for cgroup progs but switch it > to use this new map type instead. Right, cgroup attach/detach should not be impacted by this patch. > > It feels like this map type is more generic and can be a superset of > the existing cgroup storage, but I feel like I am missing something. One difference is old way cgroup local storage allocates the memory at map creation time, and the new way allocates the memory at runtime when get/update helper is called. 
> >> >>> Signed-off-by: Yonghong Song <yhs@fb.com> >>> --- >>> include/linux/bpf.h | 3 + >>> include/linux/bpf_types.h | 1 + >>> include/linux/cgroup-defs.h | 4 + >>> include/uapi/linux/bpf.h | 39 +++++ >>> kernel/bpf/Makefile | 2 +- >>> kernel/bpf/bpf_cgroup_storage.c | 280 ++++++++++++++++++++++++++++++++ >>> kernel/bpf/helpers.c | 6 + >>> kernel/bpf/syscall.c | 3 +- >>> kernel/bpf/verifier.c | 14 +- >>> kernel/cgroup/cgroup.c | 4 + >>> kernel/trace/bpf_trace.c | 4 + >>> scripts/bpf_doc.py | 2 + >>> tools/include/uapi/linux/bpf.h | 39 +++++ >>> 13 files changed, 398 insertions(+), 3 deletions(-) >>> create mode 100644 kernel/bpf/bpf_cgroup_storage.c >> [...] >>> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile >>> index 341c94f208f4..b02693f51978 100644 >>> --- a/kernel/bpf/Makefile >>> +++ b/kernel/bpf/Makefile >>> @@ -25,7 +25,7 @@ ifeq ($(CONFIG_PERF_EVENTS),y) >>> obj-$(CONFIG_BPF_SYSCALL) += stackmap.o >>> endif >>> ifeq ($(CONFIG_CGROUPS),y) >>> -obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o >>> +obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o bpf_cgroup_storage.o >>> endif >>> obj-$(CONFIG_CGROUP_BPF) += cgroup.o >>> ifeq ($(CONFIG_INET),y) >>> diff --git a/kernel/bpf/bpf_cgroup_storage.c >>> b/kernel/bpf/bpf_cgroup_storage.c >>> new file mode 100644 >>> index 000000000000..9974784822da >>> --- /dev/null >>> +++ b/kernel/bpf/bpf_cgroup_storage.c >>> @@ -0,0 +1,280 @@ >>> +// SPDX-License-Identifier: GPL-2.0 >>> +/* >>> + * Copyright (c) 2022 Meta Platforms, Inc. and affiliates. 
>>> + */ >>> + >>> +#include <linux/types.h> >>> +#include <linux/bpf.h> >>> +#include <linux/bpf_local_storage.h> >>> +#include <uapi/linux/btf.h> >>> +#include <linux/btf_ids.h> >>> + >>> +DEFINE_BPF_STORAGE_CACHE(cgroup_cache); >>> + >>> +static DEFINE_PER_CPU(int, bpf_cgroup_storage_busy); >>> + >>> +static void bpf_cgroup_storage_lock(void) >>> +{ >>> + migrate_disable(); >>> + this_cpu_inc(bpf_cgroup_storage_busy); >>> +} >>> + >>> +static void bpf_cgroup_storage_unlock(void) >>> +{ >>> + this_cpu_dec(bpf_cgroup_storage_busy); >>> + migrate_enable(); >>> +} >>> + >>> +static bool bpf_cgroup_storage_trylock(void) >>> +{ >>> + migrate_disable(); >>> + if (unlikely(this_cpu_inc_return(bpf_cgroup_storage_busy) != 1)) { >>> + this_cpu_dec(bpf_cgroup_storage_busy); >>> + migrate_enable(); >>> + return false; >>> + } >>> + return true; >>> +} >> >> Task storage has lock/unlock/trylock; inode storage doesn't; why does >> cgroup need it as well? I think so. the new cgroup local storage might be used in fentry/fexit programs which could cause recursion. >> >>> +static struct bpf_local_storage __rcu **cgroup_storage_ptr(void *owner) >>> +{ >>> + struct cgroup *cg = owner; >>> + >>> + return &cg->bpf_cgroup_storage; >>> +} >>> + >>> +void bpf_local_cgroup_storage_free(struct cgroup *cgroup) >>> +{ >>> + struct bpf_local_storage *local_storage; >>> + struct bpf_local_storage_elem *selem; >>> + bool free_cgroup_storage = false; >>> + struct hlist_node *n; >>> + unsigned long flags; >>> + >>> + rcu_read_lock(); >>> + local_storage = rcu_dereference(cgroup->bpf_cgroup_storage); >>> + if (!local_storage) { >>> + rcu_read_unlock(); >>> + return; >>> + } >>> + >>> + /* Neither the bpf_prog nor the bpf-map's syscall >>> + * could be modifying the local_storage->list now. >>> + * Thus, no elem can be added-to or deleted-from the >>> + * local_storage->list by the bpf_prog or by the bpf-map's syscall. 
>>> + * >>> + * It is racing with bpf_local_storage_map_free() alone >>> + * when unlinking elem from the local_storage->list and >>> + * the map's bucket->list. >>> + */ >>> + bpf_cgroup_storage_lock(); >>> + raw_spin_lock_irqsave(&local_storage->lock, flags); >>> + hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) { >>> + bpf_selem_unlink_map(selem); >>> + free_cgroup_storage = >>> + bpf_selem_unlink_storage_nolock(local_storage, selem, false, false); >>> + } >>> + raw_spin_unlock_irqrestore(&local_storage->lock, flags); >>> + bpf_cgroup_storage_unlock(); >>> + rcu_read_unlock(); >>> + >>> + /* free_cgroup_storage should always be true as long as >>> + * local_storage->list was non-empty. >>> + */ >>> + if (free_cgroup_storage) >>> + kfree_rcu(local_storage, rcu); >>> +} >> >>> +static struct bpf_local_storage_data * >>> +cgroup_storage_lookup(struct cgroup *cgroup, struct bpf_map *map, bool >>> cacheit_lockit) >>> +{ >>> + struct bpf_local_storage *cgroup_storage; >>> + struct bpf_local_storage_map *smap; >>> + >>> + cgroup_storage = rcu_dereference_check(cgroup->bpf_cgroup_storage, >>> + bpf_rcu_lock_held()); >>> + if (!cgroup_storage) >>> + return NULL; >>> + >>> + smap = (struct bpf_local_storage_map *)map; >>> + return bpf_local_storage_lookup(cgroup_storage, smap, cacheit_lockit); >>> +} >>> + >>> +static void *bpf_cgroup_storage_lookup_elem(struct bpf_map *map, void >>> *key) >>> +{ >>> + struct bpf_local_storage_data *sdata; >>> + struct cgroup *cgroup; >>> + int fd; >>> + >>> + fd = *(int *)key; >>> + cgroup = cgroup_get_from_fd(fd); >>> + if (IS_ERR(cgroup)) >>> + return ERR_CAST(cgroup); >>> + >>> + bpf_cgroup_storage_lock(); >>> + sdata = cgroup_storage_lookup(cgroup, map, true); >>> + bpf_cgroup_storage_unlock(); >>> + cgroup_put(cgroup); >>> + return sdata ? sdata->data : NULL; >>> +} >> >> A lot of the above (free/lookup) seems to be copy-pasted from the task >> storage; >> any point in trying to generalize the common parts? 
That is true. Let me think about this. >> >>> +static int bpf_cgroup_storage_update_elem(struct bpf_map *map, void *key, >>> + void *value, u64 map_flags) >>> +{ >>> + struct bpf_local_storage_data *sdata; >>> + struct cgroup *cgroup; >>> + int err, fd; >>> + >>> + fd = *(int *)key; >>> + cgroup = cgroup_get_from_fd(fd); >>> + if (IS_ERR(cgroup)) >>> + return PTR_ERR(cgroup); >>> + >>> + bpf_cgroup_storage_lock(); >>> + sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map >>> *)map, >>> + value, map_flags, GFP_ATOMIC); >>> + bpf_cgroup_storage_unlock(); >>> + err = PTR_ERR_OR_ZERO(sdata); >>> + cgroup_put(cgroup); >>> + return err; >>> +} >>> + [...] >>> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c >>> index 8ad2c267ff47..2fa2c950c7fb 100644 >>> --- a/kernel/cgroup/cgroup.c >>> +++ b/kernel/cgroup/cgroup.c >>> @@ -985,6 +985,10 @@ void put_css_set_locked(struct css_set *cset) >>> put_css_set_locked(cset->dom_cset); >>> } >> >>> +#ifdef CONFIG_BPF_SYSCALL >>> + bpf_local_cgroup_storage_free(cset->dfl_cgrp); >>> +#endif >>> + > > I am confused about this freeing site. It seems like this path is for > freeing css_set's of task_structs, not for freeing the cgroup itself. > Wouldn't we want to free the local storage when we free the cgroup > itself? Somewhere like css_free_rwork_fn()? or did I completely miss > the point here? Thanks for suggestions here. To be honest, I am not sure whether this location is correct or not. I will look at css_free_rwork_fn() which might be a good place. > >>> kfree_rcu(cset, rcu_head); >>> } >> [...] ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs 2022-10-17 20:10 ` Yonghong Song @ 2022-10-17 20:14 ` Yosry Ahmed 2022-10-17 20:29 ` Yonghong Song 0 siblings, 1 reply; 38+ messages in thread From: Yosry Ahmed @ 2022-10-17 20:14 UTC (permalink / raw) To: Yonghong Song Cc: sdf, Yonghong Song, bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo On Mon, Oct 17, 2022 at 1:10 PM Yonghong Song <yhs@meta.com> wrote: > > > > On 10/17/22 11:25 AM, Yosry Ahmed wrote: > > On Mon, Oct 17, 2022 at 11:02 AM <sdf@google.com> wrote: > >> > >> On 10/13, Yonghong Song wrote: > >>> Similar to sk/inode/task storage, implement similar cgroup local storage. > >> > >>> There already exists a local storage implementation for cgroup-attached > >>> bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper > >>> bpf_get_local_storage(). But there are use cases such that non-cgroup > >>> attached bpf progs wants to access cgroup local storage data. For example, > >>> tc egress prog has access to sk and cgroup. It is possible to use > >>> sk local storage to emulate cgroup local storage by storing data in > >>> socket. > >>> But this is a waste as it could be lots of sockets belonging to a > >>> particular > >>> cgroup. Alternatively, a separate map can be created with cgroup id as > >>> the key. > >>> But this will introduce additional overhead to manipulate the new map. > >>> A cgroup local storage, similar to existing sk/inode/task storage, > >>> should help for this use case. > >> > >>> The life-cycle of storage is managed with the life-cycle of the > >>> cgroup struct. i.e. the storage is destroyed along with the owning cgroup > >>> with a callback to the bpf_cgroup_storage_free when cgroup itself > >>> is deleted. > >> > >>> The userspace map operations can be done by using a cgroup fd as a key > >>> passed to the lookup, update and delete operations. > >> > >> > >> [..] 
> >> > >>> Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old cgroup > >>> local > >>> storage support, the new map name BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE is > >>> used > >>> for cgroup storage available to non-cgroup-attached bpf programs. The two > >>> helpers are named as bpf_cgroup_local_storage_get() and > >>> bpf_cgroup_local_storage_delete(). > >> > >> Have you considered doing something similar to 7d9c3427894f ("bpf: Make > >> cgroup storages shared between programs on the same cgroup") where > >> the map changes its behavior depending on the key size (see key_size checks > >> in cgroup_storage_map_alloc)? Looks like sizeof(int) for fd still > >> can be used so we can, in theory, reuse the name.. > >> > >> Pros: > >> - no need for a new map name > >> > >> Cons: > >> - existing BPF_MAP_TYPE_CGROUP_STORAGE is already messy; might be not a > >> good idea to add more stuff to it? > >> > >> But, for the very least, should we also extend > >> Documentation/bpf/map_cgroup_storage.rst to cover the new map? We've > >> tried to keep some of the important details in there.. > > > > This might be a long shot, but is it possible to switch completely to > > this new generic cgroup storage, and for programs that attach to > > cgroups we can still do lookups/allocations during attachment like we > > do today? IOW, maintain the current API for cgroup progs but switch it > > to use this new map type instead. > > Right, cgroup attach/detach should not be impacted by this patch. > > > > > It feels like this map type is more generic and can be a superset of > > the existing cgroup storage, but I feel like I am missing something. > > One difference is old way cgroup local storage allocates the memory > at map creation time, and the new way allocates the memory at runtime > when get/update helper is called. > IIUC the old cgroup local storage allocates memory when a program is attached. We can have the same behavior with the new map type, right? 
When a program is attached to a cgroup, allocate the memory, otherwise it is allocated at run time. Does this make sense? > > > >> > >>> Signed-off-by: Yonghong Song <yhs@fb.com> > >>> --- > >>> include/linux/bpf.h | 3 + > >>> include/linux/bpf_types.h | 1 + > >>> include/linux/cgroup-defs.h | 4 + > >>> include/uapi/linux/bpf.h | 39 +++++ > >>> kernel/bpf/Makefile | 2 +- > >>> kernel/bpf/bpf_cgroup_storage.c | 280 ++++++++++++++++++++++++++++++++ > >>> kernel/bpf/helpers.c | 6 + > >>> kernel/bpf/syscall.c | 3 +- > >>> kernel/bpf/verifier.c | 14 +- > >>> kernel/cgroup/cgroup.c | 4 + > >>> kernel/trace/bpf_trace.c | 4 + > >>> scripts/bpf_doc.py | 2 + > >>> tools/include/uapi/linux/bpf.h | 39 +++++ > >>> 13 files changed, 398 insertions(+), 3 deletions(-) > >>> create mode 100644 kernel/bpf/bpf_cgroup_storage.c > >> > [...] > >>> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile > >>> index 341c94f208f4..b02693f51978 100644 > >>> --- a/kernel/bpf/Makefile > >>> +++ b/kernel/bpf/Makefile > >>> @@ -25,7 +25,7 @@ ifeq ($(CONFIG_PERF_EVENTS),y) > >>> obj-$(CONFIG_BPF_SYSCALL) += stackmap.o > >>> endif > >>> ifeq ($(CONFIG_CGROUPS),y) > >>> -obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o > >>> +obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o bpf_cgroup_storage.o > >>> endif > >>> obj-$(CONFIG_CGROUP_BPF) += cgroup.o > >>> ifeq ($(CONFIG_INET),y) > >>> diff --git a/kernel/bpf/bpf_cgroup_storage.c > >>> b/kernel/bpf/bpf_cgroup_storage.c > >>> new file mode 100644 > >>> index 000000000000..9974784822da > >>> --- /dev/null > >>> +++ b/kernel/bpf/bpf_cgroup_storage.c > >>> @@ -0,0 +1,280 @@ > >>> +// SPDX-License-Identifier: GPL-2.0 > >>> +/* > >>> + * Copyright (c) 2022 Meta Platforms, Inc. and affiliates. 
> >>> + */ > >>> + > >>> +#include <linux/types.h> > >>> +#include <linux/bpf.h> > >>> +#include <linux/bpf_local_storage.h> > >>> +#include <uapi/linux/btf.h> > >>> +#include <linux/btf_ids.h> > >>> + > >>> +DEFINE_BPF_STORAGE_CACHE(cgroup_cache); > >>> + > >>> +static DEFINE_PER_CPU(int, bpf_cgroup_storage_busy); > >>> + > >>> +static void bpf_cgroup_storage_lock(void) > >>> +{ > >>> + migrate_disable(); > >>> + this_cpu_inc(bpf_cgroup_storage_busy); > >>> +} > >>> + > >>> +static void bpf_cgroup_storage_unlock(void) > >>> +{ > >>> + this_cpu_dec(bpf_cgroup_storage_busy); > >>> + migrate_enable(); > >>> +} > >>> + > >>> +static bool bpf_cgroup_storage_trylock(void) > >>> +{ > >>> + migrate_disable(); > >>> + if (unlikely(this_cpu_inc_return(bpf_cgroup_storage_busy) != 1)) { > >>> + this_cpu_dec(bpf_cgroup_storage_busy); > >>> + migrate_enable(); > >>> + return false; > >>> + } > >>> + return true; > >>> +} > >> > >> Task storage has lock/unlock/trylock; inode storage doesn't; why does > >> cgroup need it as well? > > I think so. the new cgroup local storage might be used in fentry/fexit > programs which could cause recursion. > > >> > >>> +static struct bpf_local_storage __rcu **cgroup_storage_ptr(void *owner) > >>> +{ > >>> + struct cgroup *cg = owner; > >>> + > >>> + return &cg->bpf_cgroup_storage; > >>> +} > >>> + > >>> +void bpf_local_cgroup_storage_free(struct cgroup *cgroup) > >>> +{ > >>> + struct bpf_local_storage *local_storage; > >>> + struct bpf_local_storage_elem *selem; > >>> + bool free_cgroup_storage = false; > >>> + struct hlist_node *n; > >>> + unsigned long flags; > >>> + > >>> + rcu_read_lock(); > >>> + local_storage = rcu_dereference(cgroup->bpf_cgroup_storage); > >>> + if (!local_storage) { > >>> + rcu_read_unlock(); > >>> + return; > >>> + } > >>> + > >>> + /* Neither the bpf_prog nor the bpf-map's syscall > >>> + * could be modifying the local_storage->list now. 
> >>> + * Thus, no elem can be added-to or deleted-from the > >>> + * local_storage->list by the bpf_prog or by the bpf-map's syscall. > >>> + * > >>> + * It is racing with bpf_local_storage_map_free() alone > >>> + * when unlinking elem from the local_storage->list and > >>> + * the map's bucket->list. > >>> + */ > >>> + bpf_cgroup_storage_lock(); > >>> + raw_spin_lock_irqsave(&local_storage->lock, flags); > >>> + hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) { > >>> + bpf_selem_unlink_map(selem); > >>> + free_cgroup_storage = > >>> + bpf_selem_unlink_storage_nolock(local_storage, selem, false, false); > >>> + } > >>> + raw_spin_unlock_irqrestore(&local_storage->lock, flags); > >>> + bpf_cgroup_storage_unlock(); > >>> + rcu_read_unlock(); > >>> + > >>> + /* free_cgroup_storage should always be true as long as > >>> + * local_storage->list was non-empty. > >>> + */ > >>> + if (free_cgroup_storage) > >>> + kfree_rcu(local_storage, rcu); > >>> +} > >> > >>> +static struct bpf_local_storage_data * > >>> +cgroup_storage_lookup(struct cgroup *cgroup, struct bpf_map *map, bool > >>> cacheit_lockit) > >>> +{ > >>> + struct bpf_local_storage *cgroup_storage; > >>> + struct bpf_local_storage_map *smap; > >>> + > >>> + cgroup_storage = rcu_dereference_check(cgroup->bpf_cgroup_storage, > >>> + bpf_rcu_lock_held()); > >>> + if (!cgroup_storage) > >>> + return NULL; > >>> + > >>> + smap = (struct bpf_local_storage_map *)map; > >>> + return bpf_local_storage_lookup(cgroup_storage, smap, cacheit_lockit); > >>> +} > >>> + > >>> +static void *bpf_cgroup_storage_lookup_elem(struct bpf_map *map, void > >>> *key) > >>> +{ > >>> + struct bpf_local_storage_data *sdata; > >>> + struct cgroup *cgroup; > >>> + int fd; > >>> + > >>> + fd = *(int *)key; > >>> + cgroup = cgroup_get_from_fd(fd); > >>> + if (IS_ERR(cgroup)) > >>> + return ERR_CAST(cgroup); > >>> + > >>> + bpf_cgroup_storage_lock(); > >>> + sdata = cgroup_storage_lookup(cgroup, map, true); > >>> + 
bpf_cgroup_storage_unlock(); > >>> + cgroup_put(cgroup); > >>> + return sdata ? sdata->data : NULL; > >>> +} > >> > >> A lot of the above (free/lookup) seems to be copy-pasted from the task > >> storage; > >> any point in trying to generalize the common parts? > > That is true. Let me think about this. > > >> > >>> +static int bpf_cgroup_storage_update_elem(struct bpf_map *map, void *key, > >>> + void *value, u64 map_flags) > >>> +{ > >>> + struct bpf_local_storage_data *sdata; > >>> + struct cgroup *cgroup; > >>> + int err, fd; > >>> + > >>> + fd = *(int *)key; > >>> + cgroup = cgroup_get_from_fd(fd); > >>> + if (IS_ERR(cgroup)) > >>> + return PTR_ERR(cgroup); > >>> + > >>> + bpf_cgroup_storage_lock(); > >>> + sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map > >>> *)map, > >>> + value, map_flags, GFP_ATOMIC); > >>> + bpf_cgroup_storage_unlock(); > >>> + err = PTR_ERR_OR_ZERO(sdata); > >>> + cgroup_put(cgroup); > >>> + return err; > >>> +} > >>> + > [...] > >>> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c > >>> index 8ad2c267ff47..2fa2c950c7fb 100644 > >>> --- a/kernel/cgroup/cgroup.c > >>> +++ b/kernel/cgroup/cgroup.c > >>> @@ -985,6 +985,10 @@ void put_css_set_locked(struct css_set *cset) > >>> put_css_set_locked(cset->dom_cset); > >>> } > >> > >>> +#ifdef CONFIG_BPF_SYSCALL > >>> + bpf_local_cgroup_storage_free(cset->dfl_cgrp); > >>> +#endif > >>> + > > > > I am confused about this freeing site. It seems like this path is for > > freeing css_set's of task_structs, not for freeing the cgroup itself. > > Wouldn't we want to free the local storage when we free the cgroup > > itself? Somewhere like css_free_rwork_fn()? or did I completely miss > > the point here? > > Thanks for suggestions here. To be honest, I am not sure whether this > location is correct or not. I will look at css_free_rwork_fn() which > might be a good place. > > > > >>> kfree_rcu(cset, rcu_head); > >>> } > >> > [...] 
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs 2022-10-17 20:14 ` Yosry Ahmed @ 2022-10-17 20:29 ` Yonghong Song 0 siblings, 0 replies; 38+ messages in thread From: Yonghong Song @ 2022-10-17 20:29 UTC (permalink / raw) To: Yosry Ahmed Cc: sdf, Yonghong Song, bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo On 10/17/22 1:14 PM, Yosry Ahmed wrote: > On Mon, Oct 17, 2022 at 1:10 PM Yonghong Song <yhs@meta.com> wrote: >> >> >> >> On 10/17/22 11:25 AM, Yosry Ahmed wrote: >>> On Mon, Oct 17, 2022 at 11:02 AM <sdf@google.com> wrote: >>>> >>>> On 10/13, Yonghong Song wrote: >>>>> Similar to sk/inode/task storage, implement similar cgroup local storage. >>>> >>>>> There already exists a local storage implementation for cgroup-attached >>>>> bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper >>>>> bpf_get_local_storage(). But there are use cases such that non-cgroup >>>>> attached bpf progs wants to access cgroup local storage data. For example, >>>>> tc egress prog has access to sk and cgroup. It is possible to use >>>>> sk local storage to emulate cgroup local storage by storing data in >>>>> socket. >>>>> But this is a waste as it could be lots of sockets belonging to a >>>>> particular >>>>> cgroup. Alternatively, a separate map can be created with cgroup id as >>>>> the key. >>>>> But this will introduce additional overhead to manipulate the new map. >>>>> A cgroup local storage, similar to existing sk/inode/task storage, >>>>> should help for this use case. >>>> >>>>> The life-cycle of storage is managed with the life-cycle of the >>>>> cgroup struct. i.e. the storage is destroyed along with the owning cgroup >>>>> with a callback to the bpf_cgroup_storage_free when cgroup itself >>>>> is deleted. >>>> >>>>> The userspace map operations can be done by using a cgroup fd as a key >>>>> passed to the lookup, update and delete operations. 
>>>> >>>> >>>> [..] >>>> >>>>> Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old cgroup >>>>> local >>>>> storage support, the new map name BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE is >>>>> used >>>>> for cgroup storage available to non-cgroup-attached bpf programs. The two >>>>> helpers are named as bpf_cgroup_local_storage_get() and >>>>> bpf_cgroup_local_storage_delete(). >>>> >>>> Have you considered doing something similar to 7d9c3427894f ("bpf: Make >>>> cgroup storages shared between programs on the same cgroup") where >>>> the map changes its behavior depending on the key size (see key_size checks >>>> in cgroup_storage_map_alloc)? Looks like sizeof(int) for fd still >>>> can be used so we can, in theory, reuse the name.. >>>> >>>> Pros: >>>> - no need for a new map name >>>> >>>> Cons: >>>> - existing BPF_MAP_TYPE_CGROUP_STORAGE is already messy; might be not a >>>> good idea to add more stuff to it? >>>> >>>> But, for the very least, should we also extend >>>> Documentation/bpf/map_cgroup_storage.rst to cover the new map? We've >>>> tried to keep some of the important details in there.. >>> >>> This might be a long shot, but is it possible to switch completely to >>> this new generic cgroup storage, and for programs that attach to >>> cgroups we can still do lookups/allocations during attachment like we >>> do today? IOW, maintain the current API for cgroup progs but switch it >>> to use this new map type instead. >> >> Right, cgroup attach/detach should not be impacted by this patch. >> >>> >>> It feels like this map type is more generic and can be a superset of >>> the existing cgroup storage, but I feel like I am missing something. >> >> One difference is old way cgroup local storage allocates the memory >> at map creation time, and the new way allocates the memory at runtime >> when get/update helper is called. >> > > IIUC the old cgroup local storage allocates memory when a program is > attached. 
Ya, meta data memory is allocated in map creation time but real storage is allocated at attach time. > We can have the same behavior with the new map type, right? > When a program is attached to a cgroup, allocate the memory, otherwise > it is allocated at run time. Does this make sense? I would like to keep the new functionality flexible so that even if a program attaching to a cgroup it can still access other cgroup local storage. > >>> >>>> >>>>> Signed-off-by: Yonghong Song <yhs@fb.com> >>>>> --- >>>>> include/linux/bpf.h | 3 + >>>>> include/linux/bpf_types.h | 1 + >>>>> include/linux/cgroup-defs.h | 4 + >>>>> include/uapi/linux/bpf.h | 39 +++++ >>>>> kernel/bpf/Makefile | 2 +- >>>>> kernel/bpf/bpf_cgroup_storage.c | 280 ++++++++++++++++++++++++++++++++ >>>>> kernel/bpf/helpers.c | 6 + >>>>> kernel/bpf/syscall.c | 3 +- >>>>> kernel/bpf/verifier.c | 14 +- >>>>> kernel/cgroup/cgroup.c | 4 + >>>>> kernel/trace/bpf_trace.c | 4 + >>>>> scripts/bpf_doc.py | 2 + >>>>> tools/include/uapi/linux/bpf.h | 39 +++++ >>>>> 13 files changed, 398 insertions(+), 3 deletions(-) >>>>> create mode 100644 kernel/bpf/bpf_cgroup_storage.c >>>> [...] ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs 2022-10-17 18:01 ` sdf 2022-10-17 18:25 ` Yosry Ahmed @ 2022-10-17 19:23 ` Yonghong Song 2022-10-17 21:03 ` Stanislav Fomichev 2022-10-17 22:26 ` Martin KaFai Lau 2 siblings, 1 reply; 38+ messages in thread From: Yonghong Song @ 2022-10-17 19:23 UTC (permalink / raw) To: sdf, Yonghong Song Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo On 10/17/22 11:01 AM, sdf@google.com wrote: > On 10/13, Yonghong Song wrote: >> Similar to sk/inode/task storage, implement similar cgroup local storage. > >> There already exists a local storage implementation for cgroup-attached >> bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper >> bpf_get_local_storage(). But there are use cases such that non-cgroup >> attached bpf progs wants to access cgroup local storage data. For >> example, >> tc egress prog has access to sk and cgroup. It is possible to use >> sk local storage to emulate cgroup local storage by storing data in >> socket. >> But this is a waste as it could be lots of sockets belonging to a >> particular >> cgroup. Alternatively, a separate map can be created with cgroup id as >> the key. >> But this will introduce additional overhead to manipulate the new map. >> A cgroup local storage, similar to existing sk/inode/task storage, >> should help for this use case. > >> The life-cycle of storage is managed with the life-cycle of the >> cgroup struct. i.e. the storage is destroyed along with the owning >> cgroup >> with a callback to the bpf_cgroup_storage_free when cgroup itself >> is deleted. > >> The userspace map operations can be done by using a cgroup fd as a key >> passed to the lookup, update and delete operations. > > > [..] 
> >> Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old >> cgroup local >> storage support, the new map name BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE is >> used >> for cgroup storage available to non-cgroup-attached bpf programs. The two >> helpers are named as bpf_cgroup_local_storage_get() and >> bpf_cgroup_local_storage_delete(). > > Have you considered doing something similar to 7d9c3427894f ("bpf: Make > cgroup storages shared between programs on the same cgroup") where > the map changes its behavior depending on the key size (see key_size checks > in cgroup_storage_map_alloc)? Looks like sizeof(int) for fd still > can be used so we can, in theory, reuse the name.. > > Pros: > - no need for a new map name > > Cons: > - existing BPF_MAP_TYPE_CGROUP_STORAGE is already messy; might be not a > good idea to add more stuff to it? Thinking differently. I think I would have reuse the same map name (BPF_MAP_TYPE_CGROUP_STORAGE) but with a flag like BPF_F_LOCAL_STORAGE_GENERIC). We could use map_extra as well, but I think an explicit flag might be better. > > But, for the very least, should we also extend > Documentation/bpf/map_cgroup_storage.rst to cover the new map? We've > tried to keep some of the important details in there.. 
> >> Signed-off-by: Yonghong Song <yhs@fb.com> >> --- >> include/linux/bpf.h | 3 + >> include/linux/bpf_types.h | 1 + >> include/linux/cgroup-defs.h | 4 + >> include/uapi/linux/bpf.h | 39 +++++ >> kernel/bpf/Makefile | 2 +- >> kernel/bpf/bpf_cgroup_storage.c | 280 ++++++++++++++++++++++++++++++++ >> kernel/bpf/helpers.c | 6 + >> kernel/bpf/syscall.c | 3 +- >> kernel/bpf/verifier.c | 14 +- >> kernel/cgroup/cgroup.c | 4 + >> kernel/trace/bpf_trace.c | 4 + >> scripts/bpf_doc.py | 2 + >> tools/include/uapi/linux/bpf.h | 39 +++++ >> 13 files changed, 398 insertions(+), 3 deletions(-) >> create mode 100644 kernel/bpf/bpf_cgroup_storage.c > >> diff --git a/include/linux/bpf.h b/include/linux/bpf.h >> index 9e7d46d16032..1395a01c7f18 100644 >> --- a/include/linux/bpf.h >> +++ b/include/linux/bpf.h >> @@ -2045,6 +2045,7 @@ struct bpf_link *bpf_link_by_id(u32 id); > >> const struct bpf_func_proto *bpf_base_func_proto(enum bpf_func_id >> func_id); >> void bpf_task_storage_free(struct task_struct *task); >> +void bpf_local_cgroup_storage_free(struct cgroup *cgroup); >> bool bpf_prog_has_kfunc_call(const struct bpf_prog *prog); >> const struct btf_func_model * >> bpf_jit_find_kfunc_model(const struct bpf_prog *prog, >> @@ -2537,6 +2538,8 @@ extern const struct bpf_func_proto >> bpf_copy_from_user_task_proto; >> extern const struct bpf_func_proto bpf_set_retval_proto; >> extern const struct bpf_func_proto bpf_get_retval_proto; >> extern const struct bpf_func_proto bpf_user_ringbuf_drain_proto; >> +extern const struct bpf_func_proto bpf_cgroup_storage_get_proto; >> +extern const struct bpf_func_proto bpf_cgroup_storage_delete_proto; > >> const struct bpf_func_proto *tracing_prog_func_proto( >> enum bpf_func_id func_id, const struct bpf_prog *prog); >> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h >> index 2c6a4f2562a7..7a0362d7a0aa 100644 >> --- a/include/linux/bpf_types.h >> +++ b/include/linux/bpf_types.h >> @@ -90,6 +90,7 @@ 
BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_ARRAY, >> cgroup_array_map_ops) >> #ifdef CONFIG_CGROUP_BPF >> BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_STORAGE, cgroup_storage_map_ops) >> BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE, >> cgroup_storage_map_ops) >> +BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, >> cgroup_local_storage_map_ops) >> #endif >> BPF_MAP_TYPE(BPF_MAP_TYPE_HASH, htab_map_ops) >> BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_HASH, htab_percpu_map_ops) >> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h >> index 4bcf56b3491c..c6f4590dda68 100644 >> --- a/include/linux/cgroup-defs.h >> +++ b/include/linux/cgroup-defs.h >> @@ -504,6 +504,10 @@ struct cgroup { >> /* Used to store internal freezer state */ >> struct cgroup_freezer_state freezer; > >> +#ifdef CONFIG_BPF_SYSCALL >> + struct bpf_local_storage __rcu *bpf_cgroup_storage; >> +#endif >> + >> /* ids of the ancestors at each level including self */ >> u64 ancestor_ids[]; >> }; >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h >> index 17f61338f8f8..d918b4054297 100644 >> --- a/include/uapi/linux/bpf.h >> +++ b/include/uapi/linux/bpf.h >> @@ -935,6 +935,7 @@ enum bpf_map_type { >> BPF_MAP_TYPE_TASK_STORAGE, >> BPF_MAP_TYPE_BLOOM_FILTER, >> BPF_MAP_TYPE_USER_RINGBUF, >> + BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, >> }; > >> /* Note that tracing related programs such as >> @@ -5435,6 +5436,42 @@ union bpf_attr { >> * **-E2BIG** if user-space has tried to publish a sample >> which is >> * larger than the size of the ring buffer, or which cannot fit >> * within a struct bpf_dynptr. >> + * >> + * void *bpf_cgroup_local_storage_get(struct bpf_map *map, struct >> cgroup *cgroup, void *value, u64 flags) >> + * Description >> + * Get a bpf_local_storage from the *cgroup*. >> + * >> + * Logically, it could be thought of as getting the value from >> + * a *map* with *cgroup* as the **key**. 
From this >> + * perspective, the usage is not much different from >> + * **bpf_map_lookup_elem**\ (*map*, **&**\ *cgroup*) except this >> + * helper enforces the key must be a cgroup struct and the map >> must also >> + * be a **BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE**. >> + * >> + * Underneath, the value is stored locally at *cgroup* instead of >> + * the *map*. The *map* is used as the bpf-local-storage >> + * "type". The bpf-local-storage "type" (i.e. the *map*) is >> + * searched against all bpf_local_storage residing at *cgroup*. >> + * >> + * An optional *flags* (**BPF_LOCAL_STORAGE_GET_F_CREATE**) >> can be >> + * used such that a new bpf_local_storage will be >> + * created if one does not exist. *value* can be used >> + * together with **BPF_LOCAL_STORAGE_GET_F_CREATE** to specify >> + * the initial value of a bpf_local_storage. If *value* is >> + * **NULL**, the new bpf_local_storage will be zero initialized. >> + * Return >> + * A bpf_local_storage pointer is returned on success. >> + * >> + * **NULL** if not found or there was an error in adding >> + * a new bpf_local_storage. >> + * >> + * long bpf_cgroup_local_storage_delete(struct bpf_map *map, struct >> cgroup *cgroup) >> + * Description >> + * Delete a bpf_local_storage from a *cgroup*. >> + * Return >> + * 0 on success. >> + * >> + * **-ENOENT** if the bpf_local_storage cannot be found. >> */ >> #define ___BPF_FUNC_MAPPER(FN, ctx...) 
\ >> FN(unspec, 0, ##ctx) \ >> @@ -5647,6 +5684,8 @@ union bpf_attr { >> FN(tcp_raw_check_syncookie_ipv6, 207, ##ctx) \ >> FN(ktime_get_tai_ns, 208, ##ctx) \ >> FN(user_ringbuf_drain, 209, ##ctx) \ >> + FN(cgroup_local_storage_get, 210, ##ctx) \ >> + FN(cgroup_local_storage_delete, 211, ##ctx) \ >> /* */ > >> /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER >> that don't >> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile >> index 341c94f208f4..b02693f51978 100644 >> --- a/kernel/bpf/Makefile >> +++ b/kernel/bpf/Makefile >> @@ -25,7 +25,7 @@ ifeq ($(CONFIG_PERF_EVENTS),y) >> obj-$(CONFIG_BPF_SYSCALL) += stackmap.o >> endif >> ifeq ($(CONFIG_CGROUPS),y) >> -obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o >> +obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o bpf_cgroup_storage.o >> endif >> obj-$(CONFIG_CGROUP_BPF) += cgroup.o >> ifeq ($(CONFIG_INET),y) >> diff --git a/kernel/bpf/bpf_cgroup_storage.c >> b/kernel/bpf/bpf_cgroup_storage.c >> new file mode 100644 >> index 000000000000..9974784822da >> --- /dev/null >> +++ b/kernel/bpf/bpf_cgroup_storage.c >> @@ -0,0 +1,280 @@ >> +// SPDX-License-Identifier: GPL-2.0 >> +/* >> + * Copyright (c) 2022 Meta Platforms, Inc. and affiliates. 
>> + */ >> + >> +#include <linux/types.h> >> +#include <linux/bpf.h> >> +#include <linux/bpf_local_storage.h> >> +#include <uapi/linux/btf.h> >> +#include <linux/btf_ids.h> >> + >> +DEFINE_BPF_STORAGE_CACHE(cgroup_cache); >> + >> +static DEFINE_PER_CPU(int, bpf_cgroup_storage_busy); >> + >> +static void bpf_cgroup_storage_lock(void) >> +{ >> + migrate_disable(); >> + this_cpu_inc(bpf_cgroup_storage_busy); >> +} >> + >> +static void bpf_cgroup_storage_unlock(void) >> +{ >> + this_cpu_dec(bpf_cgroup_storage_busy); >> + migrate_enable(); >> +} >> + >> +static bool bpf_cgroup_storage_trylock(void) >> +{ >> + migrate_disable(); >> + if (unlikely(this_cpu_inc_return(bpf_cgroup_storage_busy) != 1)) { >> + this_cpu_dec(bpf_cgroup_storage_busy); >> + migrate_enable(); >> + return false; >> + } >> + return true; >> +} > > Task storage has lock/unlock/trylock; inode storage doesn't; why does > cgroup need it as well? > >> +static struct bpf_local_storage __rcu **cgroup_storage_ptr(void *owner) >> +{ >> + struct cgroup *cg = owner; >> + >> + return &cg->bpf_cgroup_storage; >> +} >> + >> +void bpf_local_cgroup_storage_free(struct cgroup *cgroup) >> +{ >> + struct bpf_local_storage *local_storage; >> + struct bpf_local_storage_elem *selem; >> + bool free_cgroup_storage = false; >> + struct hlist_node *n; >> + unsigned long flags; >> + >> + rcu_read_lock(); >> + local_storage = rcu_dereference(cgroup->bpf_cgroup_storage); >> + if (!local_storage) { >> + rcu_read_unlock(); >> + return; >> + } >> + >> + /* Neither the bpf_prog nor the bpf-map's syscall >> + * could be modifying the local_storage->list now. >> + * Thus, no elem can be added-to or deleted-from the >> + * local_storage->list by the bpf_prog or by the bpf-map's syscall. >> + * >> + * It is racing with bpf_local_storage_map_free() alone >> + * when unlinking elem from the local_storage->list and >> + * the map's bucket->list. 
>> + */ >> + bpf_cgroup_storage_lock(); >> + raw_spin_lock_irqsave(&local_storage->lock, flags); >> + hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) { >> + bpf_selem_unlink_map(selem); >> + free_cgroup_storage = >> + bpf_selem_unlink_storage_nolock(local_storage, selem, >> false, false); >> + } >> + raw_spin_unlock_irqrestore(&local_storage->lock, flags); >> + bpf_cgroup_storage_unlock(); >> + rcu_read_unlock(); >> + >> + /* free_cgroup_storage should always be true as long as >> + * local_storage->list was non-empty. >> + */ >> + if (free_cgroup_storage) >> + kfree_rcu(local_storage, rcu); >> +} > >> +static struct bpf_local_storage_data * >> +cgroup_storage_lookup(struct cgroup *cgroup, struct bpf_map *map, >> bool cacheit_lockit) >> +{ >> + struct bpf_local_storage *cgroup_storage; >> + struct bpf_local_storage_map *smap; >> + >> + cgroup_storage = rcu_dereference_check(cgroup->bpf_cgroup_storage, >> + bpf_rcu_lock_held()); >> + if (!cgroup_storage) >> + return NULL; >> + >> + smap = (struct bpf_local_storage_map *)map; >> + return bpf_local_storage_lookup(cgroup_storage, smap, >> cacheit_lockit); >> +} >> + >> +static void *bpf_cgroup_storage_lookup_elem(struct bpf_map *map, void >> *key) >> +{ >> + struct bpf_local_storage_data *sdata; >> + struct cgroup *cgroup; >> + int fd; >> + >> + fd = *(int *)key; >> + cgroup = cgroup_get_from_fd(fd); >> + if (IS_ERR(cgroup)) >> + return ERR_CAST(cgroup); >> + >> + bpf_cgroup_storage_lock(); >> + sdata = cgroup_storage_lookup(cgroup, map, true); >> + bpf_cgroup_storage_unlock(); >> + cgroup_put(cgroup); >> + return sdata ? sdata->data : NULL; >> +} > > A lot of the above (free/lookup) seems to be copy-pasted from the task > storage; > any point in trying to generalize the common parts? 
> >> +static int bpf_cgroup_storage_update_elem(struct bpf_map *map, void >> *key, >> + void *value, u64 map_flags) >> +{ >> + struct bpf_local_storage_data *sdata; >> + struct cgroup *cgroup; >> + int err, fd; >> + >> + fd = *(int *)key; >> + cgroup = cgroup_get_from_fd(fd); >> + if (IS_ERR(cgroup)) >> + return PTR_ERR(cgroup); >> + >> + bpf_cgroup_storage_lock(); >> + sdata = bpf_local_storage_update(cgroup, (struct >> bpf_local_storage_map *)map, >> + value, map_flags, GFP_ATOMIC); >> + bpf_cgroup_storage_unlock(); >> + err = PTR_ERR_OR_ZERO(sdata); >> + cgroup_put(cgroup); >> + return err; >> +} >> + >> +static int cgroup_storage_delete(struct cgroup *cgroup, struct >> bpf_map *map) >> +{ >> + struct bpf_local_storage_data *sdata; >> + >> + sdata = cgroup_storage_lookup(cgroup, map, false); >> + if (!sdata) >> + return -ENOENT; >> + >> + bpf_selem_unlink(SELEM(sdata), true); >> + return 0; >> +} >> + >> +static int bpf_cgroup_storage_delete_elem(struct bpf_map *map, void >> *key) >> +{ >> + struct cgroup *cgroup; >> + int err, fd; >> + >> + fd = *(int *)key; >> + cgroup = cgroup_get_from_fd(fd); >> + if (IS_ERR(cgroup)) >> + return PTR_ERR(cgroup); >> + >> + bpf_cgroup_storage_lock(); >> + err = cgroup_storage_delete(cgroup, map); >> + bpf_cgroup_storage_unlock(); >> + if (err) >> + return err; >> + >> + cgroup_put(cgroup); >> + return 0; >> +} >> + >> +static int notsupp_get_next_key(struct bpf_map *map, void *key, void >> *next_key) >> +{ >> + return -ENOTSUPP; >> +} >> + >> +static struct bpf_map *cgroup_storage_map_alloc(union bpf_attr *attr) >> +{ >> + struct bpf_local_storage_map *smap; >> + >> + smap = bpf_local_storage_map_alloc(attr); >> + if (IS_ERR(smap)) >> + return ERR_CAST(smap); >> + >> + smap->cache_idx = bpf_local_storage_cache_idx_get(&cgroup_cache); >> + return &smap->map; >> +} >> + >> +static void cgroup_storage_map_free(struct bpf_map *map) >> +{ >> + struct bpf_local_storage_map *smap; >> + >> + smap = (struct bpf_local_storage_map *)map; 
>> + bpf_local_storage_cache_idx_free(&cgroup_cache, smap->cache_idx); >> + bpf_local_storage_map_free(smap, NULL); >> +} >> + >> +/* *gfp_flags* is a hidden argument provided by the verifier */ >> +BPF_CALL_5(bpf_cgroup_storage_get, struct bpf_map *, map, struct >> cgroup *, cgroup, >> + void *, value, u64, flags, gfp_t, gfp_flags) >> +{ >> + struct bpf_local_storage_data *sdata; >> + >> + WARN_ON_ONCE(!bpf_rcu_lock_held()); >> + if (flags & ~(BPF_LOCAL_STORAGE_GET_F_CREATE)) >> + return (unsigned long)NULL; >> + >> + if (!cgroup) >> + return (unsigned long)NULL; >> + >> + if (!bpf_cgroup_storage_trylock()) >> + return (unsigned long)NULL; >> + >> + sdata = cgroup_storage_lookup(cgroup, map, true); >> + if (sdata) >> + goto unlock; >> + >> + /* only allocate new storage, when the cgroup is refcounted */ >> + if (!percpu_ref_is_dying(&cgroup->self.refcnt) && >> + (flags & BPF_LOCAL_STORAGE_GET_F_CREATE)) >> + sdata = bpf_local_storage_update(cgroup, (struct >> bpf_local_storage_map *)map, >> + value, BPF_NOEXIST, gfp_flags); >> + >> +unlock: >> + bpf_cgroup_storage_unlock(); >> + return IS_ERR_OR_NULL(sdata) ? 
(unsigned long)NULL : (unsigned >> long)sdata->data; >> +} >> + >> +BPF_CALL_2(bpf_cgroup_storage_delete, struct bpf_map *, map, struct >> cgroup *, cgroup) >> +{ >> + int ret; >> + >> + WARN_ON_ONCE(!bpf_rcu_lock_held()); >> + if (!cgroup) >> + return -EINVAL; >> + >> + if (!bpf_cgroup_storage_trylock()) >> + return -EBUSY; >> + >> + ret = cgroup_storage_delete(cgroup, map); >> + bpf_cgroup_storage_unlock(); >> + return ret; >> +} >> + >> +BTF_ID_LIST_SINGLE(cgroup_storage_map_btf_ids, struct, >> bpf_local_storage_map) >> +const struct bpf_map_ops cgroup_local_storage_map_ops = { >> + .map_meta_equal = bpf_map_meta_equal, >> + .map_alloc_check = bpf_local_storage_map_alloc_check, >> + .map_alloc = cgroup_storage_map_alloc, >> + .map_free = cgroup_storage_map_free, >> + .map_get_next_key = notsupp_get_next_key, >> + .map_lookup_elem = bpf_cgroup_storage_lookup_elem, >> + .map_update_elem = bpf_cgroup_storage_update_elem, >> + .map_delete_elem = bpf_cgroup_storage_delete_elem, >> + .map_check_btf = bpf_local_storage_map_check_btf, >> + .map_btf_id = &cgroup_storage_map_btf_ids[0], >> + .map_owner_storage_ptr = cgroup_storage_ptr, >> +}; >> + >> +const struct bpf_func_proto bpf_cgroup_storage_get_proto = { >> + .func = bpf_cgroup_storage_get, >> + .gpl_only = false, >> + .ret_type = RET_PTR_TO_MAP_VALUE_OR_NULL, >> + .arg1_type = ARG_CONST_MAP_PTR, >> + .arg2_type = ARG_PTR_TO_BTF_ID, >> + .arg2_btf_id = &bpf_cgroup_btf_id[0], >> + .arg3_type = ARG_PTR_TO_MAP_VALUE_OR_NULL, >> + .arg4_type = ARG_ANYTHING, >> +}; >> + >> +const struct bpf_func_proto bpf_cgroup_storage_delete_proto = { >> + .func = bpf_cgroup_storage_delete, >> + .gpl_only = false, >> + .ret_type = RET_INTEGER, >> + .arg1_type = ARG_CONST_MAP_PTR, >> + .arg2_type = ARG_PTR_TO_BTF_ID, >> + .arg2_btf_id = &bpf_cgroup_btf_id[0], >> +}; >> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c >> index a6b04faed282..5c5bb08832ec 100644 >> --- a/kernel/bpf/helpers.c >> +++ b/kernel/bpf/helpers.c >> @@ 
-1663,6 +1663,12 @@ bpf_base_func_proto(enum bpf_func_id func_id) >> return &bpf_dynptr_write_proto; >> case BPF_FUNC_dynptr_data: >> return &bpf_dynptr_data_proto; >> +#ifdef CONFIG_CGROUPS >> + case BPF_FUNC_cgroup_local_storage_get: >> + return &bpf_cgroup_storage_get_proto; >> + case BPF_FUNC_cgroup_local_storage_delete: >> + return &bpf_cgroup_storage_delete_proto; >> +#endif >> default: >> break; >> } >> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c >> index 7b373a5e861f..e53c7fae6e22 100644 >> --- a/kernel/bpf/syscall.c >> +++ b/kernel/bpf/syscall.c >> @@ -1016,7 +1016,8 @@ static int map_check_btf(struct bpf_map *map, >> const struct btf *btf, >> map->map_type != BPF_MAP_TYPE_CGROUP_STORAGE && >> map->map_type != BPF_MAP_TYPE_SK_STORAGE && >> map->map_type != BPF_MAP_TYPE_INODE_STORAGE && >> - map->map_type != BPF_MAP_TYPE_TASK_STORAGE) >> + map->map_type != BPF_MAP_TYPE_TASK_STORAGE && >> + map->map_type != BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE) >> return -ENOTSUPP; >> if (map->spin_lock_off + sizeof(struct bpf_spin_lock) > >> map->value_size) { >> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c >> index 6f6d2d511c06..f36f6a3c0d50 100644 >> --- a/kernel/bpf/verifier.c >> +++ b/kernel/bpf/verifier.c >> @@ -6360,6 +6360,11 @@ static int check_map_func_compatibility(struct >> bpf_verifier_env *env, >> func_id != BPF_FUNC_task_storage_delete) >> goto error; >> break; >> + case BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE: >> + if (func_id != BPF_FUNC_cgroup_local_storage_get && >> + func_id != BPF_FUNC_cgroup_local_storage_delete) >> + goto error; >> + break; >> case BPF_MAP_TYPE_BLOOM_FILTER: >> if (func_id != BPF_FUNC_map_peek_elem && >> func_id != BPF_FUNC_map_push_elem) >> @@ -6472,6 +6477,11 @@ static int check_map_func_compatibility(struct >> bpf_verifier_env *env, >> if (map->map_type != BPF_MAP_TYPE_TASK_STORAGE) >> goto error; >> break; >> + case BPF_FUNC_cgroup_local_storage_get: >> + case BPF_FUNC_cgroup_local_storage_delete: >> + if 
(map->map_type != BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE) >> + goto error; >> + break; >> default: >> break; >> } >> @@ -12713,6 +12723,7 @@ static int check_map_prog_compatibility(struct >> bpf_verifier_env *env, >> case BPF_MAP_TYPE_INODE_STORAGE: >> case BPF_MAP_TYPE_SK_STORAGE: >> case BPF_MAP_TYPE_TASK_STORAGE: >> + case BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE: >> break; >> default: >> verbose(env, >> @@ -14149,7 +14160,8 @@ static int do_misc_fixups(struct >> bpf_verifier_env *env) > >> if (insn->imm == BPF_FUNC_task_storage_get || >> insn->imm == BPF_FUNC_sk_storage_get || >> - insn->imm == BPF_FUNC_inode_storage_get) { >> + insn->imm == BPF_FUNC_inode_storage_get || >> + insn->imm == BPF_FUNC_cgroup_local_storage_get) { >> if (env->prog->aux->sleepable) >> insn_buf[0] = BPF_MOV64_IMM(BPF_REG_5, (__force >> __s32)GFP_KERNEL); >> else >> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c >> index 8ad2c267ff47..2fa2c950c7fb 100644 >> --- a/kernel/cgroup/cgroup.c >> +++ b/kernel/cgroup/cgroup.c >> @@ -985,6 +985,10 @@ void put_css_set_locked(struct css_set *cset) >> put_css_set_locked(cset->dom_cset); >> } > >> +#ifdef CONFIG_BPF_SYSCALL >> + bpf_local_cgroup_storage_free(cset->dfl_cgrp); >> +#endif >> + >> kfree_rcu(cset, rcu_head); >> } > >> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c >> index 688552df95ca..179adaae4a9f 100644 >> --- a/kernel/trace/bpf_trace.c >> +++ b/kernel/trace/bpf_trace.c >> @@ -1454,6 +1454,10 @@ bpf_tracing_func_proto(enum bpf_func_id >> func_id, const struct bpf_prog *prog) >> return &bpf_get_current_cgroup_id_proto; >> case BPF_FUNC_get_current_ancestor_cgroup_id: >> return &bpf_get_current_ancestor_cgroup_id_proto; >> + case BPF_FUNC_cgroup_local_storage_get: >> + return &bpf_cgroup_storage_get_proto; >> + case BPF_FUNC_cgroup_local_storage_delete: >> + return &bpf_cgroup_storage_delete_proto; >> #endif >> case BPF_FUNC_send_signal: >> return &bpf_send_signal_proto; >> diff --git a/scripts/bpf_doc.py 
b/scripts/bpf_doc.py >> index c0e6690be82a..fdb0aff8cb5a 100755 >> --- a/scripts/bpf_doc.py >> +++ b/scripts/bpf_doc.py >> @@ -685,6 +685,7 @@ class PrinterHelpers(Printer): >> 'struct udp6_sock', >> 'struct unix_sock', >> 'struct task_struct', >> + 'struct cgroup', > >> 'struct __sk_buff', >> 'struct sk_msg_md', >> @@ -742,6 +743,7 @@ class PrinterHelpers(Printer): >> 'struct udp6_sock', >> 'struct unix_sock', >> 'struct task_struct', >> + 'struct cgroup', >> 'struct path', >> 'struct btf_ptr', >> 'struct inode', >> diff --git a/tools/include/uapi/linux/bpf.h >> b/tools/include/uapi/linux/bpf.h >> index 17f61338f8f8..d918b4054297 100644 >> --- a/tools/include/uapi/linux/bpf.h >> +++ b/tools/include/uapi/linux/bpf.h >> @@ -935,6 +935,7 @@ enum bpf_map_type { >> BPF_MAP_TYPE_TASK_STORAGE, >> BPF_MAP_TYPE_BLOOM_FILTER, >> BPF_MAP_TYPE_USER_RINGBUF, >> + BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, >> }; > >> /* Note that tracing related programs such as >> @@ -5435,6 +5436,42 @@ union bpf_attr { >> * **-E2BIG** if user-space has tried to publish a sample >> which is >> * larger than the size of the ring buffer, or which cannot fit >> * within a struct bpf_dynptr. >> + * >> + * void *bpf_cgroup_local_storage_get(struct bpf_map *map, struct >> cgroup *cgroup, void *value, u64 flags) >> + * Description >> + * Get a bpf_local_storage from the *cgroup*. >> + * >> + * Logically, it could be thought of as getting the value from >> + * a *map* with *cgroup* as the **key**. From this >> + * perspective, the usage is not much different from >> + * **bpf_map_lookup_elem**\ (*map*, **&**\ *cgroup*) except this >> + * helper enforces the key must be a cgroup struct and the map >> must also >> + * be a **BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE**. >> + * >> + * Underneath, the value is stored locally at *cgroup* instead of >> + * the *map*. The *map* is used as the bpf-local-storage >> + * "type". The bpf-local-storage "type" (i.e. 
the *map*) is >> + * searched against all bpf_local_storage residing at *cgroup*. >> + * >> + * An optional *flags* (**BPF_LOCAL_STORAGE_GET_F_CREATE**) >> can be >> + * used such that a new bpf_local_storage will be >> + * created if one does not exist. *value* can be used >> + * together with **BPF_LOCAL_STORAGE_GET_F_CREATE** to specify >> + * the initial value of a bpf_local_storage. If *value* is >> + * **NULL**, the new bpf_local_storage will be zero initialized. >> + * Return >> + * A bpf_local_storage pointer is returned on success. >> + * >> + * **NULL** if not found or there was an error in adding >> + * a new bpf_local_storage. >> + * >> + * long bpf_cgroup_local_storage_delete(struct bpf_map *map, struct >> cgroup *cgroup) >> + * Description >> + * Delete a bpf_local_storage from a *cgroup*. >> + * Return >> + * 0 on success. >> + * >> + * **-ENOENT** if the bpf_local_storage cannot be found. >> */ >> #define ___BPF_FUNC_MAPPER(FN, ctx...) \ >> FN(unspec, 0, ##ctx) \ >> @@ -5647,6 +5684,8 @@ union bpf_attr { >> FN(tcp_raw_check_syncookie_ipv6, 207, ##ctx) \ >> FN(ktime_get_tai_ns, 208, ##ctx) \ >> FN(user_ringbuf_drain, 209, ##ctx) \ >> + FN(cgroup_local_storage_get, 210, ##ctx) \ >> + FN(cgroup_local_storage_delete, 211, ##ctx) \ >> /* */ > >> /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER >> that don't >> -- >> 2.30.2 > ^ permalink raw reply [flat|nested] 38+ messages in thread
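(For context on how the two new helpers documented above would be used from a program, here is a minimal BPF-side sketch pieced together from the uapi comments in this patch. The attach point, field names and map name are illustrative only, and the map type and helpers exist only with this series applied:)

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char _license[] SEC("license") = "GPL";

struct cgrp_val {
	__u64 cnt;
};

/* Proposed map type from patch 2: one storage per cgroup, found via
 * the map, like the existing sk/inode/task local storage maps.
 */
struct {
	__uint(type, BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__type(key, int);
	__type(value, struct cgrp_val);
} cgrp_map SEC(".maps");

SEC("tp_btf/sys_enter")
int BPF_PROG(count_syscalls)
{
	struct task_struct *task = bpf_get_current_task_btf();
	struct cgrp_val *v;

	/* Get (or create, per F_CREATE) this cgroup's storage for cgrp_map;
	 * NULL value means the new storage is zero-initialized.
	 */
	v = bpf_cgroup_local_storage_get(&cgrp_map, task->cgroups->dfl_cgrp,
					 NULL, BPF_LOCAL_STORAGE_GET_F_CREATE);
	if (v)
		__sync_fetch_and_add(&v->cnt, 1);
	return 0;
}
```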
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs 2022-10-17 19:23 ` Yonghong Song @ 2022-10-17 21:03 ` Stanislav Fomichev 0 siblings, 0 replies; 38+ messages in thread From: Stanislav Fomichev @ 2022-10-17 21:03 UTC (permalink / raw) To: Yonghong Song Cc: Yonghong Song, bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo On Mon, Oct 17, 2022 at 12:25 PM Yonghong Song <yhs@meta.com> wrote: > > > > On 10/17/22 11:01 AM, sdf@google.com wrote: > > On 10/13, Yonghong Song wrote: > >> Similar to sk/inode/task storage, implement similar cgroup local storage. > > > >> There already exists a local storage implementation for cgroup-attached > >> bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper > >> bpf_get_local_storage(). But there are use cases such that non-cgroup > >> attached bpf progs wants to access cgroup local storage data. For > >> example, > >> tc egress prog has access to sk and cgroup. It is possible to use > >> sk local storage to emulate cgroup local storage by storing data in > >> socket. > >> But this is a waste as it could be lots of sockets belonging to a > >> particular > >> cgroup. Alternatively, a separate map can be created with cgroup id as > >> the key. > >> But this will introduce additional overhead to manipulate the new map. > >> A cgroup local storage, similar to existing sk/inode/task storage, > >> should help for this use case. > > > >> The life-cycle of storage is managed with the life-cycle of the > >> cgroup struct. i.e. the storage is destroyed along with the owning > >> cgroup > >> with a callback to the bpf_cgroup_storage_free when cgroup itself > >> is deleted. > > > >> The userspace map operations can be done by using a cgroup fd as a key > >> passed to the lookup, update and delete operations. > > > > > > [..] 
> > > >> Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old > >> cgroup local > >> storage support, the new map name BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE is > >> used > >> for cgroup storage available to non-cgroup-attached bpf programs. The two > >> helpers are named as bpf_cgroup_local_storage_get() and > >> bpf_cgroup_local_storage_delete(). > > > > Have you considered doing something similar to 7d9c3427894f ("bpf: Make > > cgroup storages shared between programs on the same cgroup") where > > the map changes its behavior depending on the key size (see key_size checks > > in cgroup_storage_map_alloc)? Looks like sizeof(int) for fd still > > can be used so we can, in theory, reuse the name.. > > > > Pros: > > - no need for a new map name > > > > Cons: > > - existing BPF_MAP_TYPE_CGROUP_STORAGE is already messy; might be not a > > good idea to add more stuff to it? > > Thinking differently. I think I would have reuse the same map name > (BPF_MAP_TYPE_CGROUP_STORAGE) but with a flag like > BPF_F_LOCAL_STORAGE_GENERIC). > > We could use map_extra as well, but I think an explicit flag might be > better. Ack, flag and map_extra might work as well. They are more explicit, which is good/bad depending on who you talk to. I was assuming that we can just support the following: struct { __uint(type, BPF_MAP_TYPE_CGROUP_STORAGE); __type(key, int); __type(value, xxx); } ...; and depend on key_size == sizeof(int), but up to you; just trying to understand whether it makes sense to share the name or not. Sharing the helper probably not worth it given the special treatment? Or maybe it can be a shortcut to "lookup this map with my cgroup"? > > > > But, for the very least, should we also extend > > Documentation/bpf/map_cgroup_storage.rst to cover the new map? We've > > tried to keep some of the important details in there.. 
> > > >> Signed-off-by: Yonghong Song <yhs@fb.com> > >> --- > >> include/linux/bpf.h | 3 + > >> include/linux/bpf_types.h | 1 + > >> include/linux/cgroup-defs.h | 4 + > >> include/uapi/linux/bpf.h | 39 +++++ > >> kernel/bpf/Makefile | 2 +- > >> kernel/bpf/bpf_cgroup_storage.c | 280 ++++++++++++++++++++++++++++++++ > >> kernel/bpf/helpers.c | 6 + > >> kernel/bpf/syscall.c | 3 +- > >> kernel/bpf/verifier.c | 14 +- > >> kernel/cgroup/cgroup.c | 4 + > >> kernel/trace/bpf_trace.c | 4 + > >> scripts/bpf_doc.py | 2 + > >> tools/include/uapi/linux/bpf.h | 39 +++++ > >> 13 files changed, 398 insertions(+), 3 deletions(-) > >> create mode 100644 kernel/bpf/bpf_cgroup_storage.c > > > >> diff --git a/include/linux/bpf.h b/include/linux/bpf.h > >> index 9e7d46d16032..1395a01c7f18 100644 > >> --- a/include/linux/bpf.h > >> +++ b/include/linux/bpf.h > >> @@ -2045,6 +2045,7 @@ struct bpf_link *bpf_link_by_id(u32 id); > > > >> const struct bpf_func_proto *bpf_base_func_proto(enum bpf_func_id > >> func_id); > >> void bpf_task_storage_free(struct task_struct *task); > >> +void bpf_local_cgroup_storage_free(struct cgroup *cgroup); > >> bool bpf_prog_has_kfunc_call(const struct bpf_prog *prog); > >> const struct btf_func_model * > >> bpf_jit_find_kfunc_model(const struct bpf_prog *prog, > >> @@ -2537,6 +2538,8 @@ extern const struct bpf_func_proto > >> bpf_copy_from_user_task_proto; > >> extern const struct bpf_func_proto bpf_set_retval_proto; > >> extern const struct bpf_func_proto bpf_get_retval_proto; > >> extern const struct bpf_func_proto bpf_user_ringbuf_drain_proto; > >> +extern const struct bpf_func_proto bpf_cgroup_storage_get_proto; > >> +extern const struct bpf_func_proto bpf_cgroup_storage_delete_proto; > > > >> const struct bpf_func_proto *tracing_prog_func_proto( > >> enum bpf_func_id func_id, const struct bpf_prog *prog); > >> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h > >> index 2c6a4f2562a7..7a0362d7a0aa 100644 > >> --- 
a/include/linux/bpf_types.h > >> +++ b/include/linux/bpf_types.h > >> @@ -90,6 +90,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_ARRAY, > >> cgroup_array_map_ops) > >> #ifdef CONFIG_CGROUP_BPF > >> BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_STORAGE, cgroup_storage_map_ops) > >> BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE, > >> cgroup_storage_map_ops) > >> +BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, > >> cgroup_local_storage_map_ops) > >> #endif > >> BPF_MAP_TYPE(BPF_MAP_TYPE_HASH, htab_map_ops) > >> BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_HASH, htab_percpu_map_ops) > >> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h > >> index 4bcf56b3491c..c6f4590dda68 100644 > >> --- a/include/linux/cgroup-defs.h > >> +++ b/include/linux/cgroup-defs.h > >> @@ -504,6 +504,10 @@ struct cgroup { > >> /* Used to store internal freezer state */ > >> struct cgroup_freezer_state freezer; > > > >> +#ifdef CONFIG_BPF_SYSCALL > >> + struct bpf_local_storage __rcu *bpf_cgroup_storage; > >> +#endif > >> + > >> /* ids of the ancestors at each level including self */ > >> u64 ancestor_ids[]; > >> }; > >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > >> index 17f61338f8f8..d918b4054297 100644 > >> --- a/include/uapi/linux/bpf.h > >> +++ b/include/uapi/linux/bpf.h > >> @@ -935,6 +935,7 @@ enum bpf_map_type { > >> BPF_MAP_TYPE_TASK_STORAGE, > >> BPF_MAP_TYPE_BLOOM_FILTER, > >> BPF_MAP_TYPE_USER_RINGBUF, > >> + BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, > >> }; > > > >> /* Note that tracing related programs such as > >> @@ -5435,6 +5436,42 @@ union bpf_attr { > >> * **-E2BIG** if user-space has tried to publish a sample > >> which is > >> * larger than the size of the ring buffer, or which cannot fit > >> * within a struct bpf_dynptr. > >> + * > >> + * void *bpf_cgroup_local_storage_get(struct bpf_map *map, struct > >> cgroup *cgroup, void *value, u64 flags) > >> + * Description > >> + * Get a bpf_local_storage from the *cgroup*. 
> >> + * > >> + * Logically, it could be thought of as getting the value from > >> + * a *map* with *cgroup* as the **key**. From this > >> + * perspective, the usage is not much different from > >> + * **bpf_map_lookup_elem**\ (*map*, **&**\ *cgroup*) except this > >> + * helper enforces the key must be a cgroup struct and the map > >> must also > >> + * be a **BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE**. > >> + * > >> + * Underneath, the value is stored locally at *cgroup* instead of > >> + * the *map*. The *map* is used as the bpf-local-storage > >> + * "type". The bpf-local-storage "type" (i.e. the *map*) is > >> + * searched against all bpf_local_storage residing at *cgroup*. > >> + * > >> + * An optional *flags* (**BPF_LOCAL_STORAGE_GET_F_CREATE**) > >> can be > >> + * used such that a new bpf_local_storage will be > >> + * created if one does not exist. *value* can be used > >> + * together with **BPF_LOCAL_STORAGE_GET_F_CREATE** to specify > >> + * the initial value of a bpf_local_storage. If *value* is > >> + * **NULL**, the new bpf_local_storage will be zero initialized. > >> + * Return > >> + * A bpf_local_storage pointer is returned on success. > >> + * > >> + * **NULL** if not found or there was an error in adding > >> + * a new bpf_local_storage. > >> + * > >> + * long bpf_cgroup_local_storage_delete(struct bpf_map *map, struct > >> cgroup *cgroup) > >> + * Description > >> + * Delete a bpf_local_storage from a *cgroup*. > >> + * Return > >> + * 0 on success. > >> + * > >> + * **-ENOENT** if the bpf_local_storage cannot be found. > >> */ > >> #define ___BPF_FUNC_MAPPER(FN, ctx...) 
\ > >> FN(unspec, 0, ##ctx) \ > >> @@ -5647,6 +5684,8 @@ union bpf_attr { > >> FN(tcp_raw_check_syncookie_ipv6, 207, ##ctx) \ > >> FN(ktime_get_tai_ns, 208, ##ctx) \ > >> FN(user_ringbuf_drain, 209, ##ctx) \ > >> + FN(cgroup_local_storage_get, 210, ##ctx) \ > >> + FN(cgroup_local_storage_delete, 211, ##ctx) \ > >> /* */ > > > >> /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER > >> that don't > >> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile > >> index 341c94f208f4..b02693f51978 100644 > >> --- a/kernel/bpf/Makefile > >> +++ b/kernel/bpf/Makefile > >> @@ -25,7 +25,7 @@ ifeq ($(CONFIG_PERF_EVENTS),y) > >> obj-$(CONFIG_BPF_SYSCALL) += stackmap.o > >> endif > >> ifeq ($(CONFIG_CGROUPS),y) > >> -obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o > >> +obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o bpf_cgroup_storage.o > >> endif > >> obj-$(CONFIG_CGROUP_BPF) += cgroup.o > >> ifeq ($(CONFIG_INET),y) > >> diff --git a/kernel/bpf/bpf_cgroup_storage.c > >> b/kernel/bpf/bpf_cgroup_storage.c > >> new file mode 100644 > >> index 000000000000..9974784822da > >> --- /dev/null > >> +++ b/kernel/bpf/bpf_cgroup_storage.c > >> @@ -0,0 +1,280 @@ > >> +// SPDX-License-Identifier: GPL-2.0 > >> +/* > >> + * Copyright (c) 2022 Meta Platforms, Inc. and affiliates. 
> >> + */ > >> + > >> +#include <linux/types.h> > >> +#include <linux/bpf.h> > >> +#include <linux/bpf_local_storage.h> > >> +#include <uapi/linux/btf.h> > >> +#include <linux/btf_ids.h> > >> + > >> +DEFINE_BPF_STORAGE_CACHE(cgroup_cache); > >> + > >> +static DEFINE_PER_CPU(int, bpf_cgroup_storage_busy); > >> + > >> +static void bpf_cgroup_storage_lock(void) > >> +{ > >> + migrate_disable(); > >> + this_cpu_inc(bpf_cgroup_storage_busy); > >> +} > >> + > >> +static void bpf_cgroup_storage_unlock(void) > >> +{ > >> + this_cpu_dec(bpf_cgroup_storage_busy); > >> + migrate_enable(); > >> +} > >> + > >> +static bool bpf_cgroup_storage_trylock(void) > >> +{ > >> + migrate_disable(); > >> + if (unlikely(this_cpu_inc_return(bpf_cgroup_storage_busy) != 1)) { > >> + this_cpu_dec(bpf_cgroup_storage_busy); > >> + migrate_enable(); > >> + return false; > >> + } > >> + return true; > >> +} > > > > Task storage has lock/unlock/trylock; inode storage doesn't; why does > > cgroup need it as well? > > > >> +static struct bpf_local_storage __rcu **cgroup_storage_ptr(void *owner) > >> +{ > >> + struct cgroup *cg = owner; > >> + > >> + return &cg->bpf_cgroup_storage; > >> +} > >> + > >> +void bpf_local_cgroup_storage_free(struct cgroup *cgroup) > >> +{ > >> + struct bpf_local_storage *local_storage; > >> + struct bpf_local_storage_elem *selem; > >> + bool free_cgroup_storage = false; > >> + struct hlist_node *n; > >> + unsigned long flags; > >> + > >> + rcu_read_lock(); > >> + local_storage = rcu_dereference(cgroup->bpf_cgroup_storage); > >> + if (!local_storage) { > >> + rcu_read_unlock(); > >> + return; > >> + } > >> + > >> + /* Neither the bpf_prog nor the bpf-map's syscall > >> + * could be modifying the local_storage->list now. > >> + * Thus, no elem can be added-to or deleted-from the > >> + * local_storage->list by the bpf_prog or by the bpf-map's syscall. 
> >> + * > >> + * It is racing with bpf_local_storage_map_free() alone > >> + * when unlinking elem from the local_storage->list and > >> + * the map's bucket->list. > >> + */ > >> + bpf_cgroup_storage_lock(); > >> + raw_spin_lock_irqsave(&local_storage->lock, flags); > >> + hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) { > >> + bpf_selem_unlink_map(selem); > >> + free_cgroup_storage = > >> + bpf_selem_unlink_storage_nolock(local_storage, selem, > >> false, false); > >> + } > >> + raw_spin_unlock_irqrestore(&local_storage->lock, flags); > >> + bpf_cgroup_storage_unlock(); > >> + rcu_read_unlock(); > >> + > >> + /* free_cgroup_storage should always be true as long as > >> + * local_storage->list was non-empty. > >> + */ > >> + if (free_cgroup_storage) > >> + kfree_rcu(local_storage, rcu); > >> +} > > > >> +static struct bpf_local_storage_data * > >> +cgroup_storage_lookup(struct cgroup *cgroup, struct bpf_map *map, > >> bool cacheit_lockit) > >> +{ > >> + struct bpf_local_storage *cgroup_storage; > >> + struct bpf_local_storage_map *smap; > >> + > >> + cgroup_storage = rcu_dereference_check(cgroup->bpf_cgroup_storage, > >> + bpf_rcu_lock_held()); > >> + if (!cgroup_storage) > >> + return NULL; > >> + > >> + smap = (struct bpf_local_storage_map *)map; > >> + return bpf_local_storage_lookup(cgroup_storage, smap, > >> cacheit_lockit); > >> +} > >> + > >> +static void *bpf_cgroup_storage_lookup_elem(struct bpf_map *map, void > >> *key) > >> +{ > >> + struct bpf_local_storage_data *sdata; > >> + struct cgroup *cgroup; > >> + int fd; > >> + > >> + fd = *(int *)key; > >> + cgroup = cgroup_get_from_fd(fd); > >> + if (IS_ERR(cgroup)) > >> + return ERR_CAST(cgroup); > >> + > >> + bpf_cgroup_storage_lock(); > >> + sdata = cgroup_storage_lookup(cgroup, map, true); > >> + bpf_cgroup_storage_unlock(); > >> + cgroup_put(cgroup); > >> + return sdata ? 
sdata->data : NULL; > >> +} > > > > A lot of the above (free/lookup) seems to be copy-pasted from the task > > storage; > > any point in trying to generalize the common parts? > > > >> +static int bpf_cgroup_storage_update_elem(struct bpf_map *map, void > >> *key, > >> + void *value, u64 map_flags) > >> +{ > >> + struct bpf_local_storage_data *sdata; > >> + struct cgroup *cgroup; > >> + int err, fd; > >> + > >> + fd = *(int *)key; > >> + cgroup = cgroup_get_from_fd(fd); > >> + if (IS_ERR(cgroup)) > >> + return PTR_ERR(cgroup); > >> + > >> + bpf_cgroup_storage_lock(); > >> + sdata = bpf_local_storage_update(cgroup, (struct > >> bpf_local_storage_map *)map, > >> + value, map_flags, GFP_ATOMIC); > >> + bpf_cgroup_storage_unlock(); > >> + err = PTR_ERR_OR_ZERO(sdata); > >> + cgroup_put(cgroup); > >> + return err; > >> +} > >> + > >> +static int cgroup_storage_delete(struct cgroup *cgroup, struct > >> bpf_map *map) > >> +{ > >> + struct bpf_local_storage_data *sdata; > >> + > >> + sdata = cgroup_storage_lookup(cgroup, map, false); > >> + if (!sdata) > >> + return -ENOENT; > >> + > >> + bpf_selem_unlink(SELEM(sdata), true); > >> + return 0; > >> +} > >> + > >> +static int bpf_cgroup_storage_delete_elem(struct bpf_map *map, void > >> *key) > >> +{ > >> + struct cgroup *cgroup; > >> + int err, fd; > >> + > >> + fd = *(int *)key; > >> + cgroup = cgroup_get_from_fd(fd); > >> + if (IS_ERR(cgroup)) > >> + return PTR_ERR(cgroup); > >> + > >> + bpf_cgroup_storage_lock(); > >> + err = cgroup_storage_delete(cgroup, map); > >> + bpf_cgroup_storage_unlock(); > >> + if (err) > >> + return err; > >> + > >> + cgroup_put(cgroup); > >> + return 0; > >> +} > >> + > >> +static int notsupp_get_next_key(struct bpf_map *map, void *key, void > >> *next_key) > >> +{ > >> + return -ENOTSUPP; > >> +} > >> + > >> +static struct bpf_map *cgroup_storage_map_alloc(union bpf_attr *attr) > >> +{ > >> + struct bpf_local_storage_map *smap; > >> + > >> + smap = bpf_local_storage_map_alloc(attr); > >> + if 
(IS_ERR(smap)) > >> + return ERR_CAST(smap); > >> + > >> + smap->cache_idx = bpf_local_storage_cache_idx_get(&cgroup_cache); > >> + return &smap->map; > >> +} > >> + > >> +static void cgroup_storage_map_free(struct bpf_map *map) > >> +{ > >> + struct bpf_local_storage_map *smap; > >> + > >> + smap = (struct bpf_local_storage_map *)map; > >> + bpf_local_storage_cache_idx_free(&cgroup_cache, smap->cache_idx); > >> + bpf_local_storage_map_free(smap, NULL); > >> +} > >> + > >> +/* *gfp_flags* is a hidden argument provided by the verifier */ > >> +BPF_CALL_5(bpf_cgroup_storage_get, struct bpf_map *, map, struct > >> cgroup *, cgroup, > >> + void *, value, u64, flags, gfp_t, gfp_flags) > >> +{ > >> + struct bpf_local_storage_data *sdata; > >> + > >> + WARN_ON_ONCE(!bpf_rcu_lock_held()); > >> + if (flags & ~(BPF_LOCAL_STORAGE_GET_F_CREATE)) > >> + return (unsigned long)NULL; > >> + > >> + if (!cgroup) > >> + return (unsigned long)NULL; > >> + > >> + if (!bpf_cgroup_storage_trylock()) > >> + return (unsigned long)NULL; > >> + > >> + sdata = cgroup_storage_lookup(cgroup, map, true); > >> + if (sdata) > >> + goto unlock; > >> + > >> + /* only allocate new storage, when the cgroup is refcounted */ > >> + if (!percpu_ref_is_dying(&cgroup->self.refcnt) && > >> + (flags & BPF_LOCAL_STORAGE_GET_F_CREATE)) > >> + sdata = bpf_local_storage_update(cgroup, (struct > >> bpf_local_storage_map *)map, > >> + value, BPF_NOEXIST, gfp_flags); > >> + > >> +unlock: > >> + bpf_cgroup_storage_unlock(); > >> + return IS_ERR_OR_NULL(sdata) ? 
(unsigned long)NULL : (unsigned > >> long)sdata->data; > >> +} > >> + > >> +BPF_CALL_2(bpf_cgroup_storage_delete, struct bpf_map *, map, struct > >> cgroup *, cgroup) > >> +{ > >> + int ret; > >> + > >> + WARN_ON_ONCE(!bpf_rcu_lock_held()); > >> + if (!cgroup) > >> + return -EINVAL; > >> + > >> + if (!bpf_cgroup_storage_trylock()) > >> + return -EBUSY; > >> + > >> + ret = cgroup_storage_delete(cgroup, map); > >> + bpf_cgroup_storage_unlock(); > >> + return ret; > >> +} > >> + > >> +BTF_ID_LIST_SINGLE(cgroup_storage_map_btf_ids, struct, > >> bpf_local_storage_map) > >> +const struct bpf_map_ops cgroup_local_storage_map_ops = { > >> + .map_meta_equal = bpf_map_meta_equal, > >> + .map_alloc_check = bpf_local_storage_map_alloc_check, > >> + .map_alloc = cgroup_storage_map_alloc, > >> + .map_free = cgroup_storage_map_free, > >> + .map_get_next_key = notsupp_get_next_key, > >> + .map_lookup_elem = bpf_cgroup_storage_lookup_elem, > >> + .map_update_elem = bpf_cgroup_storage_update_elem, > >> + .map_delete_elem = bpf_cgroup_storage_delete_elem, > >> + .map_check_btf = bpf_local_storage_map_check_btf, > >> + .map_btf_id = &cgroup_storage_map_btf_ids[0], > >> + .map_owner_storage_ptr = cgroup_storage_ptr, > >> +}; > >> + > >> +const struct bpf_func_proto bpf_cgroup_storage_get_proto = { > >> + .func = bpf_cgroup_storage_get, > >> + .gpl_only = false, > >> + .ret_type = RET_PTR_TO_MAP_VALUE_OR_NULL, > >> + .arg1_type = ARG_CONST_MAP_PTR, > >> + .arg2_type = ARG_PTR_TO_BTF_ID, > >> + .arg2_btf_id = &bpf_cgroup_btf_id[0], > >> + .arg3_type = ARG_PTR_TO_MAP_VALUE_OR_NULL, > >> + .arg4_type = ARG_ANYTHING, > >> +}; > >> + > >> +const struct bpf_func_proto bpf_cgroup_storage_delete_proto = { > >> + .func = bpf_cgroup_storage_delete, > >> + .gpl_only = false, > >> + .ret_type = RET_INTEGER, > >> + .arg1_type = ARG_CONST_MAP_PTR, > >> + .arg2_type = ARG_PTR_TO_BTF_ID, > >> + .arg2_btf_id = &bpf_cgroup_btf_id[0], > >> +}; > >> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c 
> >> index a6b04faed282..5c5bb08832ec 100644 > >> --- a/kernel/bpf/helpers.c > >> +++ b/kernel/bpf/helpers.c > >> @@ -1663,6 +1663,12 @@ bpf_base_func_proto(enum bpf_func_id func_id) > >> return &bpf_dynptr_write_proto; > >> case BPF_FUNC_dynptr_data: > >> return &bpf_dynptr_data_proto; > >> +#ifdef CONFIG_CGROUPS > >> + case BPF_FUNC_cgroup_local_storage_get: > >> + return &bpf_cgroup_storage_get_proto; > >> + case BPF_FUNC_cgroup_local_storage_delete: > >> + return &bpf_cgroup_storage_delete_proto; > >> +#endif > >> default: > >> break; > >> } > >> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c > >> index 7b373a5e861f..e53c7fae6e22 100644 > >> --- a/kernel/bpf/syscall.c > >> +++ b/kernel/bpf/syscall.c > >> @@ -1016,7 +1016,8 @@ static int map_check_btf(struct bpf_map *map, > >> const struct btf *btf, > >> map->map_type != BPF_MAP_TYPE_CGROUP_STORAGE && > >> map->map_type != BPF_MAP_TYPE_SK_STORAGE && > >> map->map_type != BPF_MAP_TYPE_INODE_STORAGE && > >> - map->map_type != BPF_MAP_TYPE_TASK_STORAGE) > >> + map->map_type != BPF_MAP_TYPE_TASK_STORAGE && > >> + map->map_type != BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE) > >> return -ENOTSUPP; > >> if (map->spin_lock_off + sizeof(struct bpf_spin_lock) > > >> map->value_size) { > >> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c > >> index 6f6d2d511c06..f36f6a3c0d50 100644 > >> --- a/kernel/bpf/verifier.c > >> +++ b/kernel/bpf/verifier.c > >> @@ -6360,6 +6360,11 @@ static int check_map_func_compatibility(struct > >> bpf_verifier_env *env, > >> func_id != BPF_FUNC_task_storage_delete) > >> goto error; > >> break; > >> + case BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE: > >> + if (func_id != BPF_FUNC_cgroup_local_storage_get && > >> + func_id != BPF_FUNC_cgroup_local_storage_delete) > >> + goto error; > >> + break; > >> case BPF_MAP_TYPE_BLOOM_FILTER: > >> if (func_id != BPF_FUNC_map_peek_elem && > >> func_id != BPF_FUNC_map_push_elem) > >> @@ -6472,6 +6477,11 @@ static int check_map_func_compatibility(struct > >> 
bpf_verifier_env *env, > >> if (map->map_type != BPF_MAP_TYPE_TASK_STORAGE) > >> goto error; > >> break; > >> + case BPF_FUNC_cgroup_local_storage_get: > >> + case BPF_FUNC_cgroup_local_storage_delete: > >> + if (map->map_type != BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE) > >> + goto error; > >> + break; > >> default: > >> break; > >> } > >> @@ -12713,6 +12723,7 @@ static int check_map_prog_compatibility(struct > >> bpf_verifier_env *env, > >> case BPF_MAP_TYPE_INODE_STORAGE: > >> case BPF_MAP_TYPE_SK_STORAGE: > >> case BPF_MAP_TYPE_TASK_STORAGE: > >> + case BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE: > >> break; > >> default: > >> verbose(env, > >> @@ -14149,7 +14160,8 @@ static int do_misc_fixups(struct > >> bpf_verifier_env *env) > > > >> if (insn->imm == BPF_FUNC_task_storage_get || > >> insn->imm == BPF_FUNC_sk_storage_get || > >> - insn->imm == BPF_FUNC_inode_storage_get) { > >> + insn->imm == BPF_FUNC_inode_storage_get || > >> + insn->imm == BPF_FUNC_cgroup_local_storage_get) { > >> if (env->prog->aux->sleepable) > >> insn_buf[0] = BPF_MOV64_IMM(BPF_REG_5, (__force > >> __s32)GFP_KERNEL); > >> else > >> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c > >> index 8ad2c267ff47..2fa2c950c7fb 100644 > >> --- a/kernel/cgroup/cgroup.c > >> +++ b/kernel/cgroup/cgroup.c > >> @@ -985,6 +985,10 @@ void put_css_set_locked(struct css_set *cset) > >> put_css_set_locked(cset->dom_cset); > >> } > > > >> +#ifdef CONFIG_BPF_SYSCALL > >> + bpf_local_cgroup_storage_free(cset->dfl_cgrp); > >> +#endif > >> + > >> kfree_rcu(cset, rcu_head); > >> } > > > >> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c > >> index 688552df95ca..179adaae4a9f 100644 > >> --- a/kernel/trace/bpf_trace.c > >> +++ b/kernel/trace/bpf_trace.c > >> @@ -1454,6 +1454,10 @@ bpf_tracing_func_proto(enum bpf_func_id > >> func_id, const struct bpf_prog *prog) > >> return &bpf_get_current_cgroup_id_proto; > >> case BPF_FUNC_get_current_ancestor_cgroup_id: > >> return 
&bpf_get_current_ancestor_cgroup_id_proto; > >> + case BPF_FUNC_cgroup_local_storage_get: > >> + return &bpf_cgroup_storage_get_proto; > >> + case BPF_FUNC_cgroup_local_storage_delete: > >> + return &bpf_cgroup_storage_delete_proto; > >> #endif > >> case BPF_FUNC_send_signal: > >> return &bpf_send_signal_proto; > >> diff --git a/scripts/bpf_doc.py b/scripts/bpf_doc.py > >> index c0e6690be82a..fdb0aff8cb5a 100755 > >> --- a/scripts/bpf_doc.py > >> +++ b/scripts/bpf_doc.py > >> @@ -685,6 +685,7 @@ class PrinterHelpers(Printer): > >> 'struct udp6_sock', > >> 'struct unix_sock', > >> 'struct task_struct', > >> + 'struct cgroup', > > > >> 'struct __sk_buff', > >> 'struct sk_msg_md', > >> @@ -742,6 +743,7 @@ class PrinterHelpers(Printer): > >> 'struct udp6_sock', > >> 'struct unix_sock', > >> 'struct task_struct', > >> + 'struct cgroup', > >> 'struct path', > >> 'struct btf_ptr', > >> 'struct inode', > >> diff --git a/tools/include/uapi/linux/bpf.h > >> b/tools/include/uapi/linux/bpf.h > >> index 17f61338f8f8..d918b4054297 100644 > >> --- a/tools/include/uapi/linux/bpf.h > >> +++ b/tools/include/uapi/linux/bpf.h > >> @@ -935,6 +935,7 @@ enum bpf_map_type { > >> BPF_MAP_TYPE_TASK_STORAGE, > >> BPF_MAP_TYPE_BLOOM_FILTER, > >> BPF_MAP_TYPE_USER_RINGBUF, > >> + BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, > >> }; > > > >> /* Note that tracing related programs such as > >> @@ -5435,6 +5436,42 @@ union bpf_attr { > >> * **-E2BIG** if user-space has tried to publish a sample > >> which is > >> * larger than the size of the ring buffer, or which cannot fit > >> * within a struct bpf_dynptr. > >> + * > >> + * void *bpf_cgroup_local_storage_get(struct bpf_map *map, struct > >> cgroup *cgroup, void *value, u64 flags) > >> + * Description > >> + * Get a bpf_local_storage from the *cgroup*. > >> + * > >> + * Logically, it could be thought of as getting the value from > >> + * a *map* with *cgroup* as the **key**. 
From this > >> + * perspective, the usage is not much different from > >> + * **bpf_map_lookup_elem**\ (*map*, **&**\ *cgroup*) except this > >> + * helper enforces the key must be a cgroup struct and the map > >> must also > >> + * be a **BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE**. > >> + * > >> + * Underneath, the value is stored locally at *cgroup* instead of > >> + * the *map*. The *map* is used as the bpf-local-storage > >> + * "type". The bpf-local-storage "type" (i.e. the *map*) is > >> + * searched against all bpf_local_storage residing at *cgroup*. > >> + * > >> + * An optional *flags* (**BPF_LOCAL_STORAGE_GET_F_CREATE**) > >> can be > >> + * used such that a new bpf_local_storage will be > >> + * created if one does not exist. *value* can be used > >> + * together with **BPF_LOCAL_STORAGE_GET_F_CREATE** to specify > >> + * the initial value of a bpf_local_storage. If *value* is > >> + * **NULL**, the new bpf_local_storage will be zero initialized. > >> + * Return > >> + * A bpf_local_storage pointer is returned on success. > >> + * > >> + * **NULL** if not found or there was an error in adding > >> + * a new bpf_local_storage. > >> + * > >> + * long bpf_cgroup_local_storage_delete(struct bpf_map *map, struct > >> cgroup *cgroup) > >> + * Description > >> + * Delete a bpf_local_storage from a *cgroup*. > >> + * Return > >> + * 0 on success. > >> + * > >> + * **-ENOENT** if the bpf_local_storage cannot be found. > >> */ > >> #define ___BPF_FUNC_MAPPER(FN, ctx...) \ > >> FN(unspec, 0, ##ctx) \ > >> @@ -5647,6 +5684,8 @@ union bpf_attr { > >> FN(tcp_raw_check_syncookie_ipv6, 207, ##ctx) \ > >> FN(ktime_get_tai_ns, 208, ##ctx) \ > >> FN(user_ringbuf_drain, 209, ##ctx) \ > >> + FN(cgroup_local_storage_get, 210, ##ctx) \ > >> + FN(cgroup_local_storage_delete, 211, ##ctx) \ > >> /* */ > > > >> /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER > >> that don't > >> -- > >> 2.30.2 > > ^ permalink raw reply [flat|nested] 38+ messages in thread
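(Aside, for the "cgroup fd as a key" user-space operations mentioned in the commit message, a rough libbpf sketch — the path and value layout are hypothetical, and this only works against a kernel with this series applied:)

```c
#include <fcntl.h>
#include <unistd.h>
#include <bpf/bpf.h>

/* The syscall-side key for this map type is a cgroup *fd*, not a
 * cgroup id: open the cgroup directory and pass the fd as the key.
 */
static int read_cgroup_counter(int map_fd, const char *cgrp_path, __u64 *cnt)
{
	struct { __u64 cnt; } val;	/* must match the map's value type */
	int cg_fd, err;

	cg_fd = open(cgrp_path, O_RDONLY | O_DIRECTORY);
	if (cg_fd < 0)
		return -1;

	err = bpf_map_lookup_elem(map_fd, &cg_fd, &val);
	if (!err)
		*cnt = val.cnt;
	close(cg_fd);	/* map does not keep the fd; safe to close */
	return err;
}
```

(bpf_map_update_elem() and bpf_map_delete_elem() take the same fd-valued key; get_next_key is -ENOTSUPP per the patch.)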
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs 2022-10-17 18:01 ` sdf 2022-10-17 18:25 ` Yosry Ahmed 2022-10-17 19:23 ` Yonghong Song @ 2022-10-17 22:26 ` Martin KaFai Lau 2 siblings, 0 replies; 38+ messages in thread From: Martin KaFai Lau @ 2022-10-17 22:26 UTC (permalink / raw) To: sdf Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo, Yonghong Song On 10/17/22 11:01 AM, sdf@google.com wrote: >> +static bool bpf_cgroup_storage_trylock(void) >> +{ >> + migrate_disable(); >> + if (unlikely(this_cpu_inc_return(bpf_cgroup_storage_busy) != 1)) { >> + this_cpu_dec(bpf_cgroup_storage_busy); >> + migrate_enable(); >> + return false; >> + } >> + return true; >> +} > > Task storage has lock/unlock/trylock; inode storage doesn't; why does > cgroup need it as well? This was added in bc235cdb423a2 to avoid deadlock for tracing program which can get a hold to the same task ptr easily with bpf_get_current_task_btf(). I believe there was no known way to hit this problem in inode storage, so inode storage does not use it. The common tracing use case to get a hold of the cgroup ptr is through task (including bpf_get_current_task_btf()), so it seems to make sense to mimic the trylock here. I have plan to relax it for all non-tracing programs like cgroup-bpf and bpf-lsm. ^ permalink raw reply [flat|nested] 38+ messages in thread
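(To make the recursion scenario above concrete, a hypothetical chain that the per-cpu busy counter is guarding against — not real code, just the shape of the deadlock from bc235cdb423a:)

```c
/* CPU 0, without bpf_cgroup_storage_trylock():
 *
 *   bpf_map_delete_elem(cgrp_map, &cg_fd)      syscall path takes
 *     -> raw_spin_lock(&local_storage->lock)   the storage lock
 *       -> a tracing prog attached to a        fentry/kprobe fires on a
 *          function called under that lock     function in the locked region
 *         -> bpf_cgroup_local_storage_get()    re-enters storage code and
 *           -> raw_spin_lock(&...->lock)       spins on the same lock: deadlock
 *
 * With the trylock, the nested entry sees bpf_cgroup_storage_busy != 1
 * and returns NULL/-EBUSY instead of taking the lock again.
 */
```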
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs 2022-10-14 4:56 ` [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs Yonghong Song 2022-10-17 18:01 ` sdf @ 2022-10-17 18:16 ` David Vernet 2022-10-17 19:45 ` Yonghong Song 1 sibling, 1 reply; 38+ messages in thread From: David Vernet @ 2022-10-17 18:16 UTC (permalink / raw) To: Yonghong Song Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo On Thu, Oct 13, 2022 at 09:56:30PM -0700, Yonghong Song wrote: > Similar to sk/inode/task storage, implement similar cgroup local storage. > > There already exists a local storage implementation for cgroup-attached > bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper > bpf_get_local_storage(). But there are use cases such that non-cgroup > attached bpf progs wants to access cgroup local storage data. For example, > tc egress prog has access to sk and cgroup. It is possible to use > sk local storage to emulate cgroup local storage by storing data in socket. > But this is a waste as it could be lots of sockets belonging to a particular > cgroup. Alternatively, a separate map can be created with cgroup id as the key. > But this will introduce additional overhead to manipulate the new map. > A cgroup local storage, similar to existing sk/inode/task storage, > should help for this use case. > > The life-cycle of storage is managed with the life-cycle of the > cgroup struct. i.e. the storage is destroyed along with the owning cgroup > with a callback to the bpf_cgroup_storage_free when cgroup itself > is deleted. > > The userspace map operations can be done by using a cgroup fd as a key > passed to the lookup, update and delete operations. 
> > Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old cgroup local > storage support, the new map name BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE is used > for cgroup storage available to non-cgroup-attached bpf programs. The two > helpers are named as bpf_cgroup_local_storage_get() and > bpf_cgroup_local_storage_delete(). > > Signed-off-by: Yonghong Song <yhs@fb.com> > --- > include/linux/bpf.h | 3 + > include/linux/bpf_types.h | 1 + > include/linux/cgroup-defs.h | 4 + > include/uapi/linux/bpf.h | 39 +++++ > kernel/bpf/Makefile | 2 +- > kernel/bpf/bpf_cgroup_storage.c | 280 ++++++++++++++++++++++++++++++++ > kernel/bpf/helpers.c | 6 + > kernel/bpf/syscall.c | 3 +- > kernel/bpf/verifier.c | 14 +- > kernel/cgroup/cgroup.c | 4 + > kernel/trace/bpf_trace.c | 4 + > scripts/bpf_doc.py | 2 + > tools/include/uapi/linux/bpf.h | 39 +++++ > 13 files changed, 398 insertions(+), 3 deletions(-) > create mode 100644 kernel/bpf/bpf_cgroup_storage.c > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h > index 9e7d46d16032..1395a01c7f18 100644 > --- a/include/linux/bpf.h > +++ b/include/linux/bpf.h > @@ -2045,6 +2045,7 @@ struct bpf_link *bpf_link_by_id(u32 id); > > const struct bpf_func_proto *bpf_base_func_proto(enum bpf_func_id func_id); > void bpf_task_storage_free(struct task_struct *task); > +void bpf_local_cgroup_storage_free(struct cgroup *cgroup); > bool bpf_prog_has_kfunc_call(const struct bpf_prog *prog); > const struct btf_func_model * > bpf_jit_find_kfunc_model(const struct bpf_prog *prog, > @@ -2537,6 +2538,8 @@ extern const struct bpf_func_proto bpf_copy_from_user_task_proto; > extern const struct bpf_func_proto bpf_set_retval_proto; > extern const struct bpf_func_proto bpf_get_retval_proto; > extern const struct bpf_func_proto bpf_user_ringbuf_drain_proto; > +extern const struct bpf_func_proto bpf_cgroup_storage_get_proto; > +extern const struct bpf_func_proto bpf_cgroup_storage_delete_proto; > > const struct bpf_func_proto *tracing_prog_func_proto( > 
enum bpf_func_id func_id, const struct bpf_prog *prog); > diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h > index 2c6a4f2562a7..7a0362d7a0aa 100644 > --- a/include/linux/bpf_types.h > +++ b/include/linux/bpf_types.h > @@ -90,6 +90,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_ARRAY, cgroup_array_map_ops) > #ifdef CONFIG_CGROUP_BPF > BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_STORAGE, cgroup_storage_map_ops) > BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE, cgroup_storage_map_ops) > +BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, cgroup_local_storage_map_ops) Did you mean to compile this out if !CONFIG_CGROUP_BPF? It looks like we're using CONFIG_BPF_SYSCALL elsewhere, which makes sense if we're keeping CONFIG_CGROUP_BPF for programs attaching to cgroups. Or maybe we should put it in CONFIG_CGROUPS, which is what we use when compiling bpf_cgroup_storage.o and the other relevant helpers? Also, would you mind please adding comments here explaining what the difference is between these map types? In terms of readability, BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE and BPF_MAP_TYPE_CGROUP_STORAGE are nearly identical, and adding to the confusion, BPF_MAP_TYPE_CGROUP_STORAGE is itself accessed with the bpf_get_local_storage() helper. I feel like we need to be quite verbose about the difference here or users are going to be confused when trying to figure out the differences between these map types. 
> #endif > BPF_MAP_TYPE(BPF_MAP_TYPE_HASH, htab_map_ops) > BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_HASH, htab_percpu_map_ops) > diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h > index 4bcf56b3491c..c6f4590dda68 100644 > --- a/include/linux/cgroup-defs.h > +++ b/include/linux/cgroup-defs.h > @@ -504,6 +504,10 @@ struct cgroup { > /* Used to store internal freezer state */ > struct cgroup_freezer_state freezer; > > +#ifdef CONFIG_BPF_SYSCALL As alluded to above, I assume this should _not_ be: #ifdef CONFIG_CGROUP_BPF Just wanted to highlight it to make sure we're being consistent. > + struct bpf_local_storage __rcu *bpf_cgroup_storage; > +#endif > + > /* ids of the ancestors at each level including self */ > u64 ancestor_ids[]; > }; > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > index 17f61338f8f8..d918b4054297 100644 > --- a/include/uapi/linux/bpf.h > +++ b/include/uapi/linux/bpf.h > @@ -935,6 +935,7 @@ enum bpf_map_type { > BPF_MAP_TYPE_TASK_STORAGE, > BPF_MAP_TYPE_BLOOM_FILTER, > BPF_MAP_TYPE_USER_RINGBUF, > + BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, > }; > > /* Note that tracing related programs such as > @@ -5435,6 +5436,42 @@ union bpf_attr { > * **-E2BIG** if user-space has tried to publish a sample which is > * larger than the size of the ring buffer, or which cannot fit > * within a struct bpf_dynptr. > + * > + * void *bpf_cgroup_local_storage_get(struct bpf_map *map, struct cgroup *cgroup, void *value, u64 flags) I think it will be easy for users to get confused here with bpf_get_local_storage(), which even mentions "cgroup local storage" in the description: 3338 * void *bpf_get_local_storage(void *map, u64 flags) 3339 * Description 3340 * Get the pointer to the local storage area. 3341 * The type and the size of the local storage is defined 3342 * by the *map* argument. 3343 * The *flags* meaning is specific for each map type, 3344 * and has to be 0 for cgroup local storage. 
It would have been nice if, instead of defining an entirely new helper, we could update enum bpf_cgroup_storage_type to include a third type of cgroup storage, something like: BPF_CGROUP_STORAGE_LOCAL That of course doesn't work for bpf_get_local_storage() though, which doesn't take a struct cgroup * argument. So I think what you're proposing is fine, though I would again suggest that we explicitly spell out the difference between bpf_cgroup_local_storage_get() and bpf_get_local_storage(). Alternatively, do we have any intention of deprecating the older cgroup storage map types? What you're proposing here feels like a more canonical and ergonomic API, so it'd be nice to guide folks towards this as the proper cgroup local storage map at some point. Also, one more nit / thought, but should we change the name to: void *bpf_cgroup_storage_get() This more closely matches the equivalent for task local storage: bpf_task_storage_get(). > + * Description > + * Get a bpf_local_storage from the *cgroup*. > + * > + * Logically, it could be thought of as getting the value from > + * a *map* with *cgroup* as the **key**. From this > + * perspective, the usage is not much different from > + * **bpf_map_lookup_elem**\ (*map*, **&**\ *cgroup*) except this > + * helper enforces the key must be a cgroup struct and the map must also > + * be a **BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE**. > + * > + * Underneath, the value is stored locally at *cgroup* instead of > + * the *map*. The *map* is used as the bpf-local-storage > + * "type". The bpf-local-storage "type" (i.e. the *map*) is > + * searched against all bpf_local_storage residing at *cgroup*. > + * > + * An optional *flags* (**BPF_LOCAL_STORAGE_GET_F_CREATE**) can be > + * used such that a new bpf_local_storage will be > + * created if one does not exist. *value* can be used > + * together with **BPF_LOCAL_STORAGE_GET_F_CREATE** to specify > + * the initial value of a bpf_local_storage. 
If *value* is > + * **NULL**, the new bpf_local_storage will be zero initialized. > + * Return > + * A bpf_local_storage pointer is returned on success. > + * > + * **NULL** if not found or there was an error in adding > + * a new bpf_local_storage. > + * > + * long bpf_cgroup_local_storage_delete(struct bpf_map *map, struct cgroup *cgroup) Same question here r.e. name. Is bpf_cgroup_storage_delete() more consistent with local storage existing helpers such as bpf_task_storage_delete()? > + * Description > + * Delete a bpf_local_storage from a *cgroup*. > + * Return > + * 0 on success. > + * > + * **-ENOENT** if the bpf_local_storage cannot be found. > */ > #define ___BPF_FUNC_MAPPER(FN, ctx...) \ > FN(unspec, 0, ##ctx) \ > @@ -5647,6 +5684,8 @@ union bpf_attr { > FN(tcp_raw_check_syncookie_ipv6, 207, ##ctx) \ > FN(ktime_get_tai_ns, 208, ##ctx) \ > FN(user_ringbuf_drain, 209, ##ctx) \ > + FN(cgroup_local_storage_get, 210, ##ctx) \ > + FN(cgroup_local_storage_delete, 211, ##ctx) \ > /* */ > > /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that don't > diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile > index 341c94f208f4..b02693f51978 100644 > --- a/kernel/bpf/Makefile > +++ b/kernel/bpf/Makefile > @@ -25,7 +25,7 @@ ifeq ($(CONFIG_PERF_EVENTS),y) > obj-$(CONFIG_BPF_SYSCALL) += stackmap.o > endif > ifeq ($(CONFIG_CGROUPS),y) > -obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o > +obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o bpf_cgroup_storage.o > endif > obj-$(CONFIG_CGROUP_BPF) += cgroup.o > ifeq ($(CONFIG_INET),y) > diff --git a/kernel/bpf/bpf_cgroup_storage.c b/kernel/bpf/bpf_cgroup_storage.c > new file mode 100644 > index 000000000000..9974784822da > --- /dev/null > +++ b/kernel/bpf/bpf_cgroup_storage.c > @@ -0,0 +1,280 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/* > + * Copyright (c) 2022 Meta Platforms, Inc. and affiliates. 
> + */ > + > +#include <linux/types.h> > +#include <linux/bpf.h> > +#include <linux/bpf_local_storage.h> > +#include <uapi/linux/btf.h> > +#include <linux/btf_ids.h> > + > +DEFINE_BPF_STORAGE_CACHE(cgroup_cache); > + > +static DEFINE_PER_CPU(int, bpf_cgroup_storage_busy); > + > +static void bpf_cgroup_storage_lock(void) > +{ > + migrate_disable(); > + this_cpu_inc(bpf_cgroup_storage_busy); > +} > + > +static void bpf_cgroup_storage_unlock(void) > +{ > + this_cpu_dec(bpf_cgroup_storage_busy); > + migrate_enable(); > +} > + > +static bool bpf_cgroup_storage_trylock(void) > +{ > + migrate_disable(); > + if (unlikely(this_cpu_inc_return(bpf_cgroup_storage_busy) != 1)) { > + this_cpu_dec(bpf_cgroup_storage_busy); > + migrate_enable(); > + return false; > + } > + return true; > +} > + > +static struct bpf_local_storage __rcu **cgroup_storage_ptr(void *owner) > +{ > + struct cgroup *cg = owner; > + > + return &cg->bpf_cgroup_storage; > +} > + > +void bpf_local_cgroup_storage_free(struct cgroup *cgroup) > +{ > + struct bpf_local_storage *local_storage; > + struct bpf_local_storage_elem *selem; > + bool free_cgroup_storage = false; > + struct hlist_node *n; > + unsigned long flags; > + > + rcu_read_lock(); > + local_storage = rcu_dereference(cgroup->bpf_cgroup_storage); > + if (!local_storage) { > + rcu_read_unlock(); > + return; > + } > + > + /* Neither the bpf_prog nor the bpf-map's syscall > + * could be modifying the local_storage->list now. > + * Thus, no elem can be added-to or deleted-from the > + * local_storage->list by the bpf_prog or by the bpf-map's syscall. > + * > + * It is racing with bpf_local_storage_map_free() alone > + * when unlinking elem from the local_storage->list and > + * the map's bucket->list. 
> + */ > + bpf_cgroup_storage_lock(); > + raw_spin_lock_irqsave(&local_storage->lock, flags); > + hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) { > + bpf_selem_unlink_map(selem); > + free_cgroup_storage = > + bpf_selem_unlink_storage_nolock(local_storage, selem, false, false); Could this overwrite a previously-true free_cgroup_storage if one of these entries is false? Did you mean to do something like this? if (bpf_selem_unlink_storage_nolock(local_storage, selem, false, false)) free_cgroup_storage = true; > + } > + raw_spin_unlock_irqrestore(&local_storage->lock, flags); > + bpf_cgroup_storage_unlock(); > + rcu_read_unlock(); > + > + /* free_cgroup_storage should always be true as long as > + * local_storage->list was non-empty. > + */ > + if (free_cgroup_storage) > + kfree_rcu(local_storage, rcu); > +} > + > +static struct bpf_local_storage_data * > +cgroup_storage_lookup(struct cgroup *cgroup, struct bpf_map *map, bool cacheit_lockit) > +{ > + struct bpf_local_storage *cgroup_storage; > + struct bpf_local_storage_map *smap; > + > + cgroup_storage = rcu_dereference_check(cgroup->bpf_cgroup_storage, > + bpf_rcu_lock_held()); > + if (!cgroup_storage) > + return NULL; > + > + smap = (struct bpf_local_storage_map *)map; > + return bpf_local_storage_lookup(cgroup_storage, smap, cacheit_lockit); > +} > + > +static void *bpf_cgroup_storage_lookup_elem(struct bpf_map *map, void *key) > +{ > + struct bpf_local_storage_data *sdata; > + struct cgroup *cgroup; > + int fd; > + > + fd = *(int *)key; > + cgroup = cgroup_get_from_fd(fd); > + if (IS_ERR(cgroup)) > + return ERR_CAST(cgroup); > + > + bpf_cgroup_storage_lock(); > + sdata = cgroup_storage_lookup(cgroup, map, true); > + bpf_cgroup_storage_unlock(); > + cgroup_put(cgroup); > + return sdata ? 
sdata->data : NULL; > +} > + > +static int bpf_cgroup_storage_update_elem(struct bpf_map *map, void *key, > + void *value, u64 map_flags) > +{ > + struct bpf_local_storage_data *sdata; > + struct cgroup *cgroup; > + int err, fd; > + > + fd = *(int *)key; > + cgroup = cgroup_get_from_fd(fd); > + if (IS_ERR(cgroup)) > + return PTR_ERR(cgroup); > + > + bpf_cgroup_storage_lock(); > + sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map *)map, > + value, map_flags, GFP_ATOMIC); > + bpf_cgroup_storage_unlock(); > + err = PTR_ERR_OR_ZERO(sdata); > + cgroup_put(cgroup); > + return err; Optional suggestion, but perhaps this is slightly more concise: bpf_cgroup_storage_unlock(); cgroup_put(cgroup); return PTR_ERR_OR_ZERO(sdata); > +} > + > +static int cgroup_storage_delete(struct cgroup *cgroup, struct bpf_map *map) > +{ > + struct bpf_local_storage_data *sdata; > + > + sdata = cgroup_storage_lookup(cgroup, map, false); > + if (!sdata) > + return -ENOENT; > + > + bpf_selem_unlink(SELEM(sdata), true); > + return 0; > +} > + > +static int bpf_cgroup_storage_delete_elem(struct bpf_map *map, void *key) > +{ > + struct cgroup *cgroup; > + int err, fd; > + > + fd = *(int *)key; > + cgroup = cgroup_get_from_fd(fd); > + if (IS_ERR(cgroup)) > + return PTR_ERR(cgroup); > + > + bpf_cgroup_storage_lock(); > + err = cgroup_storage_delete(cgroup, map); > + bpf_cgroup_storage_unlock(); > + if (err) > + return err; Doesn't this error path leak the cgroup? Maybe this would be cleaner: bpf_cgroup_storage_lock(); err = cgroup_storage_delete(cgroup, map); bpf_cgroup_storage_unlock(); cgroup_put(cgroup); return err; > + > + cgroup_put(cgroup); > + return 0; > +} > + [...] Thanks, David ^ permalink raw reply [flat|nested] 38+ messages in thread
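[Editorial note: the free_cgroup_storage concern raised above is about a boolean being assigned inside a loop. The sketch below, with hypothetical names, contrasts the two shapes: with arbitrary per-element results, plain assignment keeps only the last result, while the if-form suggested in the review keeps the flag sticky once set. (Yonghong's follow-up notes that in the kernel the "storage now empty" result can only come from the final element, which is why the plain assignment happens to be safe there.)]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Each entry of unlink_freed[] stands in for one call's return value
 * from a per-element unlink helper ("did this unlink empty the
 * storage?"). Hypothetical shape, not the kernel API. */
static bool drain_plain(const bool *unlink_freed, size_t n)
{
	bool freed = false;

	for (size_t i = 0; i < n; i++)
		freed = unlink_freed[i];   /* a later false overwrites an earlier true */
	return freed;
}

static bool drain_sticky(const bool *unlink_freed, size_t n)
{
	bool freed = false;

	for (size_t i = 0; i < n; i++)
		if (unlink_freed[i])
			freed = true;      /* sticky once set, as suggested in the review */
	return freed;
}
```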
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs 2022-10-17 18:16 ` David Vernet @ 2022-10-17 19:45 ` Yonghong Song 0 siblings, 0 replies; 38+ messages in thread From: Yonghong Song @ 2022-10-17 19:45 UTC (permalink / raw) To: David Vernet, Yonghong Song Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo On 10/17/22 11:16 AM, David Vernet wrote: > On Thu, Oct 13, 2022 at 09:56:30PM -0700, Yonghong Song wrote: >> Similar to sk/inode/task storage, implement similar cgroup local storage. >> >> There already exists a local storage implementation for cgroup-attached >> bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper >> bpf_get_local_storage(). But there are use cases such that non-cgroup >> attached bpf progs wants to access cgroup local storage data. For example, >> tc egress prog has access to sk and cgroup. It is possible to use >> sk local storage to emulate cgroup local storage by storing data in socket. >> But this is a waste as it could be lots of sockets belonging to a particular >> cgroup. Alternatively, a separate map can be created with cgroup id as the key. >> But this will introduce additional overhead to manipulate the new map. >> A cgroup local storage, similar to existing sk/inode/task storage, >> should help for this use case. >> >> The life-cycle of storage is managed with the life-cycle of the >> cgroup struct. i.e. the storage is destroyed along with the owning cgroup >> with a callback to the bpf_cgroup_storage_free when cgroup itself >> is deleted. >> >> The userspace map operations can be done by using a cgroup fd as a key >> passed to the lookup, update and delete operations. >> >> Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old cgroup local >> storage support, the new map name BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE is used >> for cgroup storage available to non-cgroup-attached bpf programs. 
The two >> helpers are named as bpf_cgroup_local_storage_get() and >> bpf_cgroup_local_storage_delete(). >> >> Signed-off-by: Yonghong Song <yhs@fb.com> >> --- >> include/linux/bpf.h | 3 + >> include/linux/bpf_types.h | 1 + >> include/linux/cgroup-defs.h | 4 + >> include/uapi/linux/bpf.h | 39 +++++ >> kernel/bpf/Makefile | 2 +- >> kernel/bpf/bpf_cgroup_storage.c | 280 ++++++++++++++++++++++++++++++++ >> kernel/bpf/helpers.c | 6 + >> kernel/bpf/syscall.c | 3 +- >> kernel/bpf/verifier.c | 14 +- >> kernel/cgroup/cgroup.c | 4 + >> kernel/trace/bpf_trace.c | 4 + >> scripts/bpf_doc.py | 2 + >> tools/include/uapi/linux/bpf.h | 39 +++++ >> 13 files changed, 398 insertions(+), 3 deletions(-) >> create mode 100644 kernel/bpf/bpf_cgroup_storage.c >> >> diff --git a/include/linux/bpf.h b/include/linux/bpf.h >> index 9e7d46d16032..1395a01c7f18 100644 >> --- a/include/linux/bpf.h >> +++ b/include/linux/bpf.h >> @@ -2045,6 +2045,7 @@ struct bpf_link *bpf_link_by_id(u32 id); >> >> const struct bpf_func_proto *bpf_base_func_proto(enum bpf_func_id func_id); >> void bpf_task_storage_free(struct task_struct *task); >> +void bpf_local_cgroup_storage_free(struct cgroup *cgroup); >> bool bpf_prog_has_kfunc_call(const struct bpf_prog *prog); >> const struct btf_func_model * >> bpf_jit_find_kfunc_model(const struct bpf_prog *prog, >> @@ -2537,6 +2538,8 @@ extern const struct bpf_func_proto bpf_copy_from_user_task_proto; >> extern const struct bpf_func_proto bpf_set_retval_proto; >> extern const struct bpf_func_proto bpf_get_retval_proto; >> extern const struct bpf_func_proto bpf_user_ringbuf_drain_proto; >> +extern const struct bpf_func_proto bpf_cgroup_storage_get_proto; >> +extern const struct bpf_func_proto bpf_cgroup_storage_delete_proto; >> >> const struct bpf_func_proto *tracing_prog_func_proto( >> enum bpf_func_id func_id, const struct bpf_prog *prog); >> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h >> index 2c6a4f2562a7..7a0362d7a0aa 100644 >> --- 
a/include/linux/bpf_types.h >> +++ b/include/linux/bpf_types.h >> @@ -90,6 +90,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_ARRAY, cgroup_array_map_ops) >> #ifdef CONFIG_CGROUP_BPF >> BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_STORAGE, cgroup_storage_map_ops) >> BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE, cgroup_storage_map_ops) >> +BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, cgroup_local_storage_map_ops) > > Did you mean to compile this out if !CONFIG_CGROUP_BPF? It looks like > we're using CONFIG_BPF_SYSCALL elsewhere, which makes sense if we're > keeping CONFIG_CGROUP_BPF for programs attaching to cgroups. Or maybe we > should put it in CONFIG_CGROUPS, which is what we use when compiling > bpf_cgroup_storage.o and the other relevant helpers? BPF_MAP_TYPE is defined as #define BPF_MAP_TYPE(_id, _ops) \ extern const struct bpf_map_ops _ops; so it should be okay whether it is guarded by CONFIG_CGROUP_BPF or CONFIG_CGROUPS. I am aware some helper-related code/switch-cases are guarded with CONFIG_CGROUPS and I just added my helper there as well. But I will double check that CONFIG_CGROUPS && !CONFIG_CGROUP_BPF can compile properly. > > Also, would you mind please adding comments here explaining what the > difference is between these map types? In terms of readability, > BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE and BPF_MAP_TYPE_CGROUP_STORAGE are > nearly identical, and adding to the confusion, > BPF_MAP_TYPE_CGROUP_STORAGE is itself accessed with the > bpf_get_local_storage() helper. I feel like we need to be quite verbose > about the difference here or users are going to be confused when trying > to figure out the differences between these map types. Agree. Two very similar map names are confusing. I plan to reuse the same map name BPF_MAP_TYPE_CGROUP_STORAGE and add a map-flag to distinguish the two use cases. 
> >> #endif >> BPF_MAP_TYPE(BPF_MAP_TYPE_HASH, htab_map_ops) >> BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_HASH, htab_percpu_map_ops) >> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h >> index 4bcf56b3491c..c6f4590dda68 100644 >> --- a/include/linux/cgroup-defs.h >> +++ b/include/linux/cgroup-defs.h >> @@ -504,6 +504,10 @@ struct cgroup { >> /* Used to store internal freezer state */ >> struct cgroup_freezer_state freezer; >> >> +#ifdef CONFIG_BPF_SYSCALL > > As alluded to above, I assume this should _not_ be: > > #ifdef CONFIG_CGROUP_BPF > > Just wanted to highlight it to make sure we're being consistent. We should be okay here as config CGROUP_BPF bool "Support for eBPF programs attached to cgroups" depends on BPF_SYSCALL select SOCK_CGROUP_DATA But I can change to CONFIG_CGROUP_BPF. > >> + struct bpf_local_storage __rcu *bpf_cgroup_storage; >> +#endif >> + >> /* ids of the ancestors at each level including self */ >> u64 ancestor_ids[]; >> }; >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h >> index 17f61338f8f8..d918b4054297 100644 >> --- a/include/uapi/linux/bpf.h >> +++ b/include/uapi/linux/bpf.h >> @@ -935,6 +935,7 @@ enum bpf_map_type { >> BPF_MAP_TYPE_TASK_STORAGE, >> BPF_MAP_TYPE_BLOOM_FILTER, >> BPF_MAP_TYPE_USER_RINGBUF, >> + BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, >> }; >> >> /* Note that tracing related programs such as >> @@ -5435,6 +5436,42 @@ union bpf_attr { >> * **-E2BIG** if user-space has tried to publish a sample which is >> * larger than the size of the ring buffer, or which cannot fit >> * within a struct bpf_dynptr. 
>> + * >> + * void *bpf_cgroup_local_storage_get(struct bpf_map *map, struct cgroup *cgroup, void *value, u64 flags) > > I think it will be easy for users to get confused here with > bpf_get_local_storage(), which even mentions "cgroup local storage" in > the description: > > 3338 * void *bpf_get_local_storage(void *map, u64 flags) > 3339 * Description > 3340 * Get the pointer to the local storage area. > 3341 * The type and the size of the local storage is defined > 3342 * by the *map* argument. > 3343 * The *flags* meaning is specific for each map type, > 3344 * and has to be 0 for cgroup local storage. > > It would have been nice if, instead of defining an entirely new helper, > we could update enum bpf_cgroup_storage_type to include a third type of > cgroup storage, something like: > > BPF_CGROUP_STORAGE_LOCAL > > That of course doesn't work for bpf_get_local_storage() though, which > doesn't take a struct cgroup * argument. So I think what you're > proposing is fine, though I would again suggest that we explicitly spell > out the difference between bpf_cgroup_local_storage_get() and > bpf_get_local_storage(). Alternatively, do we have any intention of > deprecating the older cgroup storage map types? What you're proposing > here feels like a more canonical and ergonomic API, so it'd be nice to > guide folks towards this as the proper cgroup local storage map at some > point. > > Also, one more nit / thought, but should we change the name to: > > void *bpf_cgroup_storage_get() Ya, I plan to use this in the next revision. Basically bpf_cgroup_storage_get/delete() can be used if flag BPF_F_LOCAL_STORAGE_GENERIC is specified. If the flag BPF_F_LOCAL_STORAGE_GENERIC is not specified, the helper bpf_get_local_storage() can be used. > > This more closely matches the equivalent for task local storage: > bpf_task_storage_get(). > >> + * Description >> + * Get a bpf_local_storage from the *cgroup*. 
>> + * >> + * Logically, it could be thought of as getting the value from >> + * a *map* with *cgroup* as the **key**. From this >> + * perspective, the usage is not much different from >> + * **bpf_map_lookup_elem**\ (*map*, **&**\ *cgroup*) except this >> + * helper enforces the key must be a cgroup struct and the map must also >> + * be a **BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE**. >> + * >> + * Underneath, the value is stored locally at *cgroup* instead of >> + * the *map*. The *map* is used as the bpf-local-storage >> + * "type". The bpf-local-storage "type" (i.e. the *map*) is >> + * searched against all bpf_local_storage residing at *cgroup*. >> + * >> + * An optional *flags* (**BPF_LOCAL_STORAGE_GET_F_CREATE**) can be >> + * used such that a new bpf_local_storage will be >> + * created if one does not exist. *value* can be used >> + * together with **BPF_LOCAL_STORAGE_GET_F_CREATE** to specify >> + * the initial value of a bpf_local_storage. If *value* is >> + * **NULL**, the new bpf_local_storage will be zero initialized. >> + * Return >> + * A bpf_local_storage pointer is returned on success. >> + * >> + * **NULL** if not found or there was an error in adding >> + * a new bpf_local_storage. >> + * >> + * long bpf_cgroup_local_storage_delete(struct bpf_map *map, struct cgroup *cgroup) > > Same question here r.e. name. Is bpf_cgroup_storage_delete() more > consistent with local storage existing helpers such as > bpf_task_storage_delete()? > >> + * Description >> + * Delete a bpf_local_storage from a *cgroup*. >> + * Return >> + * 0 on success. >> + * >> + * **-ENOENT** if the bpf_local_storage cannot be found. >> */ >> #define ___BPF_FUNC_MAPPER(FN, ctx...) 
\ >> FN(unspec, 0, ##ctx) \ >> @@ -5647,6 +5684,8 @@ union bpf_attr { >> FN(tcp_raw_check_syncookie_ipv6, 207, ##ctx) \ >> FN(ktime_get_tai_ns, 208, ##ctx) \ >> FN(user_ringbuf_drain, 209, ##ctx) \ >> + FN(cgroup_local_storage_get, 210, ##ctx) \ >> + FN(cgroup_local_storage_delete, 211, ##ctx) \ >> /* */ >> >> /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that don't >> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile >> index 341c94f208f4..b02693f51978 100644 >> --- a/kernel/bpf/Makefile >> +++ b/kernel/bpf/Makefile >> @@ -25,7 +25,7 @@ ifeq ($(CONFIG_PERF_EVENTS),y) >> obj-$(CONFIG_BPF_SYSCALL) += stackmap.o >> endif >> ifeq ($(CONFIG_CGROUPS),y) >> -obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o >> +obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o bpf_cgroup_storage.o >> endif >> obj-$(CONFIG_CGROUP_BPF) += cgroup.o >> ifeq ($(CONFIG_INET),y) >> diff --git a/kernel/bpf/bpf_cgroup_storage.c b/kernel/bpf/bpf_cgroup_storage.c >> new file mode 100644 >> index 000000000000..9974784822da >> --- /dev/null >> +++ b/kernel/bpf/bpf_cgroup_storage.c >> @@ -0,0 +1,280 @@ >> +// SPDX-License-Identifier: GPL-2.0 >> +/* >> + * Copyright (c) 2022 Meta Platforms, Inc. and affiliates. 
>> + */ >> + >> +#include <linux/types.h> >> +#include <linux/bpf.h> >> +#include <linux/bpf_local_storage.h> >> +#include <uapi/linux/btf.h> >> +#include <linux/btf_ids.h> >> + >> +DEFINE_BPF_STORAGE_CACHE(cgroup_cache); >> + >> +static DEFINE_PER_CPU(int, bpf_cgroup_storage_busy); >> + >> +static void bpf_cgroup_storage_lock(void) >> +{ >> + migrate_disable(); >> + this_cpu_inc(bpf_cgroup_storage_busy); >> +} >> + >> +static void bpf_cgroup_storage_unlock(void) >> +{ >> + this_cpu_dec(bpf_cgroup_storage_busy); >> + migrate_enable(); >> +} >> + >> +static bool bpf_cgroup_storage_trylock(void) >> +{ >> + migrate_disable(); >> + if (unlikely(this_cpu_inc_return(bpf_cgroup_storage_busy) != 1)) { >> + this_cpu_dec(bpf_cgroup_storage_busy); >> + migrate_enable(); >> + return false; >> + } >> + return true; >> +} >> + >> +static struct bpf_local_storage __rcu **cgroup_storage_ptr(void *owner) >> +{ >> + struct cgroup *cg = owner; >> + >> + return &cg->bpf_cgroup_storage; >> +} >> + >> +void bpf_local_cgroup_storage_free(struct cgroup *cgroup) >> +{ >> + struct bpf_local_storage *local_storage; >> + struct bpf_local_storage_elem *selem; >> + bool free_cgroup_storage = false; >> + struct hlist_node *n; >> + unsigned long flags; >> + >> + rcu_read_lock(); >> + local_storage = rcu_dereference(cgroup->bpf_cgroup_storage); >> + if (!local_storage) { >> + rcu_read_unlock(); >> + return; >> + } >> + >> + /* Neither the bpf_prog nor the bpf-map's syscall >> + * could be modifying the local_storage->list now. >> + * Thus, no elem can be added-to or deleted-from the >> + * local_storage->list by the bpf_prog or by the bpf-map's syscall. >> + * >> + * It is racing with bpf_local_storage_map_free() alone >> + * when unlinking elem from the local_storage->list and >> + * the map's bucket->list. 
>> + */ >> + bpf_cgroup_storage_lock(); >> + raw_spin_lock_irqsave(&local_storage->lock, flags); >> + hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) { >> + bpf_selem_unlink_map(selem); >> + free_cgroup_storage = >> + bpf_selem_unlink_storage_nolock(local_storage, selem, false, false); > > Could this overwrite a previously-true free_cgroup_storage if one of > these entries is false? Did you mean to do something like this? I will add a comment here. This should not be the case. > > if (bpf_selem_unlink_storage_nolock(local_storage, selem, false, false)) > free_cgroup_storage = true; > >> + } >> + raw_spin_unlock_irqrestore(&local_storage->lock, flags); >> + bpf_cgroup_storage_unlock(); >> + rcu_read_unlock(); >> + >> + /* free_cgroup_storage should always be true as long as >> + * local_storage->list was non-empty. >> + */ >> + if (free_cgroup_storage) >> + kfree_rcu(local_storage, rcu); >> +} >> + >> +static struct bpf_local_storage_data * >> +cgroup_storage_lookup(struct cgroup *cgroup, struct bpf_map *map, bool cacheit_lockit) >> +{ >> + struct bpf_local_storage *cgroup_storage; >> + struct bpf_local_storage_map *smap; >> + >> + cgroup_storage = rcu_dereference_check(cgroup->bpf_cgroup_storage, >> + bpf_rcu_lock_held()); >> + if (!cgroup_storage) >> + return NULL; >> + >> + smap = (struct bpf_local_storage_map *)map; >> + return bpf_local_storage_lookup(cgroup_storage, smap, cacheit_lockit); >> +} >> + >> +static void *bpf_cgroup_storage_lookup_elem(struct bpf_map *map, void *key) >> +{ >> + struct bpf_local_storage_data *sdata; >> + struct cgroup *cgroup; >> + int fd; >> + >> + fd = *(int *)key; >> + cgroup = cgroup_get_from_fd(fd); >> + if (IS_ERR(cgroup)) >> + return ERR_CAST(cgroup); >> + >> + bpf_cgroup_storage_lock(); >> + sdata = cgroup_storage_lookup(cgroup, map, true); >> + bpf_cgroup_storage_unlock(); >> + cgroup_put(cgroup); >> + return sdata ? 
sdata->data : NULL; >> +} >> + >> +static int bpf_cgroup_storage_update_elem(struct bpf_map *map, void *key, >> + void *value, u64 map_flags) >> +{ >> + struct bpf_local_storage_data *sdata; >> + struct cgroup *cgroup; >> + int err, fd; >> + >> + fd = *(int *)key; >> + cgroup = cgroup_get_from_fd(fd); >> + if (IS_ERR(cgroup)) >> + return PTR_ERR(cgroup); >> + >> + bpf_cgroup_storage_lock(); >> + sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map *)map, >> + value, map_flags, GFP_ATOMIC); >> + bpf_cgroup_storage_unlock(); >> + err = PTR_ERR_OR_ZERO(sdata); >> + cgroup_put(cgroup); >> + return err; > > Optional suggestion, but perhaps this is slightly more concise: > > bpf_cgroup_storage_unlock(); > cgroup_put(cgroup); > return PTR_ERR_OR_ZERO(sdata); Good idea. Will do. > >> +} >> + >> +static int cgroup_storage_delete(struct cgroup *cgroup, struct bpf_map *map) >> +{ >> + struct bpf_local_storage_data *sdata; >> + >> + sdata = cgroup_storage_lookup(cgroup, map, false); >> + if (!sdata) >> + return -ENOENT; >> + >> + bpf_selem_unlink(SELEM(sdata), true); >> + return 0; >> +} >> + >> +static int bpf_cgroup_storage_delete_elem(struct bpf_map *map, void *key) >> +{ >> + struct cgroup *cgroup; >> + int err, fd; >> + >> + fd = *(int *)key; >> + cgroup = cgroup_get_from_fd(fd); >> + if (IS_ERR(cgroup)) >> + return PTR_ERR(cgroup); >> + >> + bpf_cgroup_storage_lock(); >> + err = cgroup_storage_delete(cgroup, map); >> + bpf_cgroup_storage_unlock(); >> + if (err) >> + return err; > > Doesn't this error path leak the cgroup? Maybe this would be cleaner: > > bpf_cgroup_storage_lock(); > err = cgroup_storage_delete(cgroup, map); > bpf_cgroup_storage_unlock(); > cgroup_put(cgroup); > > return err; Thanks for spotting this. Yes, 'return err' here will cause a cgroup reference leak. > >> + >> + cgroup_put(cgroup); >> + return 0; >> +} >> + > > [...] > > Thanks, > David ^ permalink raw reply [flat|nested] 38+ messages in thread
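[Editorial note: the leak confirmed above is the classic early-return-skips-the-put bug. The userspace sketch below, with a hypothetical refcount and stub names (not the kernel's cgroup_get/cgroup_put API), contrasts the buggy shape with the single-exit shape David suggests, where the reference is dropped unconditionally before returning.]

```c
#include <assert.h>

/* Stand-in for an object's reference count. */
static int refcount;

static void obj_get(void) { refcount++; }
static void obj_put(void) { refcount--; }

/* Buggy shape: the early return on error skips obj_put(),
 * leaking one reference per failed call. */
static int delete_leaky(int op_err)
{
	obj_get();
	if (op_err)
		return op_err;   /* reference leaked on this path */
	obj_put();
	return 0;
}

/* Fixed shape: single exit, the reference is dropped regardless
 * of whether the operation itself succeeded. */
static int delete_fixed(int op_err)
{
	int err;

	obj_get();
	err = op_err;            /* stands in for the delete operation */
	obj_put();
	return err;
}
```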
* [PATCH bpf-next 3/5] libbpf: Support new cgroup local storage
  2022-10-14  4:56 [PATCH bpf-next 0/5] bpf: Implement cgroup local storage available to non-cgroup-attached bpf progs Yonghong Song
  2022-10-14  4:56 ` [PATCH bpf-next 1/5] bpf: Make struct cgroup btf id global Yonghong Song
  2022-10-14  4:56 ` [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs Yonghong Song
@ 2022-10-14  4:56 ` Yonghong Song
  2022-10-14  4:56 ` [PATCH bpf-next 4/5] bpftool: " Yonghong Song
  2022-10-14  4:56 ` [PATCH bpf-next 5/5] selftests/bpf: Add selftests for " Yonghong Song

  4 siblings, 0 replies; 38+ messages in thread
From: Yonghong Song @ 2022-10-14  4:56 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	KP Singh, Martin KaFai Lau, Tejun Heo

Add support for new cgroup local storage.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 tools/lib/bpf/libbpf.c        | 1 +
 tools/lib/bpf/libbpf_probes.c | 1 +
 2 files changed, 2 insertions(+)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 8c3f236c86e4..81359eeb5104 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -164,6 +164,7 @@ static const char * const map_type_name[] = {
 	[BPF_MAP_TYPE_TASK_STORAGE]		= "task_storage",
 	[BPF_MAP_TYPE_BLOOM_FILTER]		= "bloom_filter",
 	[BPF_MAP_TYPE_USER_RINGBUF]		= "user_ringbuf",
+	[BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE]	= "cgroup_local_storage",
 };

 static const char * const prog_type_name[] = {
diff --git a/tools/lib/bpf/libbpf_probes.c b/tools/lib/bpf/libbpf_probes.c
index f3a8e8e74eb8..e424de977007 100644
--- a/tools/lib/bpf/libbpf_probes.c
+++ b/tools/lib/bpf/libbpf_probes.c
@@ -221,6 +221,7 @@ static int probe_map_create(enum bpf_map_type map_type)
 	case BPF_MAP_TYPE_SK_STORAGE:
 	case BPF_MAP_TYPE_INODE_STORAGE:
 	case BPF_MAP_TYPE_TASK_STORAGE:
+	case BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE:
 		btf_key_type_id = 1;
 		btf_value_type_id = 3;
 		value_size = 8;
--
2.30.2
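The libbpf change is a one-line addition to a designated-initializer string table. For readers unfamiliar with that pattern, here is a stand-alone sketch of how such a table maps an enum to a display name with a bounds check; the enum values and names here are illustrative, not the real `bpf_map_type` numbering or libbpf's internals:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative subset of a map-type enum. */
enum map_type {
	MAP_TYPE_UNSPEC = 0,
	MAP_TYPE_TASK_STORAGE,
	MAP_TYPE_USER_RINGBUF,
	MAP_TYPE_CGROUP_LOCAL_STORAGE,	/* the newly added entry */
	NR_MAP_TYPES,
};

/* Designated initializers: each slot indexed by its enum value. */
static const char * const map_type_name[] = {
	[MAP_TYPE_TASK_STORAGE]		= "task_storage",
	[MAP_TYPE_USER_RINGBUF]		= "user_ringbuf",
	[MAP_TYPE_CGROUP_LOCAL_STORAGE]	= "cgroup_local_storage",
};

/* Bounds-checked lookup; NULL for out-of-range or unnamed types. */
static const char *map_type_str(enum map_type t)
{
	if ((int)t < 0 ||
	    (size_t)t >= sizeof(map_type_name) / sizeof(map_type_name[0]))
		return NULL;
	return map_type_name[t];	/* may itself be NULL for gaps */
}
```

The appeal of the pattern is that adding a map type is a single line, and any slot without an initializer is implicitly NULL, so the consumer can distinguish "unknown" from "named" cheaply.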
* [PATCH bpf-next 4/5] bpftool: Support new cgroup local storage
  2022-10-14  4:56 [PATCH bpf-next 0/5] bpf: Implement cgroup local storage available to non-cgroup-attached bpf progs Yonghong Song
                   ` (2 preceding siblings ...)
  2022-10-14  4:56 ` [PATCH bpf-next 3/5] libbpf: Support new cgroup local storage Yonghong Song
@ 2022-10-14  4:56 ` Yonghong Song
  2022-10-17 10:26   ` Quentin Monnet
  2022-10-14  4:56 ` [PATCH bpf-next 5/5] selftests/bpf: Add selftests for " Yonghong Song

  4 siblings, 1 reply; 38+ messages in thread
From: Yonghong Song @ 2022-10-14  4:56 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	KP Singh, Martin KaFai Lau, Tejun Heo

Add support for new cgroup local storage

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 tools/bpf/bpftool/Documentation/bpftool-map.rst | 2 +-
 tools/bpf/bpftool/map.c                         | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/bpf/bpftool/Documentation/bpftool-map.rst b/tools/bpf/bpftool/Documentation/bpftool-map.rst
index 7f3b67a8b48f..4c591b10961e 100644
--- a/tools/bpf/bpftool/Documentation/bpftool-map.rst
+++ b/tools/bpf/bpftool/Documentation/bpftool-map.rst
@@ -55,7 +55,7 @@ MAP COMMANDS
 |	| **devmap** | **devmap_hash** | **sockmap** | **cpumap** | **xskmap** | **sockhash**
 |	| **cgroup_storage** | **reuseport_sockarray** | **percpu_cgroup_storage**
 |	| **queue** | **stack** | **sk_storage** | **struct_ops** | **ringbuf** | **inode_storage**
-|	| **task_storage** | **bloom_filter** | **user_ringbuf** }
+|	| **task_storage** | **bloom_filter** | **user_ringbuf** | **cgroup_local_storage** }

 DESCRIPTION
 ===========
diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
index 9a6ca9f31133..ab681dc65316 100644
--- a/tools/bpf/bpftool/map.c
+++ b/tools/bpf/bpftool/map.c
@@ -1459,7 +1459,7 @@ static int do_help(int argc, char **argv)
 		"                 devmap | devmap_hash | sockmap | cpumap | xskmap | sockhash |\n"
 		"                 cgroup_storage | reuseport_sockarray | percpu_cgroup_storage |\n"
 		"                 queue | stack | sk_storage | struct_ops | ringbuf | inode_storage |\n"
-		"                 task_storage | bloom_filter | user_ringbuf }\n"
+		"                 task_storage | bloom_filter | user_ringbuf | cgroup_local_storage }\n"
 		"       " HELP_SPEC_OPTIONS " |\n"
 		"                 {-f|--bpffs} | {-n|--nomount} }\n"
 		"",
--
2.30.2
* Re: [PATCH bpf-next 4/5] bpftool: Support new cgroup local storage
  2022-10-14  4:56 ` [PATCH bpf-next 4/5] bpftool: " Yonghong Song
@ 2022-10-17 10:26   ` Quentin Monnet
  0 siblings, 0 replies; 38+ messages in thread
From: Quentin Monnet @ 2022-10-17 10:26 UTC (permalink / raw)
To: Yonghong Song, bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	KP Singh, Martin KaFai Lau, Tejun Heo

2022-10-13 21:56 UTC-0700 ~ Yonghong Song <yhs@fb.com>
> Add support for new cgroup local storage
>
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  tools/bpf/bpftool/Documentation/bpftool-map.rst | 2 +-
>  tools/bpf/bpftool/map.c                         | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/tools/bpf/bpftool/Documentation/bpftool-map.rst b/tools/bpf/bpftool/Documentation/bpftool-map.rst
> index 7f3b67a8b48f..4c591b10961e 100644
> --- a/tools/bpf/bpftool/Documentation/bpftool-map.rst
> +++ b/tools/bpf/bpftool/Documentation/bpftool-map.rst
> @@ -55,7 +55,7 @@ MAP COMMANDS
>  |	| **devmap** | **devmap_hash** | **sockmap** | **cpumap** | **xskmap** | **sockhash**
>  |	| **cgroup_storage** | **reuseport_sockarray** | **percpu_cgroup_storage**
>  |	| **queue** | **stack** | **sk_storage** | **struct_ops** | **ringbuf** | **inode_storage**
> -|	| **task_storage** | **bloom_filter** | **user_ringbuf** }
> +|	| **task_storage** | **bloom_filter** | **user_ringbuf** | **cgroup_local_storage** }
>
>  DESCRIPTION
>  ===========
> diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
> index 9a6ca9f31133..ab681dc65316 100644
> --- a/tools/bpf/bpftool/map.c
> +++ b/tools/bpf/bpftool/map.c
> @@ -1459,7 +1459,7 @@ static int do_help(int argc, char **argv)
>  		"                 devmap | devmap_hash | sockmap | cpumap | xskmap | sockhash |\n"
>  		"                 cgroup_storage | reuseport_sockarray | percpu_cgroup_storage |\n"
>  		"                 queue | stack | sk_storage | struct_ops | ringbuf | inode_storage |\n"
> -		"                 task_storage | bloom_filter | user_ringbuf }\n"
> +		"                 task_storage | bloom_filter | user_ringbuf | cgroup_local_storage }\n"
>  		"       " HELP_SPEC_OPTIONS " |\n"
>  		"                 {-f|--bpffs} | {-n|--nomount} }\n"
>  		"",

Thanks for the bpftool update!

Acked-by: Quentin Monnet <quentin@isovalent.com>
* [PATCH bpf-next 5/5] selftests/bpf: Add selftests for cgroup local storage
  2022-10-14  4:56 [PATCH bpf-next 0/5] bpf: Implement cgroup local storage available to non-cgroup-attached bpf progs Yonghong Song
                   ` (3 preceding siblings ...)
  2022-10-14  4:56 ` [PATCH bpf-next 4/5] bpftool: " Yonghong Song
@ 2022-10-14  4:56 ` Yonghong Song
  4 siblings, 0 replies; 38+ messages in thread
From: Yonghong Song @ 2022-10-14  4:56 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	KP Singh, Martin KaFai Lau, Tejun Heo

Add two tests for cgroup local storage, one to test bpf program helpers
and user space map APIs, and the other to test recursive fentry
triggering won't deadlock.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 .../bpf/prog_tests/cgroup_local_storage.c     | 92 +++++++++++++++++++
 .../bpf/progs/cgroup_local_storage.c          | 88 ++++++++++++++++++
 .../selftests/bpf/progs/cgroup_ls_recursion.c | 70 ++++++++++++++
 3 files changed, 250 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/cgroup_local_storage.c
 create mode 100644 tools/testing/selftests/bpf/progs/cgroup_local_storage.c
 create mode 100644 tools/testing/selftests/bpf/progs/cgroup_ls_recursion.c

diff --git a/tools/testing/selftests/bpf/prog_tests/cgroup_local_storage.c b/tools/testing/selftests/bpf/prog_tests/cgroup_local_storage.c
new file mode 100644
index 000000000000..4fe8862d275c
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/cgroup_local_storage.c
@@ -0,0 +1,92 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021 Facebook */
+
+#define _GNU_SOURCE         /* See feature_test_macros(7) */
+#include <unistd.h>
+#include <sys/syscall.h>   /* For SYS_xxx definitions */
+#include <sys/types.h>
+#include <test_progs.h>
+#include "cgroup_local_storage.skel.h"
+#include "cgroup_ls_recursion.skel.h"
+
+static void test_sys_enter_exit(int cgroup_fd)
+{
+	struct cgroup_local_storage *skel;
+	long val1 = 1, val2 = 0;
+	int err;
+
+	skel = cgroup_local_storage__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_open_and_load"))
+		return;
+
+	/* populate a value in cg_storage_2 */
+	err = bpf_map_update_elem(bpf_map__fd(skel->maps.cg_storage_2), &cgroup_fd, &val1, BPF_ANY);
+	if (!ASSERT_OK(err, "map_update_elem"))
+		goto out;
+
+	/* check value */
+	err = bpf_map_lookup_elem(bpf_map__fd(skel->maps.cg_storage_2), &cgroup_fd, &val2);
+	if (!ASSERT_OK(err, "map_lookup_elem"))
+		goto out;
+	if (!ASSERT_EQ(val2, 1, "map_lookup_elem, invalid val"))
+		goto out;
+
+	/* delete value */
+	err = bpf_map_delete_elem(bpf_map__fd(skel->maps.cg_storage_2), &cgroup_fd);
+	if (!ASSERT_OK(err, "map_delete_elem"))
+		goto out;
+
+	skel->bss->target_pid = syscall(SYS_gettid);
+
+	err = cgroup_local_storage__attach(skel);
+	if (!ASSERT_OK(err, "skel_attach"))
+		goto out;
+
+	syscall(SYS_gettid);
+	syscall(SYS_gettid);
+
+	skel->bss->target_pid = 0;
+
+	/* 3x syscalls: 1x attach and 2x gettid */
+	ASSERT_EQ(skel->bss->enter_cnt, 3, "enter_cnt");
+	ASSERT_EQ(skel->bss->exit_cnt, 3, "exit_cnt");
+	ASSERT_EQ(skel->bss->mismatch_cnt, 0, "mismatch_cnt");
+out:
+	cgroup_local_storage__destroy(skel);
+}
+
+static void test_recursion(int cgroup_fd)
+{
+	struct cgroup_ls_recursion *skel;
+	int err;
+
+	skel = cgroup_ls_recursion__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_open_and_load"))
+		return;
+
+	err = cgroup_ls_recursion__attach(skel);
+	if (!ASSERT_OK(err, "skel_attach"))
+		goto out;
+
+	/* trigger sys_enter, make sure it does not cause deadlock */
+	syscall(SYS_gettid);
+
+out:
+	cgroup_ls_recursion__destroy(skel);
+}
+
+void test_cgroup_local_storage(void)
+{
+	int cgroup_fd;
+
+	cgroup_fd = test__join_cgroup("/cgroup_local_storage");
+	if (!ASSERT_GE(cgroup_fd, 0, "join_cgroup /cgroup_local_storage"))
+		return;
+
+	if (test__start_subtest("sys_enter_exit"))
+		test_sys_enter_exit(cgroup_fd);
+	if (test__start_subtest("recursion"))
+		test_recursion(cgroup_fd);
+
+	close(cgroup_fd);
+}
diff --git a/tools/testing/selftests/bpf/progs/cgroup_local_storage.c b/tools/testing/selftests/bpf/progs/cgroup_local_storage.c
new file mode 100644
index 000000000000..5098e99705c6
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/cgroup_local_storage.c
@@ -0,0 +1,88 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+char _license[] SEC("license") = "GPL";
+
+struct {
+	__uint(type, BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, int);
+	__type(value, long);
+} cg_storage_1 SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, int);
+	__type(value, long);
+} cg_storage_2 SEC(".maps");
+
+#define MAGIC_VALUE 0xabcd1234
+
+pid_t target_pid = 0;
+int mismatch_cnt = 0;
+int enter_cnt = 0;
+int exit_cnt = 0;
+
+SEC("tp_btf/sys_enter")
+int BPF_PROG(on_enter, struct pt_regs *regs, long id)
+{
+	struct task_struct *task;
+	long *ptr;
+	int err;
+
+	task = bpf_get_current_task_btf();
+	if (task->pid != target_pid)
+		return 0;
+
+	/* populate value 0 */
+	ptr = bpf_cgroup_local_storage_get(&cg_storage_1, task->cgroups->dfl_cgrp, 0,
+					   BPF_LOCAL_STORAGE_GET_F_CREATE);
+	if (!ptr)
+		return 0;
+
+	/* delete value 0 */
+	err = bpf_cgroup_local_storage_delete(&cg_storage_1, task->cgroups->dfl_cgrp);
+	if (err)
+		return 0;
+
+	/* value is not available */
+	ptr = bpf_cgroup_local_storage_get(&cg_storage_1, task->cgroups->dfl_cgrp, 0, 0);
+	if (ptr)
+		return 0;
+
+	/* re-populate the value */
+	ptr = bpf_cgroup_local_storage_get(&cg_storage_1, task->cgroups->dfl_cgrp, 0,
+					   BPF_LOCAL_STORAGE_GET_F_CREATE);
+	if (!ptr)
+		return 0;
+	__sync_fetch_and_add(&enter_cnt, 1);
+	*ptr = MAGIC_VALUE + enter_cnt;
+
+	return 0;
+}
+
+SEC("tp_btf/sys_exit")
+int BPF_PROG(on_exit, struct pt_regs *regs, long id)
+{
+	struct task_struct *task;
+	long *ptr;
+
+	task = bpf_get_current_task_btf();
+	if (task->pid != target_pid)
+		return 0;
+
+	ptr = bpf_cgroup_local_storage_get(&cg_storage_1, task->cgroups->dfl_cgrp, 0,
+					   BPF_LOCAL_STORAGE_GET_F_CREATE);
+	if (!ptr)
+		return 0;
+
+	__sync_fetch_and_add(&exit_cnt, 1);
+	if (*ptr != MAGIC_VALUE + exit_cnt)
+		__sync_fetch_and_add(&mismatch_cnt, 1);
+	return 0;
+}
diff --git a/tools/testing/selftests/bpf/progs/cgroup_ls_recursion.c b/tools/testing/selftests/bpf/progs/cgroup_ls_recursion.c
new file mode 100644
index 000000000000..862683b4cb1e
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/cgroup_ls_recursion.c
@@ -0,0 +1,70 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+char _license[] SEC("license") = "GPL";
+
+struct {
+	__uint(type, BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, int);
+	__type(value, long);
+} map_a SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, int);
+	__type(value, long);
+} map_b SEC(".maps");
+
+SEC("fentry/bpf_local_storage_lookup")
+int BPF_PROG(on_lookup)
+{
+	struct task_struct *task = bpf_get_current_task_btf();
+
+	bpf_cgroup_local_storage_delete(&map_a, task->cgroups->dfl_cgrp);
+	bpf_cgroup_local_storage_delete(&map_b, task->cgroups->dfl_cgrp);
+	return 0;
+}
+
+SEC("fentry/bpf_local_storage_update")
+int BPF_PROG(on_update)
+{
+	struct task_struct *task = bpf_get_current_task_btf();
+	long *ptr;
+
+	ptr = bpf_cgroup_local_storage_get(&map_a, task->cgroups->dfl_cgrp, 0,
+					   BPF_LOCAL_STORAGE_GET_F_CREATE);
+	if (ptr)
+		*ptr += 1;
+
+	ptr = bpf_cgroup_local_storage_get(&map_b, task->cgroups->dfl_cgrp, 0,
+					   BPF_LOCAL_STORAGE_GET_F_CREATE);
+	if (ptr)
+		*ptr += 1;
+
+	return 0;
+}
+
+SEC("tp_btf/sys_enter")
+int BPF_PROG(on_enter, struct pt_regs *regs, long id)
+{
+	struct task_struct *task;
+	long *ptr;
+
+	task = bpf_get_current_task_btf();
+	ptr = bpf_cgroup_local_storage_get(&map_a, task->cgroups->dfl_cgrp, 0,
+					   BPF_LOCAL_STORAGE_GET_F_CREATE);
+	if (ptr)
+		*ptr = 200;
+
+	ptr = bpf_cgroup_local_storage_get(&map_b, task->cgroups->dfl_cgrp, 0,
+					   BPF_LOCAL_STORAGE_GET_F_CREATE);
+	if (ptr)
+		*ptr = 100;
+	return 0;
+}
--
2.30.2
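The map operations in patch 2 (quoted earlier in the thread) rely on the kernel's pointer-encoded error convention: `cgroup_get_from_fd()` returns an `ERR_PTR` value that `IS_ERR()` detects and `PTR_ERR()`/`PTR_ERR_OR_ZERO()` decode. A user-space mock-up of that idea, lower-cased to make clear it is a sketch rather than `<linux/err.h>`:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MAX_ERRNO 4095

static int probe_target;	/* ordinary object: never lands in the error range */

/* Encode a negative errno in the top MAX_ERRNO values of the pointer range. */
static inline void *err_ptr(long err)
{
	return (void *)(uintptr_t)err;
}

/* True if ptr falls in the reserved error range. */
static inline int is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Recover the negative errno from an error pointer. */
static inline long ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* 0 for a valid pointer, the encoded errno otherwise. */
static inline long ptr_err_or_zero(const void *ptr)
{
	return is_err(ptr) ? ptr_err(ptr) : 0;
}
```

The convention lets one return value carry either a valid pointer or an errno, which is why `bpf_cgroup_storage_update_elem()` can collapse its exit path to a single `PTR_ERR_OR_ZERO(sdata)` as suggested in the review.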
end of thread, other threads:[~2022-10-18 23:12 UTC | newest]

Thread overview: 38+ messages
2022-10-14  4:56 [PATCH bpf-next 0/5] bpf: Implement cgroup local storage available to non-cgroup-attached bpf progs Yonghong Song
2022-10-14  4:56 ` [PATCH bpf-next 1/5] bpf: Make struct cgroup btf id global Yonghong Song
2022-10-14  4:56 ` [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs Yonghong Song
2022-10-17 18:01 ` sdf
2022-10-17 18:25 ` Yosry Ahmed
2022-10-17 18:43 ` Stanislav Fomichev
2022-10-17 18:47 ` Yosry Ahmed
2022-10-17 19:07 ` Stanislav Fomichev
2022-10-17 19:11 ` Yosry Ahmed
2022-10-17 19:26 ` Tejun Heo
2022-10-17 21:07 ` Martin KaFai Lau
2022-10-17 21:23 ` Yosry Ahmed
2022-10-17 23:55 ` Martin KaFai Lau
2022-10-18  0:47 ` Yosry Ahmed
2022-10-17 22:16 ` sdf
2022-10-18  0:52 ` Martin KaFai Lau
2022-10-18  5:59 ` Yonghong Song
2022-10-18 17:08 ` sdf
2022-10-18 17:17 ` Alexei Starovoitov
2022-10-18 18:08 ` Martin KaFai Lau
2022-10-18 18:11 ` Yosry Ahmed
2022-10-18 18:26 ` Yonghong Song
2022-10-18 23:12 ` Andrii Nakryiko
2022-10-17 20:15 ` Yonghong Song
2022-10-17 20:18 ` Yosry Ahmed
2022-10-17 20:13 ` Yonghong Song
2022-10-17 20:10 ` Yonghong Song
2022-10-17 20:14 ` Yosry Ahmed
2022-10-17 20:29 ` Yonghong Song
2022-10-17 19:23 ` Yonghong Song
2022-10-17 21:03 ` Stanislav Fomichev
2022-10-17 22:26 ` Martin KaFai Lau
2022-10-17 18:16 ` David Vernet
2022-10-17 19:45 ` Yonghong Song
2022-10-14  4:56 ` [PATCH bpf-next 3/5] libbpf: Support new cgroup local storage Yonghong Song
2022-10-14  4:56 ` [PATCH bpf-next 4/5] bpftool: " Yonghong Song
2022-10-17 10:26 ` Quentin Monnet
2022-10-14  4:56 ` [PATCH bpf-next 5/5] selftests/bpf: Add selftests for " Yonghong Song