* [PATCH bpf-next 0/5] bpf: Implement cgroup local storage available to non-cgroup-attached bpf progs
@ 2022-10-14 4:56 Yonghong Song
2022-10-14 4:56 ` [PATCH bpf-next 1/5] bpf: Make struct cgroup btf id global Yonghong Song
` (4 more replies)
0 siblings, 5 replies; 38+ messages in thread
From: Yonghong Song @ 2022-10-14 4:56 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
KP Singh, Martin KaFai Lau, Tejun Heo
There already exists a local storage implementation for cgroup-attached
bpf programs; see map type BPF_MAP_TYPE_CGROUP_STORAGE and helper
bpf_get_local_storage(). But there are use cases where non-cgroup-attached
bpf progs want to access cgroup local storage data. For example, a tc egress
prog has access to both the sk and the cgroup. It is possible to emulate
cgroup local storage with sk local storage by storing the data in each socket,
but this is wasteful since many sockets may belong to the same cgroup.
Alternatively, a separate map can be created with the cgroup id as the key,
but this introduces additional overhead to manipulate the new map.
A cgroup local storage, similar to the existing sk/inode/task storage,
serves this use case.
This patch set implements a new cgroup local storage available to
non-cgroup-attached bpf programs. In the patch series, Patch 1
is a preparation patch. Patch 2 implements the new cgroup local storage
kernel support. Patches 3 and 4 implement libbpf and bpftool support.
Patch 5 adds two tests to validate the kernel/libbpf implementations.
Yonghong Song (5):
bpf: Make struct cgroup btf id global
bpf: Implement cgroup storage available to non-cgroup-attached bpf
progs
libbpf: Support new cgroup local storage
bpftool: Support new cgroup local storage
selftests/bpf: Add selftests for cgroup local storage
include/linux/bpf.h | 3 +
include/linux/bpf_types.h | 1 +
include/linux/btf_ids.h | 1 +
include/linux/cgroup-defs.h | 4 +
include/uapi/linux/bpf.h | 39 +++
kernel/bpf/Makefile | 2 +-
kernel/bpf/bpf_cgroup_storage.c | 280 ++++++++++++++++++
kernel/bpf/cgroup_iter.c | 2 +-
kernel/bpf/helpers.c | 6 +
kernel/bpf/syscall.c | 3 +-
kernel/bpf/verifier.c | 14 +-
kernel/cgroup/cgroup.c | 4 +
kernel/trace/bpf_trace.c | 4 +
scripts/bpf_doc.py | 2 +
.../bpf/bpftool/Documentation/bpftool-map.rst | 2 +-
tools/bpf/bpftool/map.c | 2 +-
tools/include/uapi/linux/bpf.h | 39 +++
tools/lib/bpf/libbpf.c | 1 +
tools/lib/bpf/libbpf_probes.c | 1 +
.../bpf/prog_tests/cgroup_local_storage.c | 92 ++++++
.../bpf/progs/cgroup_local_storage.c | 88 ++++++
.../selftests/bpf/progs/cgroup_ls_recursion.c | 70 +++++
22 files changed, 654 insertions(+), 6 deletions(-)
create mode 100644 kernel/bpf/bpf_cgroup_storage.c
create mode 100644 tools/testing/selftests/bpf/prog_tests/cgroup_local_storage.c
create mode 100644 tools/testing/selftests/bpf/progs/cgroup_local_storage.c
create mode 100644 tools/testing/selftests/bpf/progs/cgroup_ls_recursion.c
--
2.30.2
^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH bpf-next 1/5] bpf: Make struct cgroup btf id global
  2022-10-14  4:56 [PATCH bpf-next 0/5] bpf: Implement cgroup local storage available to non-cgroup-attached bpf progs Yonghong Song
@ 2022-10-14  4:56 ` Yonghong Song
  2022-10-14  4:56 ` [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs Yonghong Song
  ` (3 subsequent siblings)
  4 siblings, 0 replies; 38+ messages in thread
From: Yonghong Song @ 2022-10-14  4:56 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	KP Singh, Martin KaFai Lau, Tejun Heo

Make struct cgroup btf id global so later patch can reuse
the same btf id.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 include/linux/btf_ids.h  | 1 +
 kernel/bpf/cgroup_iter.c | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/btf_ids.h b/include/linux/btf_ids.h
index 2aea877d644f..c9744efd202f 100644
--- a/include/linux/btf_ids.h
+++ b/include/linux/btf_ids.h
@@ -265,5 +265,6 @@ MAX_BTF_TRACING_TYPE,
 };

 extern u32 btf_tracing_ids[];
+extern u32 bpf_cgroup_btf_id[];

 #endif
diff --git a/kernel/bpf/cgroup_iter.c b/kernel/bpf/cgroup_iter.c
index 0d200a993489..c6ffc706d583 100644
--- a/kernel/bpf/cgroup_iter.c
+++ b/kernel/bpf/cgroup_iter.c
@@ -157,7 +157,7 @@ static const struct seq_operations cgroup_iter_seq_ops = {
 	.show  = cgroup_iter_seq_show,
 };

-BTF_ID_LIST_SINGLE(bpf_cgroup_btf_id, struct, cgroup)
+BTF_ID_LIST_GLOBAL_SINGLE(bpf_cgroup_btf_id, struct, cgroup)

 static int cgroup_iter_seq_init(void *priv, struct bpf_iter_aux_info *aux)
 {
--
2.30.2

^ permalink raw reply related	[flat|nested] 38+ messages in thread
* [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs
  2022-10-14  4:56 [PATCH bpf-next 0/5] bpf: Implement cgroup local storage available to non-cgroup-attached bpf progs Yonghong Song
  2022-10-14  4:56 ` [PATCH bpf-next 1/5] bpf: Make struct cgroup btf id global Yonghong Song
@ 2022-10-14  4:56 ` Yonghong Song
  2022-10-17 18:01   ` sdf
  2022-10-17 18:16   ` David Vernet
  2022-10-14  4:56 ` [PATCH bpf-next 3/5] libbpf: Support new cgroup local storage Yonghong Song
  ` (2 subsequent siblings)
  4 siblings, 2 replies; 38+ messages in thread
From: Yonghong Song @ 2022-10-14  4:56 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	KP Singh, Martin KaFai Lau, Tejun Heo

Similar to the existing sk/inode/task storage, implement a cgroup
local storage.

There already exists a local storage implementation for cgroup-attached
bpf programs; see map type BPF_MAP_TYPE_CGROUP_STORAGE and helper
bpf_get_local_storage(). But there are use cases where non-cgroup-attached
bpf progs want to access cgroup local storage data. For example, a tc egress
prog has access to both the sk and the cgroup. It is possible to emulate
cgroup local storage with sk local storage by storing the data in each
socket, but this is wasteful since many sockets may belong to the same
cgroup. Alternatively, a separate map can be created with the cgroup id as
the key, but this introduces additional overhead to manipulate the new map.
A cgroup local storage, similar to the existing sk/inode/task storage,
serves this use case.

The life-cycle of the storage is tied to the life-cycle of the cgroup
struct, i.e. the storage is destroyed along with the owning cgroup, via a
callback to bpf_cgroup_storage_free when the cgroup itself is deleted.

The userspace map operations can be done by using a cgroup fd as a key
passed to the lookup, update and delete operations.
Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old cgroup local storage support, the new map name BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE is used for cgroup storage available to non-cgroup-attached bpf programs. The two helpers are named as bpf_cgroup_local_storage_get() and bpf_cgroup_local_storage_delete(). Signed-off-by: Yonghong Song <yhs@fb.com> --- include/linux/bpf.h | 3 + include/linux/bpf_types.h | 1 + include/linux/cgroup-defs.h | 4 + include/uapi/linux/bpf.h | 39 +++++ kernel/bpf/Makefile | 2 +- kernel/bpf/bpf_cgroup_storage.c | 280 ++++++++++++++++++++++++++++++++ kernel/bpf/helpers.c | 6 + kernel/bpf/syscall.c | 3 +- kernel/bpf/verifier.c | 14 +- kernel/cgroup/cgroup.c | 4 + kernel/trace/bpf_trace.c | 4 + scripts/bpf_doc.py | 2 + tools/include/uapi/linux/bpf.h | 39 +++++ 13 files changed, 398 insertions(+), 3 deletions(-) create mode 100644 kernel/bpf/bpf_cgroup_storage.c diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 9e7d46d16032..1395a01c7f18 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -2045,6 +2045,7 @@ struct bpf_link *bpf_link_by_id(u32 id); const struct bpf_func_proto *bpf_base_func_proto(enum bpf_func_id func_id); void bpf_task_storage_free(struct task_struct *task); +void bpf_local_cgroup_storage_free(struct cgroup *cgroup); bool bpf_prog_has_kfunc_call(const struct bpf_prog *prog); const struct btf_func_model * bpf_jit_find_kfunc_model(const struct bpf_prog *prog, @@ -2537,6 +2538,8 @@ extern const struct bpf_func_proto bpf_copy_from_user_task_proto; extern const struct bpf_func_proto bpf_set_retval_proto; extern const struct bpf_func_proto bpf_get_retval_proto; extern const struct bpf_func_proto bpf_user_ringbuf_drain_proto; +extern const struct bpf_func_proto bpf_cgroup_storage_get_proto; +extern const struct bpf_func_proto bpf_cgroup_storage_delete_proto; const struct bpf_func_proto *tracing_prog_func_proto( enum bpf_func_id func_id, const struct bpf_prog *prog); diff --git a/include/linux/bpf_types.h 
b/include/linux/bpf_types.h index 2c6a4f2562a7..7a0362d7a0aa 100644 --- a/include/linux/bpf_types.h +++ b/include/linux/bpf_types.h @@ -90,6 +90,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_ARRAY, cgroup_array_map_ops) #ifdef CONFIG_CGROUP_BPF BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_STORAGE, cgroup_storage_map_ops) BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE, cgroup_storage_map_ops) +BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, cgroup_local_storage_map_ops) #endif BPF_MAP_TYPE(BPF_MAP_TYPE_HASH, htab_map_ops) BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_HASH, htab_percpu_map_ops) diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index 4bcf56b3491c..c6f4590dda68 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -504,6 +504,10 @@ struct cgroup { /* Used to store internal freezer state */ struct cgroup_freezer_state freezer; +#ifdef CONFIG_BPF_SYSCALL + struct bpf_local_storage __rcu *bpf_cgroup_storage; +#endif + /* ids of the ancestors at each level including self */ u64 ancestor_ids[]; }; diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 17f61338f8f8..d918b4054297 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -935,6 +935,7 @@ enum bpf_map_type { BPF_MAP_TYPE_TASK_STORAGE, BPF_MAP_TYPE_BLOOM_FILTER, BPF_MAP_TYPE_USER_RINGBUF, + BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, }; /* Note that tracing related programs such as @@ -5435,6 +5436,42 @@ union bpf_attr { * **-E2BIG** if user-space has tried to publish a sample which is * larger than the size of the ring buffer, or which cannot fit * within a struct bpf_dynptr. + * + * void *bpf_cgroup_local_storage_get(struct bpf_map *map, struct cgroup *cgroup, void *value, u64 flags) + * Description + * Get a bpf_local_storage from the *cgroup*. + * + * Logically, it could be thought of as getting the value from + * a *map* with *cgroup* as the **key**. 
From this + * perspective, the usage is not much different from + * **bpf_map_lookup_elem**\ (*map*, **&**\ *cgroup*) except this + * helper enforces the key must be a cgroup struct and the map must also + * be a **BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE**. + * + * Underneath, the value is stored locally at *cgroup* instead of + * the *map*. The *map* is used as the bpf-local-storage + * "type". The bpf-local-storage "type" (i.e. the *map*) is + * searched against all bpf_local_storage residing at *cgroup*. + * + * An optional *flags* (**BPF_LOCAL_STORAGE_GET_F_CREATE**) can be + * used such that a new bpf_local_storage will be + * created if one does not exist. *value* can be used + * together with **BPF_LOCAL_STORAGE_GET_F_CREATE** to specify + * the initial value of a bpf_local_storage. If *value* is + * **NULL**, the new bpf_local_storage will be zero initialized. + * Return + * A bpf_local_storage pointer is returned on success. + * + * **NULL** if not found or there was an error in adding + * a new bpf_local_storage. + * + * long bpf_cgroup_local_storage_delete(struct bpf_map *map, struct cgroup *cgroup) + * Description + * Delete a bpf_local_storage from a *cgroup*. + * Return + * 0 on success. + * + * **-ENOENT** if the bpf_local_storage cannot be found. */ #define ___BPF_FUNC_MAPPER(FN, ctx...) 
\ FN(unspec, 0, ##ctx) \ @@ -5647,6 +5684,8 @@ union bpf_attr { FN(tcp_raw_check_syncookie_ipv6, 207, ##ctx) \ FN(ktime_get_tai_ns, 208, ##ctx) \ FN(user_ringbuf_drain, 209, ##ctx) \ + FN(cgroup_local_storage_get, 210, ##ctx) \ + FN(cgroup_local_storage_delete, 211, ##ctx) \ /* */ /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that don't diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile index 341c94f208f4..b02693f51978 100644 --- a/kernel/bpf/Makefile +++ b/kernel/bpf/Makefile @@ -25,7 +25,7 @@ ifeq ($(CONFIG_PERF_EVENTS),y) obj-$(CONFIG_BPF_SYSCALL) += stackmap.o endif ifeq ($(CONFIG_CGROUPS),y) -obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o +obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o bpf_cgroup_storage.o endif obj-$(CONFIG_CGROUP_BPF) += cgroup.o ifeq ($(CONFIG_INET),y) diff --git a/kernel/bpf/bpf_cgroup_storage.c b/kernel/bpf/bpf_cgroup_storage.c new file mode 100644 index 000000000000..9974784822da --- /dev/null +++ b/kernel/bpf/bpf_cgroup_storage.c @@ -0,0 +1,280 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (c) 2022 Meta Platforms, Inc. and affiliates. 
+ */ + +#include <linux/types.h> +#include <linux/bpf.h> +#include <linux/bpf_local_storage.h> +#include <uapi/linux/btf.h> +#include <linux/btf_ids.h> + +DEFINE_BPF_STORAGE_CACHE(cgroup_cache); + +static DEFINE_PER_CPU(int, bpf_cgroup_storage_busy); + +static void bpf_cgroup_storage_lock(void) +{ + migrate_disable(); + this_cpu_inc(bpf_cgroup_storage_busy); +} + +static void bpf_cgroup_storage_unlock(void) +{ + this_cpu_dec(bpf_cgroup_storage_busy); + migrate_enable(); +} + +static bool bpf_cgroup_storage_trylock(void) +{ + migrate_disable(); + if (unlikely(this_cpu_inc_return(bpf_cgroup_storage_busy) != 1)) { + this_cpu_dec(bpf_cgroup_storage_busy); + migrate_enable(); + return false; + } + return true; +} + +static struct bpf_local_storage __rcu **cgroup_storage_ptr(void *owner) +{ + struct cgroup *cg = owner; + + return &cg->bpf_cgroup_storage; +} + +void bpf_local_cgroup_storage_free(struct cgroup *cgroup) +{ + struct bpf_local_storage *local_storage; + struct bpf_local_storage_elem *selem; + bool free_cgroup_storage = false; + struct hlist_node *n; + unsigned long flags; + + rcu_read_lock(); + local_storage = rcu_dereference(cgroup->bpf_cgroup_storage); + if (!local_storage) { + rcu_read_unlock(); + return; + } + + /* Neither the bpf_prog nor the bpf-map's syscall + * could be modifying the local_storage->list now. + * Thus, no elem can be added-to or deleted-from the + * local_storage->list by the bpf_prog or by the bpf-map's syscall. + * + * It is racing with bpf_local_storage_map_free() alone + * when unlinking elem from the local_storage->list and + * the map's bucket->list. 
+ */ + bpf_cgroup_storage_lock(); + raw_spin_lock_irqsave(&local_storage->lock, flags); + hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) { + bpf_selem_unlink_map(selem); + free_cgroup_storage = + bpf_selem_unlink_storage_nolock(local_storage, selem, false, false); + } + raw_spin_unlock_irqrestore(&local_storage->lock, flags); + bpf_cgroup_storage_unlock(); + rcu_read_unlock(); + + /* free_cgroup_storage should always be true as long as + * local_storage->list was non-empty. + */ + if (free_cgroup_storage) + kfree_rcu(local_storage, rcu); +} + +static struct bpf_local_storage_data * +cgroup_storage_lookup(struct cgroup *cgroup, struct bpf_map *map, bool cacheit_lockit) +{ + struct bpf_local_storage *cgroup_storage; + struct bpf_local_storage_map *smap; + + cgroup_storage = rcu_dereference_check(cgroup->bpf_cgroup_storage, + bpf_rcu_lock_held()); + if (!cgroup_storage) + return NULL; + + smap = (struct bpf_local_storage_map *)map; + return bpf_local_storage_lookup(cgroup_storage, smap, cacheit_lockit); +} + +static void *bpf_cgroup_storage_lookup_elem(struct bpf_map *map, void *key) +{ + struct bpf_local_storage_data *sdata; + struct cgroup *cgroup; + int fd; + + fd = *(int *)key; + cgroup = cgroup_get_from_fd(fd); + if (IS_ERR(cgroup)) + return ERR_CAST(cgroup); + + bpf_cgroup_storage_lock(); + sdata = cgroup_storage_lookup(cgroup, map, true); + bpf_cgroup_storage_unlock(); + cgroup_put(cgroup); + return sdata ? 
sdata->data : NULL; +} + +static int bpf_cgroup_storage_update_elem(struct bpf_map *map, void *key, + void *value, u64 map_flags) +{ + struct bpf_local_storage_data *sdata; + struct cgroup *cgroup; + int err, fd; + + fd = *(int *)key; + cgroup = cgroup_get_from_fd(fd); + if (IS_ERR(cgroup)) + return PTR_ERR(cgroup); + + bpf_cgroup_storage_lock(); + sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map *)map, + value, map_flags, GFP_ATOMIC); + bpf_cgroup_storage_unlock(); + err = PTR_ERR_OR_ZERO(sdata); + cgroup_put(cgroup); + return err; +} + +static int cgroup_storage_delete(struct cgroup *cgroup, struct bpf_map *map) +{ + struct bpf_local_storage_data *sdata; + + sdata = cgroup_storage_lookup(cgroup, map, false); + if (!sdata) + return -ENOENT; + + bpf_selem_unlink(SELEM(sdata), true); + return 0; +} + +static int bpf_cgroup_storage_delete_elem(struct bpf_map *map, void *key) +{ + struct cgroup *cgroup; + int err, fd; + + fd = *(int *)key; + cgroup = cgroup_get_from_fd(fd); + if (IS_ERR(cgroup)) + return PTR_ERR(cgroup); + + bpf_cgroup_storage_lock(); + err = cgroup_storage_delete(cgroup, map); + bpf_cgroup_storage_unlock(); + if (err) + return err; + + cgroup_put(cgroup); + return 0; +} + +static int notsupp_get_next_key(struct bpf_map *map, void *key, void *next_key) +{ + return -ENOTSUPP; +} + +static struct bpf_map *cgroup_storage_map_alloc(union bpf_attr *attr) +{ + struct bpf_local_storage_map *smap; + + smap = bpf_local_storage_map_alloc(attr); + if (IS_ERR(smap)) + return ERR_CAST(smap); + + smap->cache_idx = bpf_local_storage_cache_idx_get(&cgroup_cache); + return &smap->map; +} + +static void cgroup_storage_map_free(struct bpf_map *map) +{ + struct bpf_local_storage_map *smap; + + smap = (struct bpf_local_storage_map *)map; + bpf_local_storage_cache_idx_free(&cgroup_cache, smap->cache_idx); + bpf_local_storage_map_free(smap, NULL); +} + +/* *gfp_flags* is a hidden argument provided by the verifier */ +BPF_CALL_5(bpf_cgroup_storage_get, 
struct bpf_map *, map, struct cgroup *, cgroup, + void *, value, u64, flags, gfp_t, gfp_flags) +{ + struct bpf_local_storage_data *sdata; + + WARN_ON_ONCE(!bpf_rcu_lock_held()); + if (flags & ~(BPF_LOCAL_STORAGE_GET_F_CREATE)) + return (unsigned long)NULL; + + if (!cgroup) + return (unsigned long)NULL; + + if (!bpf_cgroup_storage_trylock()) + return (unsigned long)NULL; + + sdata = cgroup_storage_lookup(cgroup, map, true); + if (sdata) + goto unlock; + + /* only allocate new storage, when the cgroup is refcounted */ + if (!percpu_ref_is_dying(&cgroup->self.refcnt) && + (flags & BPF_LOCAL_STORAGE_GET_F_CREATE)) + sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map *)map, + value, BPF_NOEXIST, gfp_flags); + +unlock: + bpf_cgroup_storage_unlock(); + return IS_ERR_OR_NULL(sdata) ? (unsigned long)NULL : (unsigned long)sdata->data; +} + +BPF_CALL_2(bpf_cgroup_storage_delete, struct bpf_map *, map, struct cgroup *, cgroup) +{ + int ret; + + WARN_ON_ONCE(!bpf_rcu_lock_held()); + if (!cgroup) + return -EINVAL; + + if (!bpf_cgroup_storage_trylock()) + return -EBUSY; + + ret = cgroup_storage_delete(cgroup, map); + bpf_cgroup_storage_unlock(); + return ret; +} + +BTF_ID_LIST_SINGLE(cgroup_storage_map_btf_ids, struct, bpf_local_storage_map) +const struct bpf_map_ops cgroup_local_storage_map_ops = { + .map_meta_equal = bpf_map_meta_equal, + .map_alloc_check = bpf_local_storage_map_alloc_check, + .map_alloc = cgroup_storage_map_alloc, + .map_free = cgroup_storage_map_free, + .map_get_next_key = notsupp_get_next_key, + .map_lookup_elem = bpf_cgroup_storage_lookup_elem, + .map_update_elem = bpf_cgroup_storage_update_elem, + .map_delete_elem = bpf_cgroup_storage_delete_elem, + .map_check_btf = bpf_local_storage_map_check_btf, + .map_btf_id = &cgroup_storage_map_btf_ids[0], + .map_owner_storage_ptr = cgroup_storage_ptr, +}; + +const struct bpf_func_proto bpf_cgroup_storage_get_proto = { + .func = bpf_cgroup_storage_get, + .gpl_only = false, + .ret_type = 
RET_PTR_TO_MAP_VALUE_OR_NULL, + .arg1_type = ARG_CONST_MAP_PTR, + .arg2_type = ARG_PTR_TO_BTF_ID, + .arg2_btf_id = &bpf_cgroup_btf_id[0], + .arg3_type = ARG_PTR_TO_MAP_VALUE_OR_NULL, + .arg4_type = ARG_ANYTHING, +}; + +const struct bpf_func_proto bpf_cgroup_storage_delete_proto = { + .func = bpf_cgroup_storage_delete, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_CONST_MAP_PTR, + .arg2_type = ARG_PTR_TO_BTF_ID, + .arg2_btf_id = &bpf_cgroup_btf_id[0], +}; diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index a6b04faed282..5c5bb08832ec 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -1663,6 +1663,12 @@ bpf_base_func_proto(enum bpf_func_id func_id) return &bpf_dynptr_write_proto; case BPF_FUNC_dynptr_data: return &bpf_dynptr_data_proto; +#ifdef CONFIG_CGROUPS + case BPF_FUNC_cgroup_local_storage_get: + return &bpf_cgroup_storage_get_proto; + case BPF_FUNC_cgroup_local_storage_delete: + return &bpf_cgroup_storage_delete_proto; +#endif default: break; } diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 7b373a5e861f..e53c7fae6e22 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -1016,7 +1016,8 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf, map->map_type != BPF_MAP_TYPE_CGROUP_STORAGE && map->map_type != BPF_MAP_TYPE_SK_STORAGE && map->map_type != BPF_MAP_TYPE_INODE_STORAGE && - map->map_type != BPF_MAP_TYPE_TASK_STORAGE) + map->map_type != BPF_MAP_TYPE_TASK_STORAGE && + map->map_type != BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE) return -ENOTSUPP; if (map->spin_lock_off + sizeof(struct bpf_spin_lock) > map->value_size) { diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 6f6d2d511c06..f36f6a3c0d50 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -6360,6 +6360,11 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env, func_id != BPF_FUNC_task_storage_delete) goto error; break; + case BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE: + if (func_id != 
BPF_FUNC_cgroup_local_storage_get && + func_id != BPF_FUNC_cgroup_local_storage_delete) + goto error; + break; case BPF_MAP_TYPE_BLOOM_FILTER: if (func_id != BPF_FUNC_map_peek_elem && func_id != BPF_FUNC_map_push_elem) @@ -6472,6 +6477,11 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env, if (map->map_type != BPF_MAP_TYPE_TASK_STORAGE) goto error; break; + case BPF_FUNC_cgroup_local_storage_get: + case BPF_FUNC_cgroup_local_storage_delete: + if (map->map_type != BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE) + goto error; + break; default: break; } @@ -12713,6 +12723,7 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env, case BPF_MAP_TYPE_INODE_STORAGE: case BPF_MAP_TYPE_SK_STORAGE: case BPF_MAP_TYPE_TASK_STORAGE: + case BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE: break; default: verbose(env, @@ -14149,7 +14160,8 @@ static int do_misc_fixups(struct bpf_verifier_env *env) if (insn->imm == BPF_FUNC_task_storage_get || insn->imm == BPF_FUNC_sk_storage_get || - insn->imm == BPF_FUNC_inode_storage_get) { + insn->imm == BPF_FUNC_inode_storage_get || + insn->imm == BPF_FUNC_cgroup_local_storage_get) { if (env->prog->aux->sleepable) insn_buf[0] = BPF_MOV64_IMM(BPF_REG_5, (__force __s32)GFP_KERNEL); else diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 8ad2c267ff47..2fa2c950c7fb 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -985,6 +985,10 @@ void put_css_set_locked(struct css_set *cset) put_css_set_locked(cset->dom_cset); } +#ifdef CONFIG_BPF_SYSCALL + bpf_local_cgroup_storage_free(cset->dfl_cgrp); +#endif + kfree_rcu(cset, rcu_head); } diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index 688552df95ca..179adaae4a9f 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -1454,6 +1454,10 @@ bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return &bpf_get_current_cgroup_id_proto; case BPF_FUNC_get_current_ancestor_cgroup_id: return 
&bpf_get_current_ancestor_cgroup_id_proto; + case BPF_FUNC_cgroup_local_storage_get: + return &bpf_cgroup_storage_get_proto; + case BPF_FUNC_cgroup_local_storage_delete: + return &bpf_cgroup_storage_delete_proto; #endif case BPF_FUNC_send_signal: return &bpf_send_signal_proto; diff --git a/scripts/bpf_doc.py b/scripts/bpf_doc.py index c0e6690be82a..fdb0aff8cb5a 100755 --- a/scripts/bpf_doc.py +++ b/scripts/bpf_doc.py @@ -685,6 +685,7 @@ class PrinterHelpers(Printer): 'struct udp6_sock', 'struct unix_sock', 'struct task_struct', + 'struct cgroup', 'struct __sk_buff', 'struct sk_msg_md', @@ -742,6 +743,7 @@ class PrinterHelpers(Printer): 'struct udp6_sock', 'struct unix_sock', 'struct task_struct', + 'struct cgroup', 'struct path', 'struct btf_ptr', 'struct inode', diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 17f61338f8f8..d918b4054297 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -935,6 +935,7 @@ enum bpf_map_type { BPF_MAP_TYPE_TASK_STORAGE, BPF_MAP_TYPE_BLOOM_FILTER, BPF_MAP_TYPE_USER_RINGBUF, + BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, }; /* Note that tracing related programs such as @@ -5435,6 +5436,42 @@ union bpf_attr { * **-E2BIG** if user-space has tried to publish a sample which is * larger than the size of the ring buffer, or which cannot fit * within a struct bpf_dynptr. + * + * void *bpf_cgroup_local_storage_get(struct bpf_map *map, struct cgroup *cgroup, void *value, u64 flags) + * Description + * Get a bpf_local_storage from the *cgroup*. + * + * Logically, it could be thought of as getting the value from + * a *map* with *cgroup* as the **key**. From this + * perspective, the usage is not much different from + * **bpf_map_lookup_elem**\ (*map*, **&**\ *cgroup*) except this + * helper enforces the key must be a cgroup struct and the map must also + * be a **BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE**. + * + * Underneath, the value is stored locally at *cgroup* instead of + * the *map*. 
The *map* is used as the bpf-local-storage + * "type". The bpf-local-storage "type" (i.e. the *map*) is + * searched against all bpf_local_storage residing at *cgroup*. + * + * An optional *flags* (**BPF_LOCAL_STORAGE_GET_F_CREATE**) can be + * used such that a new bpf_local_storage will be + * created if one does not exist. *value* can be used + * together with **BPF_LOCAL_STORAGE_GET_F_CREATE** to specify + * the initial value of a bpf_local_storage. If *value* is + * **NULL**, the new bpf_local_storage will be zero initialized. + * Return + * A bpf_local_storage pointer is returned on success. + * + * **NULL** if not found or there was an error in adding + * a new bpf_local_storage. + * + * long bpf_cgroup_local_storage_delete(struct bpf_map *map, struct cgroup *cgroup) + * Description + * Delete a bpf_local_storage from a *cgroup*. + * Return + * 0 on success. + * + * **-ENOENT** if the bpf_local_storage cannot be found. */ #define ___BPF_FUNC_MAPPER(FN, ctx...) \ FN(unspec, 0, ##ctx) \ @@ -5647,6 +5684,8 @@ union bpf_attr { FN(tcp_raw_check_syncookie_ipv6, 207, ##ctx) \ FN(ktime_get_tai_ns, 208, ##ctx) \ FN(user_ringbuf_drain, 209, ##ctx) \ + FN(cgroup_local_storage_get, 210, ##ctx) \ + FN(cgroup_local_storage_delete, 211, ##ctx) \ /* */ /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that don't -- 2.30.2 ^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs
  2022-10-14  4:56 ` [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs Yonghong Song
@ 2022-10-17 18:01   ` sdf
  2022-10-17 18:25     ` Yosry Ahmed
  ` (2 more replies)
  2022-10-17 18:16   ` David Vernet
  1 sibling, 3 replies; 38+ messages in thread
From: sdf @ 2022-10-17 18:01 UTC (permalink / raw)
  To: Yonghong Song
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo

On 10/13, Yonghong Song wrote:
> Similar to sk/inode/task storage, implement similar cgroup local storage.
> There already exists a local storage implementation for cgroup-attached
> bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper
> bpf_get_local_storage(). But there are use cases such that non-cgroup
> attached bpf progs wants to access cgroup local storage data. For example,
> tc egress prog has access to sk and cgroup. It is possible to use
> sk local storage to emulate cgroup local storage by storing data in
> socket.
> But this is a waste as it could be lots of sockets belonging to a
> particular
> cgroup. Alternatively, a separate map can be created with cgroup id as
> the key.
> But this will introduce additional overhead to manipulate the new map.
> A cgroup local storage, similar to existing sk/inode/task storage,
> should help for this use case.
> The life-cycle of storage is managed with the life-cycle of the
> cgroup struct. i.e. the storage is destroyed along with the owning cgroup
> with a callback to the bpf_cgroup_storage_free when cgroup itself
> is deleted.
> The userspace map operations can be done by using a cgroup fd as a key
> passed to the lookup, update and delete operations.

[..]
> Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old cgroup > local > storage support, the new map name BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE is > used > for cgroup storage available to non-cgroup-attached bpf programs. The two > helpers are named as bpf_cgroup_local_storage_get() and > bpf_cgroup_local_storage_delete(). Have you considered doing something similar to 7d9c3427894f ("bpf: Make cgroup storages shared between programs on the same cgroup") where the map changes its behavior depending on the key size (see key_size checks in cgroup_storage_map_alloc)? Looks like sizeof(int) for fd still can be used so we can, in theory, reuse the name.. Pros: - no need for a new map name Cons: - existing BPF_MAP_TYPE_CGROUP_STORAGE is already messy; might be not a good idea to add more stuff to it? But, for the very least, should we also extend Documentation/bpf/map_cgroup_storage.rst to cover the new map? We've tried to keep some of the important details in there.. > Signed-off-by: Yonghong Song <yhs@fb.com> > --- > include/linux/bpf.h | 3 + > include/linux/bpf_types.h | 1 + > include/linux/cgroup-defs.h | 4 + > include/uapi/linux/bpf.h | 39 +++++ > kernel/bpf/Makefile | 2 +- > kernel/bpf/bpf_cgroup_storage.c | 280 ++++++++++++++++++++++++++++++++ > kernel/bpf/helpers.c | 6 + > kernel/bpf/syscall.c | 3 +- > kernel/bpf/verifier.c | 14 +- > kernel/cgroup/cgroup.c | 4 + > kernel/trace/bpf_trace.c | 4 + > scripts/bpf_doc.py | 2 + > tools/include/uapi/linux/bpf.h | 39 +++++ > 13 files changed, 398 insertions(+), 3 deletions(-) > create mode 100644 kernel/bpf/bpf_cgroup_storage.c > diff --git a/include/linux/bpf.h b/include/linux/bpf.h > index 9e7d46d16032..1395a01c7f18 100644 > --- a/include/linux/bpf.h > +++ b/include/linux/bpf.h > @@ -2045,6 +2045,7 @@ struct bpf_link *bpf_link_by_id(u32 id); > const struct bpf_func_proto *bpf_base_func_proto(enum bpf_func_id > func_id); > void bpf_task_storage_free(struct task_struct *task); > +void 
bpf_local_cgroup_storage_free(struct cgroup *cgroup); > bool bpf_prog_has_kfunc_call(const struct bpf_prog *prog); > const struct btf_func_model * > bpf_jit_find_kfunc_model(const struct bpf_prog *prog, > @@ -2537,6 +2538,8 @@ extern const struct bpf_func_proto > bpf_copy_from_user_task_proto; > extern const struct bpf_func_proto bpf_set_retval_proto; > extern const struct bpf_func_proto bpf_get_retval_proto; > extern const struct bpf_func_proto bpf_user_ringbuf_drain_proto; > +extern const struct bpf_func_proto bpf_cgroup_storage_get_proto; > +extern const struct bpf_func_proto bpf_cgroup_storage_delete_proto; > const struct bpf_func_proto *tracing_prog_func_proto( > enum bpf_func_id func_id, const struct bpf_prog *prog); > diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h > index 2c6a4f2562a7..7a0362d7a0aa 100644 > --- a/include/linux/bpf_types.h > +++ b/include/linux/bpf_types.h > @@ -90,6 +90,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_ARRAY, > cgroup_array_map_ops) > #ifdef CONFIG_CGROUP_BPF > BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_STORAGE, cgroup_storage_map_ops) > BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE, cgroup_storage_map_ops) > +BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, > cgroup_local_storage_map_ops) > #endif > BPF_MAP_TYPE(BPF_MAP_TYPE_HASH, htab_map_ops) > BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_HASH, htab_percpu_map_ops) > diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h > index 4bcf56b3491c..c6f4590dda68 100644 > --- a/include/linux/cgroup-defs.h > +++ b/include/linux/cgroup-defs.h > @@ -504,6 +504,10 @@ struct cgroup { > /* Used to store internal freezer state */ > struct cgroup_freezer_state freezer; > +#ifdef CONFIG_BPF_SYSCALL > + struct bpf_local_storage __rcu *bpf_cgroup_storage; > +#endif > + > /* ids of the ancestors at each level including self */ > u64 ancestor_ids[]; > }; > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > index 17f61338f8f8..d918b4054297 100644 > --- 
a/include/uapi/linux/bpf.h > +++ b/include/uapi/linux/bpf.h > @@ -935,6 +935,7 @@ enum bpf_map_type { > BPF_MAP_TYPE_TASK_STORAGE, > BPF_MAP_TYPE_BLOOM_FILTER, > BPF_MAP_TYPE_USER_RINGBUF, > + BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, > }; > /* Note that tracing related programs such as > @@ -5435,6 +5436,42 @@ union bpf_attr { > * **-E2BIG** if user-space has tried to publish a sample which is > * larger than the size of the ring buffer, or which cannot fit > * within a struct bpf_dynptr. > + * > + * void *bpf_cgroup_local_storage_get(struct bpf_map *map, struct cgroup > *cgroup, void *value, u64 flags) > + * Description > + * Get a bpf_local_storage from the *cgroup*. > + * > + * Logically, it could be thought of as getting the value from > + * a *map* with *cgroup* as the **key**. From this > + * perspective, the usage is not much different from > + * **bpf_map_lookup_elem**\ (*map*, **&**\ *cgroup*) except this > + * helper enforces the key must be a cgroup struct and the map must also > + * be a **BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE**. > + * > + * Underneath, the value is stored locally at *cgroup* instead of > + * the *map*. The *map* is used as the bpf-local-storage > + * "type". The bpf-local-storage "type" (i.e. the *map*) is > + * searched against all bpf_local_storage residing at *cgroup*. > + * > + * An optional *flags* (**BPF_LOCAL_STORAGE_GET_F_CREATE**) can be > + * used such that a new bpf_local_storage will be > + * created if one does not exist. *value* can be used > + * together with **BPF_LOCAL_STORAGE_GET_F_CREATE** to specify > + * the initial value of a bpf_local_storage. If *value* is > + * **NULL**, the new bpf_local_storage will be zero initialized. > + * Return > + * A bpf_local_storage pointer is returned on success. > + * > + * **NULL** if not found or there was an error in adding > + * a new bpf_local_storage. 
> + * > + * long bpf_cgroup_local_storage_delete(struct bpf_map *map, struct > cgroup *cgroup) > + * Description > + * Delete a bpf_local_storage from a *cgroup*. > + * Return > + * 0 on success. > + * > + * **-ENOENT** if the bpf_local_storage cannot be found. > */ > #define ___BPF_FUNC_MAPPER(FN, ctx...) \ > FN(unspec, 0, ##ctx) \ > @@ -5647,6 +5684,8 @@ union bpf_attr { > FN(tcp_raw_check_syncookie_ipv6, 207, ##ctx) \ > FN(ktime_get_tai_ns, 208, ##ctx) \ > FN(user_ringbuf_drain, 209, ##ctx) \ > + FN(cgroup_local_storage_get, 210, ##ctx) \ > + FN(cgroup_local_storage_delete, 211, ##ctx) \ > /* */ > /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that > don't > diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile > index 341c94f208f4..b02693f51978 100644 > --- a/kernel/bpf/Makefile > +++ b/kernel/bpf/Makefile > @@ -25,7 +25,7 @@ ifeq ($(CONFIG_PERF_EVENTS),y) > obj-$(CONFIG_BPF_SYSCALL) += stackmap.o > endif > ifeq ($(CONFIG_CGROUPS),y) > -obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o > +obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o bpf_cgroup_storage.o > endif > obj-$(CONFIG_CGROUP_BPF) += cgroup.o > ifeq ($(CONFIG_INET),y) > diff --git a/kernel/bpf/bpf_cgroup_storage.c > b/kernel/bpf/bpf_cgroup_storage.c > new file mode 100644 > index 000000000000..9974784822da > --- /dev/null > +++ b/kernel/bpf/bpf_cgroup_storage.c > @@ -0,0 +1,280 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/* > + * Copyright (c) 2022 Meta Platforms, Inc. and affiliates. 
> + */ > + > +#include <linux/types.h> > +#include <linux/bpf.h> > +#include <linux/bpf_local_storage.h> > +#include <uapi/linux/btf.h> > +#include <linux/btf_ids.h> > + > +DEFINE_BPF_STORAGE_CACHE(cgroup_cache); > + > +static DEFINE_PER_CPU(int, bpf_cgroup_storage_busy); > + > +static void bpf_cgroup_storage_lock(void) > +{ > + migrate_disable(); > + this_cpu_inc(bpf_cgroup_storage_busy); > +} > + > +static void bpf_cgroup_storage_unlock(void) > +{ > + this_cpu_dec(bpf_cgroup_storage_busy); > + migrate_enable(); > +} > + > +static bool bpf_cgroup_storage_trylock(void) > +{ > + migrate_disable(); > + if (unlikely(this_cpu_inc_return(bpf_cgroup_storage_busy) != 1)) { > + this_cpu_dec(bpf_cgroup_storage_busy); > + migrate_enable(); > + return false; > + } > + return true; > +} Task storage has lock/unlock/trylock; inode storage doesn't; why does cgroup need it as well? > +static struct bpf_local_storage __rcu **cgroup_storage_ptr(void *owner) > +{ > + struct cgroup *cg = owner; > + > + return &cg->bpf_cgroup_storage; > +} > + > +void bpf_local_cgroup_storage_free(struct cgroup *cgroup) > +{ > + struct bpf_local_storage *local_storage; > + struct bpf_local_storage_elem *selem; > + bool free_cgroup_storage = false; > + struct hlist_node *n; > + unsigned long flags; > + > + rcu_read_lock(); > + local_storage = rcu_dereference(cgroup->bpf_cgroup_storage); > + if (!local_storage) { > + rcu_read_unlock(); > + return; > + } > + > + /* Neither the bpf_prog nor the bpf-map's syscall > + * could be modifying the local_storage->list now. > + * Thus, no elem can be added-to or deleted-from the > + * local_storage->list by the bpf_prog or by the bpf-map's syscall. > + * > + * It is racing with bpf_local_storage_map_free() alone > + * when unlinking elem from the local_storage->list and > + * the map's bucket->list. 
> + */ > + bpf_cgroup_storage_lock(); > + raw_spin_lock_irqsave(&local_storage->lock, flags); > + hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) { > + bpf_selem_unlink_map(selem); > + free_cgroup_storage = > + bpf_selem_unlink_storage_nolock(local_storage, selem, false, false); > + } > + raw_spin_unlock_irqrestore(&local_storage->lock, flags); > + bpf_cgroup_storage_unlock(); > + rcu_read_unlock(); > + > + /* free_cgroup_storage should always be true as long as > + * local_storage->list was non-empty. > + */ > + if (free_cgroup_storage) > + kfree_rcu(local_storage, rcu); > +} > +static struct bpf_local_storage_data * > +cgroup_storage_lookup(struct cgroup *cgroup, struct bpf_map *map, bool > cacheit_lockit) > +{ > + struct bpf_local_storage *cgroup_storage; > + struct bpf_local_storage_map *smap; > + > + cgroup_storage = rcu_dereference_check(cgroup->bpf_cgroup_storage, > + bpf_rcu_lock_held()); > + if (!cgroup_storage) > + return NULL; > + > + smap = (struct bpf_local_storage_map *)map; > + return bpf_local_storage_lookup(cgroup_storage, smap, cacheit_lockit); > +} > + > +static void *bpf_cgroup_storage_lookup_elem(struct bpf_map *map, void > *key) > +{ > + struct bpf_local_storage_data *sdata; > + struct cgroup *cgroup; > + int fd; > + > + fd = *(int *)key; > + cgroup = cgroup_get_from_fd(fd); > + if (IS_ERR(cgroup)) > + return ERR_CAST(cgroup); > + > + bpf_cgroup_storage_lock(); > + sdata = cgroup_storage_lookup(cgroup, map, true); > + bpf_cgroup_storage_unlock(); > + cgroup_put(cgroup); > + return sdata ? sdata->data : NULL; > +} A lot of the above (free/lookup) seems to be copy-pasted from the task storage; any point in trying to generalize the common parts? 
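The duplication question above can be made concrete. In standalone C with stand-in types (none of these names exist in the kernel; this is only a sketch of the shape such a refactor could take), the per-map-type syscall handlers differ mainly in how the key resolves to an owner object, so that step could be passed in as a callback while the rest of the path is shared:

```c
#include <stddef.h>

/* Stand-in for the owner objects (task_struct, cgroup) and their
 * embedded local storage. Purely illustrative; not kernel API.
 */
struct owner {
	int storage;
};

static struct owner task_owner = { .storage = 100 };
static struct owner cgroup_owner = { .storage = 200 };

/* Per-map-type key resolution, mirroring the pid lookup in task
 * storage and cgroup_get_from_fd() in the proposed cgroup storage.
 */
static struct owner *owner_from_pid(const void *key)
{
	(void)key; /* a real resolver would look the pid up */
	return &task_owner;
}

static struct owner *owner_from_cgroup_fd(const void *key)
{
	(void)key; /* a real resolver would resolve the cgroup fd */
	return &cgroup_owner;
}

/* The shared path: everything after key resolution is identical,
 * so only the resolver callback differs per map type.
 */
static int *storage_lookup_elem(const void *key,
				struct owner *(*resolve)(const void *))
{
	struct owner *o = resolve(key);

	return o ? &o->storage : NULL;
}
```

With this shape, the task-storage and cgroup-storage lookup_elem handlers would become thin wrappers that pass their own resolver (and matching ref-put) to one shared routine.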
> +static int bpf_cgroup_storage_update_elem(struct bpf_map *map, void *key, > + void *value, u64 map_flags) > +{ > + struct bpf_local_storage_data *sdata; > + struct cgroup *cgroup; > + int err, fd; > + > + fd = *(int *)key; > + cgroup = cgroup_get_from_fd(fd); > + if (IS_ERR(cgroup)) > + return PTR_ERR(cgroup); > + > + bpf_cgroup_storage_lock(); > + sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map > *)map, > + value, map_flags, GFP_ATOMIC); > + bpf_cgroup_storage_unlock(); > + err = PTR_ERR_OR_ZERO(sdata); > + cgroup_put(cgroup); > + return err; > +} > + > +static int cgroup_storage_delete(struct cgroup *cgroup, struct bpf_map > *map) > +{ > + struct bpf_local_storage_data *sdata; > + > + sdata = cgroup_storage_lookup(cgroup, map, false); > + if (!sdata) > + return -ENOENT; > + > + bpf_selem_unlink(SELEM(sdata), true); > + return 0; > +} > + > +static int bpf_cgroup_storage_delete_elem(struct bpf_map *map, void *key) > +{ > + struct cgroup *cgroup; > + int err, fd; > + > + fd = *(int *)key; > + cgroup = cgroup_get_from_fd(fd); > + if (IS_ERR(cgroup)) > + return PTR_ERR(cgroup); > + > + bpf_cgroup_storage_lock(); > + err = cgroup_storage_delete(cgroup, map); > + bpf_cgroup_storage_unlock(); > + if (err) > + return err; > + > + cgroup_put(cgroup); > + return 0; > +} > + > +static int notsupp_get_next_key(struct bpf_map *map, void *key, void > *next_key) > +{ > + return -ENOTSUPP; > +} > + > +static struct bpf_map *cgroup_storage_map_alloc(union bpf_attr *attr) > +{ > + struct bpf_local_storage_map *smap; > + > + smap = bpf_local_storage_map_alloc(attr); > + if (IS_ERR(smap)) > + return ERR_CAST(smap); > + > + smap->cache_idx = bpf_local_storage_cache_idx_get(&cgroup_cache); > + return &smap->map; > +} > + > +static void cgroup_storage_map_free(struct bpf_map *map) > +{ > + struct bpf_local_storage_map *smap; > + > + smap = (struct bpf_local_storage_map *)map; > + bpf_local_storage_cache_idx_free(&cgroup_cache, smap->cache_idx); > + 
bpf_local_storage_map_free(smap, NULL); > +} > + > +/* *gfp_flags* is a hidden argument provided by the verifier */ > +BPF_CALL_5(bpf_cgroup_storage_get, struct bpf_map *, map, struct cgroup > *, cgroup, > + void *, value, u64, flags, gfp_t, gfp_flags) > +{ > + struct bpf_local_storage_data *sdata; > + > + WARN_ON_ONCE(!bpf_rcu_lock_held()); > + if (flags & ~(BPF_LOCAL_STORAGE_GET_F_CREATE)) > + return (unsigned long)NULL; > + > + if (!cgroup) > + return (unsigned long)NULL; > + > + if (!bpf_cgroup_storage_trylock()) > + return (unsigned long)NULL; > + > + sdata = cgroup_storage_lookup(cgroup, map, true); > + if (sdata) > + goto unlock; > + > + /* only allocate new storage, when the cgroup is refcounted */ > + if (!percpu_ref_is_dying(&cgroup->self.refcnt) && > + (flags & BPF_LOCAL_STORAGE_GET_F_CREATE)) > + sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map > *)map, > + value, BPF_NOEXIST, gfp_flags); > + > +unlock: > + bpf_cgroup_storage_unlock(); > + return IS_ERR_OR_NULL(sdata) ? 
(unsigned long)NULL : (unsigned > long)sdata->data; > +} > + > +BPF_CALL_2(bpf_cgroup_storage_delete, struct bpf_map *, map, struct > cgroup *, cgroup) > +{ > + int ret; > + > + WARN_ON_ONCE(!bpf_rcu_lock_held()); > + if (!cgroup) > + return -EINVAL; > + > + if (!bpf_cgroup_storage_trylock()) > + return -EBUSY; > + > + ret = cgroup_storage_delete(cgroup, map); > + bpf_cgroup_storage_unlock(); > + return ret; > +} > + > +BTF_ID_LIST_SINGLE(cgroup_storage_map_btf_ids, struct, > bpf_local_storage_map) > +const struct bpf_map_ops cgroup_local_storage_map_ops = { > + .map_meta_equal = bpf_map_meta_equal, > + .map_alloc_check = bpf_local_storage_map_alloc_check, > + .map_alloc = cgroup_storage_map_alloc, > + .map_free = cgroup_storage_map_free, > + .map_get_next_key = notsupp_get_next_key, > + .map_lookup_elem = bpf_cgroup_storage_lookup_elem, > + .map_update_elem = bpf_cgroup_storage_update_elem, > + .map_delete_elem = bpf_cgroup_storage_delete_elem, > + .map_check_btf = bpf_local_storage_map_check_btf, > + .map_btf_id = &cgroup_storage_map_btf_ids[0], > + .map_owner_storage_ptr = cgroup_storage_ptr, > +}; > + > +const struct bpf_func_proto bpf_cgroup_storage_get_proto = { > + .func = bpf_cgroup_storage_get, > + .gpl_only = false, > + .ret_type = RET_PTR_TO_MAP_VALUE_OR_NULL, > + .arg1_type = ARG_CONST_MAP_PTR, > + .arg2_type = ARG_PTR_TO_BTF_ID, > + .arg2_btf_id = &bpf_cgroup_btf_id[0], > + .arg3_type = ARG_PTR_TO_MAP_VALUE_OR_NULL, > + .arg4_type = ARG_ANYTHING, > +}; > + > +const struct bpf_func_proto bpf_cgroup_storage_delete_proto = { > + .func = bpf_cgroup_storage_delete, > + .gpl_only = false, > + .ret_type = RET_INTEGER, > + .arg1_type = ARG_CONST_MAP_PTR, > + .arg2_type = ARG_PTR_TO_BTF_ID, > + .arg2_btf_id = &bpf_cgroup_btf_id[0], > +}; > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c > index a6b04faed282..5c5bb08832ec 100644 > --- a/kernel/bpf/helpers.c > +++ b/kernel/bpf/helpers.c > @@ -1663,6 +1663,12 @@ bpf_base_func_proto(enum bpf_func_id 
func_id) > return &bpf_dynptr_write_proto; > case BPF_FUNC_dynptr_data: > return &bpf_dynptr_data_proto; > +#ifdef CONFIG_CGROUPS > + case BPF_FUNC_cgroup_local_storage_get: > + return &bpf_cgroup_storage_get_proto; > + case BPF_FUNC_cgroup_local_storage_delete: > + return &bpf_cgroup_storage_delete_proto; > +#endif > default: > break; > } > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c > index 7b373a5e861f..e53c7fae6e22 100644 > --- a/kernel/bpf/syscall.c > +++ b/kernel/bpf/syscall.c > @@ -1016,7 +1016,8 @@ static int map_check_btf(struct bpf_map *map, const > struct btf *btf, > map->map_type != BPF_MAP_TYPE_CGROUP_STORAGE && > map->map_type != BPF_MAP_TYPE_SK_STORAGE && > map->map_type != BPF_MAP_TYPE_INODE_STORAGE && > - map->map_type != BPF_MAP_TYPE_TASK_STORAGE) > + map->map_type != BPF_MAP_TYPE_TASK_STORAGE && > + map->map_type != BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE) > return -ENOTSUPP; > if (map->spin_lock_off + sizeof(struct bpf_spin_lock) > > map->value_size) { > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c > index 6f6d2d511c06..f36f6a3c0d50 100644 > --- a/kernel/bpf/verifier.c > +++ b/kernel/bpf/verifier.c > @@ -6360,6 +6360,11 @@ static int check_map_func_compatibility(struct > bpf_verifier_env *env, > func_id != BPF_FUNC_task_storage_delete) > goto error; > break; > + case BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE: > + if (func_id != BPF_FUNC_cgroup_local_storage_get && > + func_id != BPF_FUNC_cgroup_local_storage_delete) > + goto error; > + break; > case BPF_MAP_TYPE_BLOOM_FILTER: > if (func_id != BPF_FUNC_map_peek_elem && > func_id != BPF_FUNC_map_push_elem) > @@ -6472,6 +6477,11 @@ static int check_map_func_compatibility(struct > bpf_verifier_env *env, > if (map->map_type != BPF_MAP_TYPE_TASK_STORAGE) > goto error; > break; > + case BPF_FUNC_cgroup_local_storage_get: > + case BPF_FUNC_cgroup_local_storage_delete: > + if (map->map_type != BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE) > + goto error; > + break; > default: > break; > } > @@ -12713,6 
+12723,7 @@ static int check_map_prog_compatibility(struct > bpf_verifier_env *env, > case BPF_MAP_TYPE_INODE_STORAGE: > case BPF_MAP_TYPE_SK_STORAGE: > case BPF_MAP_TYPE_TASK_STORAGE: > + case BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE: > break; > default: > verbose(env, > @@ -14149,7 +14160,8 @@ static int do_misc_fixups(struct bpf_verifier_env > *env) > if (insn->imm == BPF_FUNC_task_storage_get || > insn->imm == BPF_FUNC_sk_storage_get || > - insn->imm == BPF_FUNC_inode_storage_get) { > + insn->imm == BPF_FUNC_inode_storage_get || > + insn->imm == BPF_FUNC_cgroup_local_storage_get) { > if (env->prog->aux->sleepable) > insn_buf[0] = BPF_MOV64_IMM(BPF_REG_5, (__force __s32)GFP_KERNEL); > else > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c > index 8ad2c267ff47..2fa2c950c7fb 100644 > --- a/kernel/cgroup/cgroup.c > +++ b/kernel/cgroup/cgroup.c > @@ -985,6 +985,10 @@ void put_css_set_locked(struct css_set *cset) > put_css_set_locked(cset->dom_cset); > } > +#ifdef CONFIG_BPF_SYSCALL > + bpf_local_cgroup_storage_free(cset->dfl_cgrp); > +#endif > + > kfree_rcu(cset, rcu_head); > } > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c > index 688552df95ca..179adaae4a9f 100644 > --- a/kernel/trace/bpf_trace.c > +++ b/kernel/trace/bpf_trace.c > @@ -1454,6 +1454,10 @@ bpf_tracing_func_proto(enum bpf_func_id func_id, > const struct bpf_prog *prog) > return &bpf_get_current_cgroup_id_proto; > case BPF_FUNC_get_current_ancestor_cgroup_id: > return &bpf_get_current_ancestor_cgroup_id_proto; > + case BPF_FUNC_cgroup_local_storage_get: > + return &bpf_cgroup_storage_get_proto; > + case BPF_FUNC_cgroup_local_storage_delete: > + return &bpf_cgroup_storage_delete_proto; > #endif > case BPF_FUNC_send_signal: > return &bpf_send_signal_proto; > diff --git a/scripts/bpf_doc.py b/scripts/bpf_doc.py > index c0e6690be82a..fdb0aff8cb5a 100755 > --- a/scripts/bpf_doc.py > +++ b/scripts/bpf_doc.py > @@ -685,6 +685,7 @@ class PrinterHelpers(Printer): > 'struct udp6_sock', > 
'struct unix_sock', > 'struct task_struct', > + 'struct cgroup', > 'struct __sk_buff', > 'struct sk_msg_md', > @@ -742,6 +743,7 @@ class PrinterHelpers(Printer): > 'struct udp6_sock', > 'struct unix_sock', > 'struct task_struct', > + 'struct cgroup', > 'struct path', > 'struct btf_ptr', > 'struct inode', > diff --git a/tools/include/uapi/linux/bpf.h > b/tools/include/uapi/linux/bpf.h > index 17f61338f8f8..d918b4054297 100644 > --- a/tools/include/uapi/linux/bpf.h > +++ b/tools/include/uapi/linux/bpf.h > @@ -935,6 +935,7 @@ enum bpf_map_type { > BPF_MAP_TYPE_TASK_STORAGE, > BPF_MAP_TYPE_BLOOM_FILTER, > BPF_MAP_TYPE_USER_RINGBUF, > + BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, > }; > /* Note that tracing related programs such as > @@ -5435,6 +5436,42 @@ union bpf_attr { > * **-E2BIG** if user-space has tried to publish a sample which is > * larger than the size of the ring buffer, or which cannot fit > * within a struct bpf_dynptr. > + * > + * void *bpf_cgroup_local_storage_get(struct bpf_map *map, struct cgroup > *cgroup, void *value, u64 flags) > + * Description > + * Get a bpf_local_storage from the *cgroup*. > + * > + * Logically, it could be thought of as getting the value from > + * a *map* with *cgroup* as the **key**. From this > + * perspective, the usage is not much different from > + * **bpf_map_lookup_elem**\ (*map*, **&**\ *cgroup*) except this > + * helper enforces the key must be a cgroup struct and the map must also > + * be a **BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE**. > + * > + * Underneath, the value is stored locally at *cgroup* instead of > + * the *map*. The *map* is used as the bpf-local-storage > + * "type". The bpf-local-storage "type" (i.e. the *map*) is > + * searched against all bpf_local_storage residing at *cgroup*. > + * > + * An optional *flags* (**BPF_LOCAL_STORAGE_GET_F_CREATE**) can be > + * used such that a new bpf_local_storage will be > + * created if one does not exist. 
*value* can be used > + * together with **BPF_LOCAL_STORAGE_GET_F_CREATE** to specify > + * the initial value of a bpf_local_storage. If *value* is > + * **NULL**, the new bpf_local_storage will be zero initialized. > + * Return > + * A bpf_local_storage pointer is returned on success. > + * > + * **NULL** if not found or there was an error in adding > + * a new bpf_local_storage. > + * > + * long bpf_cgroup_local_storage_delete(struct bpf_map *map, struct > cgroup *cgroup) > + * Description > + * Delete a bpf_local_storage from a *cgroup*. > + * Return > + * 0 on success. > + * > + * **-ENOENT** if the bpf_local_storage cannot be found. > */ > #define ___BPF_FUNC_MAPPER(FN, ctx...) \ > FN(unspec, 0, ##ctx) \ > @@ -5647,6 +5684,8 @@ union bpf_attr { > FN(tcp_raw_check_syncookie_ipv6, 207, ##ctx) \ > FN(ktime_get_tai_ns, 208, ##ctx) \ > FN(user_ringbuf_drain, 209, ##ctx) \ > + FN(cgroup_local_storage_get, 210, ##ctx) \ > + FN(cgroup_local_storage_delete, 211, ##ctx) \ > /* */ > /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that > don't > -- > 2.30.2 ^ permalink raw reply [flat|nested] 38+ messages in thread
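One way to see what the trylock in this patch buys (compared to inode storage, which has no such guard) is a minimal userspace analog. The kernel version pairs a per-CPU counter with migrate_disable(); the plain counter below only approximates that for illustration, and none of these names are kernel API:

```c
#include <stdbool.h>

/* Analog of the bpf_cgroup_storage_busy guard: trylock fails
 * instead of deadlocking when the same context re-enters, e.g.
 * a tracing program firing while a storage operation already
 * holds the "lock".
 */
static int storage_busy;

static bool storage_trylock(void)
{
	if (++storage_busy != 1) {
		/* re-entry detected: back out and report failure */
		--storage_busy;
		return false;
	}
	return true;
}

static void storage_unlock(void)
{
	--storage_busy;
}
```

A nested entry observes the counter already above 1 and fails, which is why the helpers return NULL/-EBUSY rather than blocking on local_storage->lock.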
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs 2022-10-17 18:01 ` sdf @ 2022-10-17 18:25 ` Yosry Ahmed 2022-10-17 18:43 ` Stanislav Fomichev 2022-10-17 20:10 ` Yonghong Song 2022-10-17 19:23 ` Yonghong Song 2022-10-17 22:26 ` Martin KaFai Lau 2 siblings, 2 replies; 38+ messages in thread From: Yosry Ahmed @ 2022-10-17 18:25 UTC (permalink / raw) To: sdf Cc: Yonghong Song, bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo On Mon, Oct 17, 2022 at 11:02 AM <sdf@google.com> wrote: > > On 10/13, Yonghong Song wrote: > > Similar to sk/inode/task storage, implement similar cgroup local storage. > > > There already exists a local storage implementation for cgroup-attached > > bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper > > bpf_get_local_storage(). But there are use cases such that non-cgroup > > attached bpf progs wants to access cgroup local storage data. For example, > > tc egress prog has access to sk and cgroup. It is possible to use > > sk local storage to emulate cgroup local storage by storing data in > > socket. > > But this is a waste as it could be lots of sockets belonging to a > > particular > > cgroup. Alternatively, a separate map can be created with cgroup id as > > the key. > > But this will introduce additional overhead to manipulate the new map. > > A cgroup local storage, similar to existing sk/inode/task storage, > > should help for this use case. > > > The life-cycle of storage is managed with the life-cycle of the > > cgroup struct. i.e. the storage is destroyed along with the owning cgroup > > with a callback to the bpf_cgroup_storage_free when cgroup itself > > is deleted. > > > The userspace map operations can be done by using a cgroup fd as a key > > passed to the lookup, update and delete operations. > > > [..] 
> > > Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old cgroup > > local > > storage support, the new map name BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE is > > used > > for cgroup storage available to non-cgroup-attached bpf programs. The two > > helpers are named as bpf_cgroup_local_storage_get() and > > bpf_cgroup_local_storage_delete(). > > Have you considered doing something similar to 7d9c3427894f ("bpf: Make > cgroup storages shared between programs on the same cgroup") where > the map changes its behavior depending on the key size (see key_size checks > in cgroup_storage_map_alloc)? Looks like sizeof(int) for fd still > can be used so we can, in theory, reuse the name.. > > Pros: > - no need for a new map name > > Cons: > - existing BPF_MAP_TYPE_CGROUP_STORAGE is already messy; might be not a > good idea to add more stuff to it? > > But, for the very least, should we also extend > Documentation/bpf/map_cgroup_storage.rst to cover the new map? We've > tried to keep some of the important details in there.. This might be a long shot, but is it possible to switch completely to this new generic cgroup storage, and for programs that attach to cgroups we can still do lookups/allocations during attachment like we do today? IOW, maintain the current API for cgroup progs but switch it to use this new map type instead. It feels like this map type is more generic and can be a superset of the existing cgroup storage, but I feel like I am missing something. 
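For orientation, the cover letter says userspace map operations on the new map take a cgroup fd as the key. A rough sketch of that from userspace follows; the map type and key convention come from the patch, but the function name, cgroup path, value layout, and error handling are illustrative guesses, and running this would require a kernel carrying this series:

```c
#include <bpf/bpf.h>
#include <fcntl.h>
#include <unistd.h>

struct value {
	long packets; /* illustrative value layout, not from the patch */
};

/* Update the map entry for one cgroup: the cgroup fd is the key,
 * mirroring how task storage uses a pid from userspace.
 */
int update_cgroup_value(int map_fd, const char *cgrp_path)
{
	struct value v = { .packets = 0 };
	int err, cg_fd;

	cg_fd = open(cgrp_path, O_RDONLY | O_DIRECTORY);
	if (cg_fd < 0)
		return -1;

	err = bpf_map_update_elem(map_fd, &cg_fd, &v, BPF_ANY);
	close(cg_fd);
	return err;
}
```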
> > > Signed-off-by: Yonghong Song <yhs@fb.com> > > --- > > include/linux/bpf.h | 3 + > > include/linux/bpf_types.h | 1 + > > include/linux/cgroup-defs.h | 4 + > > include/uapi/linux/bpf.h | 39 +++++ > > kernel/bpf/Makefile | 2 +- > > kernel/bpf/bpf_cgroup_storage.c | 280 ++++++++++++++++++++++++++++++++ > > kernel/bpf/helpers.c | 6 + > > kernel/bpf/syscall.c | 3 +- > > kernel/bpf/verifier.c | 14 +- > > kernel/cgroup/cgroup.c | 4 + > > kernel/trace/bpf_trace.c | 4 + > > scripts/bpf_doc.py | 2 + > > tools/include/uapi/linux/bpf.h | 39 +++++ > > 13 files changed, 398 insertions(+), 3 deletions(-) > > create mode 100644 kernel/bpf/bpf_cgroup_storage.c > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h > > index 9e7d46d16032..1395a01c7f18 100644 > > --- a/include/linux/bpf.h > > +++ b/include/linux/bpf.h > > @@ -2045,6 +2045,7 @@ struct bpf_link *bpf_link_by_id(u32 id); > > > const struct bpf_func_proto *bpf_base_func_proto(enum bpf_func_id > > func_id); > > void bpf_task_storage_free(struct task_struct *task); > > +void bpf_local_cgroup_storage_free(struct cgroup *cgroup); > > bool bpf_prog_has_kfunc_call(const struct bpf_prog *prog); > > const struct btf_func_model * > > bpf_jit_find_kfunc_model(const struct bpf_prog *prog, > > @@ -2537,6 +2538,8 @@ extern const struct bpf_func_proto > > bpf_copy_from_user_task_proto; > > extern const struct bpf_func_proto bpf_set_retval_proto; > > extern const struct bpf_func_proto bpf_get_retval_proto; > > extern const struct bpf_func_proto bpf_user_ringbuf_drain_proto; > > +extern const struct bpf_func_proto bpf_cgroup_storage_get_proto; > > +extern const struct bpf_func_proto bpf_cgroup_storage_delete_proto; > > > const struct bpf_func_proto *tracing_prog_func_proto( > > enum bpf_func_id func_id, const struct bpf_prog *prog); > > diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h > > index 2c6a4f2562a7..7a0362d7a0aa 100644 > > --- a/include/linux/bpf_types.h > > +++ b/include/linux/bpf_types.h > > @@ 
-90,6 +90,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_ARRAY, > > cgroup_array_map_ops) > > #ifdef CONFIG_CGROUP_BPF > > BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_STORAGE, cgroup_storage_map_ops) > > BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE, cgroup_storage_map_ops) > > +BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, > > cgroup_local_storage_map_ops) > > #endif > > BPF_MAP_TYPE(BPF_MAP_TYPE_HASH, htab_map_ops) > > BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_HASH, htab_percpu_map_ops) > > diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h > > index 4bcf56b3491c..c6f4590dda68 100644 > > --- a/include/linux/cgroup-defs.h > > +++ b/include/linux/cgroup-defs.h > > @@ -504,6 +504,10 @@ struct cgroup { > > /* Used to store internal freezer state */ > > struct cgroup_freezer_state freezer; > > > +#ifdef CONFIG_BPF_SYSCALL > > + struct bpf_local_storage __rcu *bpf_cgroup_storage; > > +#endif > > + > > /* ids of the ancestors at each level including self */ > > u64 ancestor_ids[]; > > }; > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > > index 17f61338f8f8..d918b4054297 100644 > > --- a/include/uapi/linux/bpf.h > > +++ b/include/uapi/linux/bpf.h > > @@ -935,6 +935,7 @@ enum bpf_map_type { > > BPF_MAP_TYPE_TASK_STORAGE, > > BPF_MAP_TYPE_BLOOM_FILTER, > > BPF_MAP_TYPE_USER_RINGBUF, > > + BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, > > }; > > > /* Note that tracing related programs such as > > @@ -5435,6 +5436,42 @@ union bpf_attr { > > * **-E2BIG** if user-space has tried to publish a sample which is > > * larger than the size of the ring buffer, or which cannot fit > > * within a struct bpf_dynptr. > > + * > > + * void *bpf_cgroup_local_storage_get(struct bpf_map *map, struct cgroup > > *cgroup, void *value, u64 flags) > > + * Description > > + * Get a bpf_local_storage from the *cgroup*. > > + * > > + * Logically, it could be thought of as getting the value from > > + * a *map* with *cgroup* as the **key**. 
From this > > + * perspective, the usage is not much different from > > + * **bpf_map_lookup_elem**\ (*map*, **&**\ *cgroup*) except this > > + * helper enforces the key must be a cgroup struct and the map must also > > + * be a **BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE**. > > + * > > + * Underneath, the value is stored locally at *cgroup* instead of > > + * the *map*. The *map* is used as the bpf-local-storage > > + * "type". The bpf-local-storage "type" (i.e. the *map*) is > > + * searched against all bpf_local_storage residing at *cgroup*. > > + * > > + * An optional *flags* (**BPF_LOCAL_STORAGE_GET_F_CREATE**) can be > > + * used such that a new bpf_local_storage will be > > + * created if one does not exist. *value* can be used > > + * together with **BPF_LOCAL_STORAGE_GET_F_CREATE** to specify > > + * the initial value of a bpf_local_storage. If *value* is > > + * **NULL**, the new bpf_local_storage will be zero initialized. > > + * Return > > + * A bpf_local_storage pointer is returned on success. > > + * > > + * **NULL** if not found or there was an error in adding > > + * a new bpf_local_storage. > > + * > > + * long bpf_cgroup_local_storage_delete(struct bpf_map *map, struct > > cgroup *cgroup) > > + * Description > > + * Delete a bpf_local_storage from a *cgroup*. > > + * Return > > + * 0 on success. > > + * > > + * **-ENOENT** if the bpf_local_storage cannot be found. > > */ > > #define ___BPF_FUNC_MAPPER(FN, ctx...) 
\ > > FN(unspec, 0, ##ctx) \ > > @@ -5647,6 +5684,8 @@ union bpf_attr { > > FN(tcp_raw_check_syncookie_ipv6, 207, ##ctx) \ > > FN(ktime_get_tai_ns, 208, ##ctx) \ > > FN(user_ringbuf_drain, 209, ##ctx) \ > > + FN(cgroup_local_storage_get, 210, ##ctx) \ > > + FN(cgroup_local_storage_delete, 211, ##ctx) \ > > /* */ > > > /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that > > don't > > diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile > > index 341c94f208f4..b02693f51978 100644 > > --- a/kernel/bpf/Makefile > > +++ b/kernel/bpf/Makefile > > @@ -25,7 +25,7 @@ ifeq ($(CONFIG_PERF_EVENTS),y) > > obj-$(CONFIG_BPF_SYSCALL) += stackmap.o > > endif > > ifeq ($(CONFIG_CGROUPS),y) > > -obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o > > +obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o bpf_cgroup_storage.o > > endif > > obj-$(CONFIG_CGROUP_BPF) += cgroup.o > > ifeq ($(CONFIG_INET),y) > > diff --git a/kernel/bpf/bpf_cgroup_storage.c > > b/kernel/bpf/bpf_cgroup_storage.c > > new file mode 100644 > > index 000000000000..9974784822da > > --- /dev/null > > +++ b/kernel/bpf/bpf_cgroup_storage.c > > @@ -0,0 +1,280 @@ > > +// SPDX-License-Identifier: GPL-2.0 > > +/* > > + * Copyright (c) 2022 Meta Platforms, Inc. and affiliates. 
> > + */ > > + > > +#include <linux/types.h> > > +#include <linux/bpf.h> > > +#include <linux/bpf_local_storage.h> > > +#include <uapi/linux/btf.h> > > +#include <linux/btf_ids.h> > > + > > +DEFINE_BPF_STORAGE_CACHE(cgroup_cache); > > + > > +static DEFINE_PER_CPU(int, bpf_cgroup_storage_busy); > > + > > +static void bpf_cgroup_storage_lock(void) > > +{ > > + migrate_disable(); > > + this_cpu_inc(bpf_cgroup_storage_busy); > > +} > > + > > +static void bpf_cgroup_storage_unlock(void) > > +{ > > + this_cpu_dec(bpf_cgroup_storage_busy); > > + migrate_enable(); > > +} > > + > > +static bool bpf_cgroup_storage_trylock(void) > > +{ > > + migrate_disable(); > > + if (unlikely(this_cpu_inc_return(bpf_cgroup_storage_busy) != 1)) { > > + this_cpu_dec(bpf_cgroup_storage_busy); > > + migrate_enable(); > > + return false; > > + } > > + return true; > > +} > > Task storage has lock/unlock/trylock; inode storage doesn't; why does > cgroup need it as well? > > > +static struct bpf_local_storage __rcu **cgroup_storage_ptr(void *owner) > > +{ > > + struct cgroup *cg = owner; > > + > > + return &cg->bpf_cgroup_storage; > > +} > > + > > +void bpf_local_cgroup_storage_free(struct cgroup *cgroup) > > +{ > > + struct bpf_local_storage *local_storage; > > + struct bpf_local_storage_elem *selem; > > + bool free_cgroup_storage = false; > > + struct hlist_node *n; > > + unsigned long flags; > > + > > + rcu_read_lock(); > > + local_storage = rcu_dereference(cgroup->bpf_cgroup_storage); > > + if (!local_storage) { > > + rcu_read_unlock(); > > + return; > > + } > > + > > + /* Neither the bpf_prog nor the bpf-map's syscall > > + * could be modifying the local_storage->list now. > > + * Thus, no elem can be added-to or deleted-from the > > + * local_storage->list by the bpf_prog or by the bpf-map's syscall. > > + * > > + * It is racing with bpf_local_storage_map_free() alone > > + * when unlinking elem from the local_storage->list and > > + * the map's bucket->list. 
> > + */ > > + bpf_cgroup_storage_lock(); > > + raw_spin_lock_irqsave(&local_storage->lock, flags); > > + hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) { > > + bpf_selem_unlink_map(selem); > > + free_cgroup_storage = > > + bpf_selem_unlink_storage_nolock(local_storage, selem, false, false); > > + } > > + raw_spin_unlock_irqrestore(&local_storage->lock, flags); > > + bpf_cgroup_storage_unlock(); > > + rcu_read_unlock(); > > + > > + /* free_cgroup_storage should always be true as long as > > + * local_storage->list was non-empty. > > + */ > > + if (free_cgroup_storage) > > + kfree_rcu(local_storage, rcu); > > +} > > > +static struct bpf_local_storage_data * > > +cgroup_storage_lookup(struct cgroup *cgroup, struct bpf_map *map, bool > > cacheit_lockit) > > +{ > > + struct bpf_local_storage *cgroup_storage; > > + struct bpf_local_storage_map *smap; > > + > > + cgroup_storage = rcu_dereference_check(cgroup->bpf_cgroup_storage, > > + bpf_rcu_lock_held()); > > + if (!cgroup_storage) > > + return NULL; > > + > > + smap = (struct bpf_local_storage_map *)map; > > + return bpf_local_storage_lookup(cgroup_storage, smap, cacheit_lockit); > > +} > > + > > +static void *bpf_cgroup_storage_lookup_elem(struct bpf_map *map, void > > *key) > > +{ > > + struct bpf_local_storage_data *sdata; > > + struct cgroup *cgroup; > > + int fd; > > + > > + fd = *(int *)key; > > + cgroup = cgroup_get_from_fd(fd); > > + if (IS_ERR(cgroup)) > > + return ERR_CAST(cgroup); > > + > > + bpf_cgroup_storage_lock(); > > + sdata = cgroup_storage_lookup(cgroup, map, true); > > + bpf_cgroup_storage_unlock(); > > + cgroup_put(cgroup); > > + return sdata ? sdata->data : NULL; > > +} > > A lot of the above (free/lookup) seems to be copy-pasted from the task > storage; > any point in trying to generalize the common parts? 
>
> > +static int bpf_cgroup_storage_update_elem(struct bpf_map *map, void *key,
> > +					  void *value, u64 map_flags)
> > +{
> > +	struct bpf_local_storage_data *sdata;
> > +	struct cgroup *cgroup;
> > +	int err, fd;
> > +
> > +	fd = *(int *)key;
> > +	cgroup = cgroup_get_from_fd(fd);
> > +	if (IS_ERR(cgroup))
> > +		return PTR_ERR(cgroup);
> > +
> > +	bpf_cgroup_storage_lock();
> > +	sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map *)map,
> > +					 value, map_flags, GFP_ATOMIC);
> > +	bpf_cgroup_storage_unlock();
> > +	err = PTR_ERR_OR_ZERO(sdata);
> > +	cgroup_put(cgroup);
> > +	return err;
> > +}
> > +
> > +static int cgroup_storage_delete(struct cgroup *cgroup, struct bpf_map *map)
> > +{
> > +	struct bpf_local_storage_data *sdata;
> > +
> > +	sdata = cgroup_storage_lookup(cgroup, map, false);
> > +	if (!sdata)
> > +		return -ENOENT;
> > +
> > +	bpf_selem_unlink(SELEM(sdata), true);
> > +	return 0;
> > +}
> > +
> > +static int bpf_cgroup_storage_delete_elem(struct bpf_map *map, void *key)
> > +{
> > +	struct cgroup *cgroup;
> > +	int err, fd;
> > +
> > +	fd = *(int *)key;
> > +	cgroup = cgroup_get_from_fd(fd);
> > +	if (IS_ERR(cgroup))
> > +		return PTR_ERR(cgroup);
> > +
> > +	bpf_cgroup_storage_lock();
> > +	err = cgroup_storage_delete(cgroup, map);
> > +	bpf_cgroup_storage_unlock();
> > +	if (err)
> > +		return err;
> > +
> > +	cgroup_put(cgroup);
> > +	return 0;
> > +}
> > +
> > +static int notsupp_get_next_key(struct bpf_map *map, void *key, void *next_key)
> > +{
> > +	return -ENOTSUPP;
> > +}
> > +
> > +static struct bpf_map *cgroup_storage_map_alloc(union bpf_attr *attr)
> > +{
> > +	struct bpf_local_storage_map *smap;
> > +
> > +	smap = bpf_local_storage_map_alloc(attr);
> > +	if (IS_ERR(smap))
> > +		return ERR_CAST(smap);
> > +
> > +	smap->cache_idx = bpf_local_storage_cache_idx_get(&cgroup_cache);
> > +	return &smap->map;
> > +}
> > +
> > +static void cgroup_storage_map_free(struct bpf_map *map)
> > +{
> > +	struct bpf_local_storage_map *smap;
> > +
> > +	smap = (struct bpf_local_storage_map *)map;
> > +	bpf_local_storage_cache_idx_free(&cgroup_cache, smap->cache_idx);
> > +	bpf_local_storage_map_free(smap, NULL);
> > +}
> > +
> > +/* *gfp_flags* is a hidden argument provided by the verifier */
> > +BPF_CALL_5(bpf_cgroup_storage_get, struct bpf_map *, map, struct cgroup *, cgroup,
> > +	   void *, value, u64, flags, gfp_t, gfp_flags)
> > +{
> > +	struct bpf_local_storage_data *sdata;
> > +
> > +	WARN_ON_ONCE(!bpf_rcu_lock_held());
> > +	if (flags & ~(BPF_LOCAL_STORAGE_GET_F_CREATE))
> > +		return (unsigned long)NULL;
> > +
> > +	if (!cgroup)
> > +		return (unsigned long)NULL;
> > +
> > +	if (!bpf_cgroup_storage_trylock())
> > +		return (unsigned long)NULL;
> > +
> > +	sdata = cgroup_storage_lookup(cgroup, map, true);
> > +	if (sdata)
> > +		goto unlock;
> > +
> > +	/* only allocate new storage, when the cgroup is refcounted */
> > +	if (!percpu_ref_is_dying(&cgroup->self.refcnt) &&
> > +	    (flags & BPF_LOCAL_STORAGE_GET_F_CREATE))
> > +		sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map *)map,
> > +						 value, BPF_NOEXIST, gfp_flags);
> > +
> > +unlock:
> > +	bpf_cgroup_storage_unlock();
> > +	return IS_ERR_OR_NULL(sdata) ? (unsigned long)NULL : (unsigned long)sdata->data;
> > +}
> > +
> > +BPF_CALL_2(bpf_cgroup_storage_delete, struct bpf_map *, map, struct cgroup *, cgroup)
> > +{
> > +	int ret;
> > +
> > +	WARN_ON_ONCE(!bpf_rcu_lock_held());
> > +	if (!cgroup)
> > +		return -EINVAL;
> > +
> > +	if (!bpf_cgroup_storage_trylock())
> > +		return -EBUSY;
> > +
> > +	ret = cgroup_storage_delete(cgroup, map);
> > +	bpf_cgroup_storage_unlock();
> > +	return ret;
> > +}
> > +
> > +BTF_ID_LIST_SINGLE(cgroup_storage_map_btf_ids, struct, bpf_local_storage_map)
> > +const struct bpf_map_ops cgroup_local_storage_map_ops = {
> > +	.map_meta_equal = bpf_map_meta_equal,
> > +	.map_alloc_check = bpf_local_storage_map_alloc_check,
> > +	.map_alloc = cgroup_storage_map_alloc,
> > +	.map_free = cgroup_storage_map_free,
> > +	.map_get_next_key = notsupp_get_next_key,
> > +	.map_lookup_elem = bpf_cgroup_storage_lookup_elem,
> > +	.map_update_elem = bpf_cgroup_storage_update_elem,
> > +	.map_delete_elem = bpf_cgroup_storage_delete_elem,
> > +	.map_check_btf = bpf_local_storage_map_check_btf,
> > +	.map_btf_id = &cgroup_storage_map_btf_ids[0],
> > +	.map_owner_storage_ptr = cgroup_storage_ptr,
> > +};
> > +
> > +const struct bpf_func_proto bpf_cgroup_storage_get_proto = {
> > +	.func		= bpf_cgroup_storage_get,
> > +	.gpl_only	= false,
> > +	.ret_type	= RET_PTR_TO_MAP_VALUE_OR_NULL,
> > +	.arg1_type	= ARG_CONST_MAP_PTR,
> > +	.arg2_type	= ARG_PTR_TO_BTF_ID,
> > +	.arg2_btf_id	= &bpf_cgroup_btf_id[0],
> > +	.arg3_type	= ARG_PTR_TO_MAP_VALUE_OR_NULL,
> > +	.arg4_type	= ARG_ANYTHING,
> > +};
> > +
> > +const struct bpf_func_proto bpf_cgroup_storage_delete_proto = {
> > +	.func		= bpf_cgroup_storage_delete,
> > +	.gpl_only	= false,
> > +	.ret_type	= RET_INTEGER,
> > +	.arg1_type	= ARG_CONST_MAP_PTR,
> > +	.arg2_type	= ARG_PTR_TO_BTF_ID,
> > +	.arg2_btf_id	= &bpf_cgroup_btf_id[0],
> > +};
> > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> > index a6b04faed282..5c5bb08832ec 100644
> > --- a/kernel/bpf/helpers.c
> > +++ b/kernel/bpf/helpers.c
> > @@ -1663,6 +1663,12 @@ bpf_base_func_proto(enum bpf_func_id func_id)
> > 		return &bpf_dynptr_write_proto;
> > 	case BPF_FUNC_dynptr_data:
> > 		return &bpf_dynptr_data_proto;
> > +#ifdef CONFIG_CGROUPS
> > +	case BPF_FUNC_cgroup_local_storage_get:
> > +		return &bpf_cgroup_storage_get_proto;
> > +	case BPF_FUNC_cgroup_local_storage_delete:
> > +		return &bpf_cgroup_storage_delete_proto;
> > +#endif
> > 	default:
> > 		break;
> > 	}
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index 7b373a5e861f..e53c7fae6e22 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -1016,7 +1016,8 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
> > 	    map->map_type != BPF_MAP_TYPE_CGROUP_STORAGE &&
> > 	    map->map_type != BPF_MAP_TYPE_SK_STORAGE &&
> > 	    map->map_type != BPF_MAP_TYPE_INODE_STORAGE &&
> > -	    map->map_type != BPF_MAP_TYPE_TASK_STORAGE)
> > +	    map->map_type != BPF_MAP_TYPE_TASK_STORAGE &&
> > +	    map->map_type != BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE)
> > 		return -ENOTSUPP;
> > 	if (map->spin_lock_off + sizeof(struct bpf_spin_lock) >
> > 	    map->value_size) {
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index 6f6d2d511c06..f36f6a3c0d50 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -6360,6 +6360,11 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
> > 		    func_id != BPF_FUNC_task_storage_delete)
> > 			goto error;
> > 		break;
> > +	case BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE:
> > +		if (func_id != BPF_FUNC_cgroup_local_storage_get &&
> > +		    func_id != BPF_FUNC_cgroup_local_storage_delete)
> > +			goto error;
> > +		break;
> > 	case BPF_MAP_TYPE_BLOOM_FILTER:
> > 		if (func_id != BPF_FUNC_map_peek_elem &&
> > 		    func_id != BPF_FUNC_map_push_elem)
> > @@ -6472,6 +6477,11 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
> > 		if (map->map_type != BPF_MAP_TYPE_TASK_STORAGE)
> > 			goto error;
> > 		break;
> > +	case BPF_FUNC_cgroup_local_storage_get:
> > +	case BPF_FUNC_cgroup_local_storage_delete:
> > +		if (map->map_type != BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE)
> > +			goto error;
> > +		break;
> > 	default:
> > 		break;
> > 	}
> > @@ -12713,6 +12723,7 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
> > 	case BPF_MAP_TYPE_INODE_STORAGE:
> > 	case BPF_MAP_TYPE_SK_STORAGE:
> > 	case BPF_MAP_TYPE_TASK_STORAGE:
> > +	case BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE:
> > 		break;
> > 	default:
> > 		verbose(env,
> > @@ -14149,7 +14160,8 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
> >
> > 		if (insn->imm == BPF_FUNC_task_storage_get ||
> > 		    insn->imm == BPF_FUNC_sk_storage_get ||
> > -		    insn->imm == BPF_FUNC_inode_storage_get) {
> > +		    insn->imm == BPF_FUNC_inode_storage_get ||
> > +		    insn->imm == BPF_FUNC_cgroup_local_storage_get) {
> > 			if (env->prog->aux->sleepable)
> > 				insn_buf[0] = BPF_MOV64_IMM(BPF_REG_5, (__force __s32)GFP_KERNEL);
> > 			else
> > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> > index 8ad2c267ff47..2fa2c950c7fb 100644
> > --- a/kernel/cgroup/cgroup.c
> > +++ b/kernel/cgroup/cgroup.c
> > @@ -985,6 +985,10 @@ void put_css_set_locked(struct css_set *cset)
> > 		put_css_set_locked(cset->dom_cset);
> > 	}
> >
> > +#ifdef CONFIG_BPF_SYSCALL
> > +	bpf_local_cgroup_storage_free(cset->dfl_cgrp);
> > +#endif
> > +

I am confused about this freeing site. It seems like this path is for
freeing css_set's of task_structs, not for freeing the cgroup itself.
Wouldn't we want to free the local storage when we free the cgroup
itself? Somewhere like css_free_rwork_fn()? or did I completely miss
the point here?
> > 	kfree_rcu(cset, rcu_head);
> > }
> >
> > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> > index 688552df95ca..179adaae4a9f 100644
> > --- a/kernel/trace/bpf_trace.c
> > +++ b/kernel/trace/bpf_trace.c
> > @@ -1454,6 +1454,10 @@ bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> > 		return &bpf_get_current_cgroup_id_proto;
> > 	case BPF_FUNC_get_current_ancestor_cgroup_id:
> > 		return &bpf_get_current_ancestor_cgroup_id_proto;
> > +	case BPF_FUNC_cgroup_local_storage_get:
> > +		return &bpf_cgroup_storage_get_proto;
> > +	case BPF_FUNC_cgroup_local_storage_delete:
> > +		return &bpf_cgroup_storage_delete_proto;
> > #endif
> > 	case BPF_FUNC_send_signal:
> > 		return &bpf_send_signal_proto;
> > diff --git a/scripts/bpf_doc.py b/scripts/bpf_doc.py
> > index c0e6690be82a..fdb0aff8cb5a 100755
> > --- a/scripts/bpf_doc.py
> > +++ b/scripts/bpf_doc.py
> > @@ -685,6 +685,7 @@ class PrinterHelpers(Printer):
> > 	    'struct udp6_sock',
> > 	    'struct unix_sock',
> > 	    'struct task_struct',
> > +	    'struct cgroup',
> >
> > 	    'struct __sk_buff',
> > 	    'struct sk_msg_md',
> > @@ -742,6 +743,7 @@ class PrinterHelpers(Printer):
> > 	    'struct udp6_sock',
> > 	    'struct unix_sock',
> > 	    'struct task_struct',
> > +	    'struct cgroup',
> > 	    'struct path',
> > 	    'struct btf_ptr',
> > 	    'struct inode',
> > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> > index 17f61338f8f8..d918b4054297 100644
> > --- a/tools/include/uapi/linux/bpf.h
> > +++ b/tools/include/uapi/linux/bpf.h
> > @@ -935,6 +935,7 @@ enum bpf_map_type {
> > 	BPF_MAP_TYPE_TASK_STORAGE,
> > 	BPF_MAP_TYPE_BLOOM_FILTER,
> > 	BPF_MAP_TYPE_USER_RINGBUF,
> > +	BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE,
> > };
> >
> > /* Note that tracing related programs such as
> > @@ -5435,6 +5436,42 @@ union bpf_attr {
> >  *		**-E2BIG** if user-space has tried to publish a sample which is
> >  *		larger than the size of the ring buffer, or which cannot fit
> >  *		within a struct bpf_dynptr.
> > + *
> > + * void *bpf_cgroup_local_storage_get(struct bpf_map *map, struct cgroup *cgroup, void *value, u64 flags)
> > + *	Description
> > + *		Get a bpf_local_storage from the *cgroup*.
> > + *
> > + *		Logically, it could be thought of as getting the value from
> > + *		a *map* with *cgroup* as the **key**. From this
> > + *		perspective, the usage is not much different from
> > + *		**bpf_map_lookup_elem**\ (*map*, **&**\ *cgroup*) except this
> > + *		helper enforces the key must be a cgroup struct and the map must also
> > + *		be a **BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE**.
> > + *
> > + *		Underneath, the value is stored locally at *cgroup* instead of
> > + *		the *map*. The *map* is used as the bpf-local-storage
> > + *		"type". The bpf-local-storage "type" (i.e. the *map*) is
> > + *		searched against all bpf_local_storage residing at *cgroup*.
> > + *
> > + *		An optional *flags* (**BPF_LOCAL_STORAGE_GET_F_CREATE**) can be
> > + *		used such that a new bpf_local_storage will be
> > + *		created if one does not exist. *value* can be used
> > + *		together with **BPF_LOCAL_STORAGE_GET_F_CREATE** to specify
> > + *		the initial value of a bpf_local_storage. If *value* is
> > + *		**NULL**, the new bpf_local_storage will be zero initialized.
> > + *	Return
> > + *		A bpf_local_storage pointer is returned on success.
> > + *
> > + *		**NULL** if not found or there was an error in adding
> > + *		a new bpf_local_storage.
> > + *
> > + * long bpf_cgroup_local_storage_delete(struct bpf_map *map, struct cgroup *cgroup)
> > + *	Description
> > + *		Delete a bpf_local_storage from a *cgroup*.
> > + *	Return
> > + *		0 on success.
> > + *
> > + *		**-ENOENT** if the bpf_local_storage cannot be found.
> >  */
> > #define ___BPF_FUNC_MAPPER(FN, ctx...)			\
> > 	FN(unspec, 0, ##ctx)				\
> > @@ -5647,6 +5684,8 @@ union bpf_attr {
> > 	FN(tcp_raw_check_syncookie_ipv6, 207, ##ctx)	\
> > 	FN(ktime_get_tai_ns, 208, ##ctx)		\
> > 	FN(user_ringbuf_drain, 209, ##ctx)		\
> > +	FN(cgroup_local_storage_get, 210, ##ctx)	\
> > +	FN(cgroup_local_storage_delete, 211, ##ctx)	\
> > /* */
> >
> > /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that don't
> > --
> > 2.30.2
>

^ permalink raw reply	[flat|nested] 38+ messages in thread
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs
  2022-10-17 18:25           ` Yosry Ahmed
@ 2022-10-17 18:43             ` Stanislav Fomichev
  2022-10-17 18:47               ` Yosry Ahmed
  2022-10-17 20:13               ` Yonghong Song
  2022-10-17 20:10             ` Yonghong Song
  1 sibling, 2 replies; 38+ messages in thread
From: Stanislav Fomichev @ 2022-10-17 18:43 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Yonghong Song, bpf, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo

On Mon, Oct 17, 2022 at 11:26 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Mon, Oct 17, 2022 at 11:02 AM <sdf@google.com> wrote:
> >
> > On 10/13, Yonghong Song wrote:
> > > Similar to sk/inode/task storage, implement similar cgroup local storage.
> > >
> > > There already exists a local storage implementation for cgroup-attached
> > > bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper
> > > bpf_get_local_storage(). But there are use cases such that non-cgroup
> > > attached bpf progs wants to access cgroup local storage data. For example,
> > > tc egress prog has access to sk and cgroup. It is possible to use
> > > sk local storage to emulate cgroup local storage by storing data in socket.
> > > But this is a waste as it could be lots of sockets belonging to a particular
> > > cgroup. Alternatively, a separate map can be created with cgroup id as the key.
> > > But this will introduce additional overhead to manipulate the new map.
> > > A cgroup local storage, similar to existing sk/inode/task storage,
> > > should help for this use case.
> > >
> > > The life-cycle of storage is managed with the life-cycle of the
> > > cgroup struct. i.e. the storage is destroyed along with the owning cgroup
> > > with a callback to the bpf_cgroup_storage_free when cgroup itself
> > > is deleted.
> > >
> > > The userspace map operations can be done by using a cgroup fd as a key
> > > passed to the lookup, update and delete operations.
> >
> > [..]
> >
> > > Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old cgroup local
> > > storage support, the new map name BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE is used
> > > for cgroup storage available to non-cgroup-attached bpf programs. The two
> > > helpers are named as bpf_cgroup_local_storage_get() and
> > > bpf_cgroup_local_storage_delete().
> >
> > Have you considered doing something similar to 7d9c3427894f ("bpf: Make
> > cgroup storages shared between programs on the same cgroup") where
> > the map changes its behavior depending on the key size (see key_size checks
> > in cgroup_storage_map_alloc)? Looks like sizeof(int) for fd still
> > can be used so we can, in theory, reuse the name..
> >
> > Pros:
> > - no need for a new map name
> >
> > Cons:
> > - existing BPF_MAP_TYPE_CGROUP_STORAGE is already messy; might be not a
> >   good idea to add more stuff to it?
> >
> > But, for the very least, should we also extend
> > Documentation/bpf/map_cgroup_storage.rst to cover the new map? We've
> > tried to keep some of the important details in there..
>
> This might be a long shot, but is it possible to switch completely to
> this new generic cgroup storage, and for programs that attach to
> cgroups we can still do lookups/allocations during attachment like we
> do today? IOW, maintain the current API for cgroup progs but switch it
> to use this new map type instead.
>
> It feels like this map type is more generic and can be a superset of
> the existing cgroup storage, but I feel like I am missing something.

I feel like the biggest issue is that the existing bpf_get_local_storage
helper is guaranteed to always return non-null and the verifier doesn't
require the programs to do null checks on it; the new helper might
return NULL making all existing programs fail the verifier.
There might be something else I don't remember at this point (besides that weird per-prog_type that we'd have to emulate as well).. > > > > > Signed-off-by: Yonghong Song <yhs@fb.com> > > > --- > > > include/linux/bpf.h | 3 + > > > include/linux/bpf_types.h | 1 + > > > include/linux/cgroup-defs.h | 4 + > > > include/uapi/linux/bpf.h | 39 +++++ > > > kernel/bpf/Makefile | 2 +- > > > kernel/bpf/bpf_cgroup_storage.c | 280 ++++++++++++++++++++++++++++++++ > > > kernel/bpf/helpers.c | 6 + > > > kernel/bpf/syscall.c | 3 +- > > > kernel/bpf/verifier.c | 14 +- > > > kernel/cgroup/cgroup.c | 4 + > > > kernel/trace/bpf_trace.c | 4 + > > > scripts/bpf_doc.py | 2 + > > > tools/include/uapi/linux/bpf.h | 39 +++++ > > > 13 files changed, 398 insertions(+), 3 deletions(-) > > > create mode 100644 kernel/bpf/bpf_cgroup_storage.c > > > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h > > > index 9e7d46d16032..1395a01c7f18 100644 > > > --- a/include/linux/bpf.h > > > +++ b/include/linux/bpf.h > > > @@ -2045,6 +2045,7 @@ struct bpf_link *bpf_link_by_id(u32 id); > > > > > const struct bpf_func_proto *bpf_base_func_proto(enum bpf_func_id > > > func_id); > > > void bpf_task_storage_free(struct task_struct *task); > > > +void bpf_local_cgroup_storage_free(struct cgroup *cgroup); > > > bool bpf_prog_has_kfunc_call(const struct bpf_prog *prog); > > > const struct btf_func_model * > > > bpf_jit_find_kfunc_model(const struct bpf_prog *prog, > > > @@ -2537,6 +2538,8 @@ extern const struct bpf_func_proto > > > bpf_copy_from_user_task_proto; > > > extern const struct bpf_func_proto bpf_set_retval_proto; > > > extern const struct bpf_func_proto bpf_get_retval_proto; > > > extern const struct bpf_func_proto bpf_user_ringbuf_drain_proto; > > > +extern const struct bpf_func_proto bpf_cgroup_storage_get_proto; > > > +extern const struct bpf_func_proto bpf_cgroup_storage_delete_proto; > > > > > const struct bpf_func_proto *tracing_prog_func_proto( > > > enum bpf_func_id func_id, const 
struct bpf_prog *prog); > > > diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h > > > index 2c6a4f2562a7..7a0362d7a0aa 100644 > > > --- a/include/linux/bpf_types.h > > > +++ b/include/linux/bpf_types.h > > > @@ -90,6 +90,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_ARRAY, > > > cgroup_array_map_ops) > > > #ifdef CONFIG_CGROUP_BPF > > > BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_STORAGE, cgroup_storage_map_ops) > > > BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE, cgroup_storage_map_ops) > > > +BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, > > > cgroup_local_storage_map_ops) > > > #endif > > > BPF_MAP_TYPE(BPF_MAP_TYPE_HASH, htab_map_ops) > > > BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_HASH, htab_percpu_map_ops) > > > diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h > > > index 4bcf56b3491c..c6f4590dda68 100644 > > > --- a/include/linux/cgroup-defs.h > > > +++ b/include/linux/cgroup-defs.h > > > @@ -504,6 +504,10 @@ struct cgroup { > > > /* Used to store internal freezer state */ > > > struct cgroup_freezer_state freezer; > > > > > +#ifdef CONFIG_BPF_SYSCALL > > > + struct bpf_local_storage __rcu *bpf_cgroup_storage; > > > +#endif > > > + > > > /* ids of the ancestors at each level including self */ > > > u64 ancestor_ids[]; > > > }; > > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > > > index 17f61338f8f8..d918b4054297 100644 > > > --- a/include/uapi/linux/bpf.h > > > +++ b/include/uapi/linux/bpf.h > > > @@ -935,6 +935,7 @@ enum bpf_map_type { > > > BPF_MAP_TYPE_TASK_STORAGE, > > > BPF_MAP_TYPE_BLOOM_FILTER, > > > BPF_MAP_TYPE_USER_RINGBUF, > > > + BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, > > > }; > > > > > /* Note that tracing related programs such as > > > @@ -5435,6 +5436,42 @@ union bpf_attr { > > > * **-E2BIG** if user-space has tried to publish a sample which is > > > * larger than the size of the ring buffer, or which cannot fit > > > * within a struct bpf_dynptr. 
> > > + * > > > + * void *bpf_cgroup_local_storage_get(struct bpf_map *map, struct cgroup > > > *cgroup, void *value, u64 flags) > > > + * Description > > > + * Get a bpf_local_storage from the *cgroup*. > > > + * > > > + * Logically, it could be thought of as getting the value from > > > + * a *map* with *cgroup* as the **key**. From this > > > + * perspective, the usage is not much different from > > > + * **bpf_map_lookup_elem**\ (*map*, **&**\ *cgroup*) except this > > > + * helper enforces the key must be a cgroup struct and the map must also > > > + * be a **BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE**. > > > + * > > > + * Underneath, the value is stored locally at *cgroup* instead of > > > + * the *map*. The *map* is used as the bpf-local-storage > > > + * "type". The bpf-local-storage "type" (i.e. the *map*) is > > > + * searched against all bpf_local_storage residing at *cgroup*. > > > + * > > > + * An optional *flags* (**BPF_LOCAL_STORAGE_GET_F_CREATE**) can be > > > + * used such that a new bpf_local_storage will be > > > + * created if one does not exist. *value* can be used > > > + * together with **BPF_LOCAL_STORAGE_GET_F_CREATE** to specify > > > + * the initial value of a bpf_local_storage. If *value* is > > > + * **NULL**, the new bpf_local_storage will be zero initialized. > > > + * Return > > > + * A bpf_local_storage pointer is returned on success. > > > + * > > > + * **NULL** if not found or there was an error in adding > > > + * a new bpf_local_storage. > > > + * > > > + * long bpf_cgroup_local_storage_delete(struct bpf_map *map, struct > > > cgroup *cgroup) > > > + * Description > > > + * Delete a bpf_local_storage from a *cgroup*. > > > + * Return > > > + * 0 on success. > > > + * > > > + * **-ENOENT** if the bpf_local_storage cannot be found. > > > */ > > > #define ___BPF_FUNC_MAPPER(FN, ctx...) 
\ > > > FN(unspec, 0, ##ctx) \ > > > @@ -5647,6 +5684,8 @@ union bpf_attr { > > > FN(tcp_raw_check_syncookie_ipv6, 207, ##ctx) \ > > > FN(ktime_get_tai_ns, 208, ##ctx) \ > > > FN(user_ringbuf_drain, 209, ##ctx) \ > > > + FN(cgroup_local_storage_get, 210, ##ctx) \ > > > + FN(cgroup_local_storage_delete, 211, ##ctx) \ > > > /* */ > > > > > /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that > > > don't > > > diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile > > > index 341c94f208f4..b02693f51978 100644 > > > --- a/kernel/bpf/Makefile > > > +++ b/kernel/bpf/Makefile > > > @@ -25,7 +25,7 @@ ifeq ($(CONFIG_PERF_EVENTS),y) > > > obj-$(CONFIG_BPF_SYSCALL) += stackmap.o > > > endif > > > ifeq ($(CONFIG_CGROUPS),y) > > > -obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o > > > +obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o bpf_cgroup_storage.o > > > endif > > > obj-$(CONFIG_CGROUP_BPF) += cgroup.o > > > ifeq ($(CONFIG_INET),y) > > > diff --git a/kernel/bpf/bpf_cgroup_storage.c > > > b/kernel/bpf/bpf_cgroup_storage.c > > > new file mode 100644 > > > index 000000000000..9974784822da > > > --- /dev/null > > > +++ b/kernel/bpf/bpf_cgroup_storage.c > > > @@ -0,0 +1,280 @@ > > > +// SPDX-License-Identifier: GPL-2.0 > > > +/* > > > + * Copyright (c) 2022 Meta Platforms, Inc. and affiliates. 
> > > + */ > > > + > > > +#include <linux/types.h> > > > +#include <linux/bpf.h> > > > +#include <linux/bpf_local_storage.h> > > > +#include <uapi/linux/btf.h> > > > +#include <linux/btf_ids.h> > > > + > > > +DEFINE_BPF_STORAGE_CACHE(cgroup_cache); > > > + > > > +static DEFINE_PER_CPU(int, bpf_cgroup_storage_busy); > > > + > > > +static void bpf_cgroup_storage_lock(void) > > > +{ > > > + migrate_disable(); > > > + this_cpu_inc(bpf_cgroup_storage_busy); > > > +} > > > + > > > +static void bpf_cgroup_storage_unlock(void) > > > +{ > > > + this_cpu_dec(bpf_cgroup_storage_busy); > > > + migrate_enable(); > > > +} > > > + > > > +static bool bpf_cgroup_storage_trylock(void) > > > +{ > > > + migrate_disable(); > > > + if (unlikely(this_cpu_inc_return(bpf_cgroup_storage_busy) != 1)) { > > > + this_cpu_dec(bpf_cgroup_storage_busy); > > > + migrate_enable(); > > > + return false; > > > + } > > > + return true; > > > +} > > > > Task storage has lock/unlock/trylock; inode storage doesn't; why does > > cgroup need it as well? > > > > > +static struct bpf_local_storage __rcu **cgroup_storage_ptr(void *owner) > > > +{ > > > + struct cgroup *cg = owner; > > > + > > > + return &cg->bpf_cgroup_storage; > > > +} > > > + > > > +void bpf_local_cgroup_storage_free(struct cgroup *cgroup) > > > +{ > > > + struct bpf_local_storage *local_storage; > > > + struct bpf_local_storage_elem *selem; > > > + bool free_cgroup_storage = false; > > > + struct hlist_node *n; > > > + unsigned long flags; > > > + > > > + rcu_read_lock(); > > > + local_storage = rcu_dereference(cgroup->bpf_cgroup_storage); > > > + if (!local_storage) { > > > + rcu_read_unlock(); > > > + return; > > > + } > > > + > > > + /* Neither the bpf_prog nor the bpf-map's syscall > > > + * could be modifying the local_storage->list now. > > > + * Thus, no elem can be added-to or deleted-from the > > > + * local_storage->list by the bpf_prog or by the bpf-map's syscall. 
> > > + * > > > + * It is racing with bpf_local_storage_map_free() alone > > > + * when unlinking elem from the local_storage->list and > > > + * the map's bucket->list. > > > + */ > > > + bpf_cgroup_storage_lock(); > > > + raw_spin_lock_irqsave(&local_storage->lock, flags); > > > + hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) { > > > + bpf_selem_unlink_map(selem); > > > + free_cgroup_storage = > > > + bpf_selem_unlink_storage_nolock(local_storage, selem, false, false); > > > + } > > > + raw_spin_unlock_irqrestore(&local_storage->lock, flags); > > > + bpf_cgroup_storage_unlock(); > > > + rcu_read_unlock(); > > > + > > > + /* free_cgroup_storage should always be true as long as > > > + * local_storage->list was non-empty. > > > + */ > > > + if (free_cgroup_storage) > > > + kfree_rcu(local_storage, rcu); > > > +} > > > > > +static struct bpf_local_storage_data * > > > +cgroup_storage_lookup(struct cgroup *cgroup, struct bpf_map *map, bool > > > cacheit_lockit) > > > +{ > > > + struct bpf_local_storage *cgroup_storage; > > > + struct bpf_local_storage_map *smap; > > > + > > > + cgroup_storage = rcu_dereference_check(cgroup->bpf_cgroup_storage, > > > + bpf_rcu_lock_held()); > > > + if (!cgroup_storage) > > > + return NULL; > > > + > > > + smap = (struct bpf_local_storage_map *)map; > > > + return bpf_local_storage_lookup(cgroup_storage, smap, cacheit_lockit); > > > +} > > > + > > > +static void *bpf_cgroup_storage_lookup_elem(struct bpf_map *map, void > > > *key) > > > +{ > > > + struct bpf_local_storage_data *sdata; > > > + struct cgroup *cgroup; > > > + int fd; > > > + > > > + fd = *(int *)key; > > > + cgroup = cgroup_get_from_fd(fd); > > > + if (IS_ERR(cgroup)) > > > + return ERR_CAST(cgroup); > > > + > > > + bpf_cgroup_storage_lock(); > > > + sdata = cgroup_storage_lookup(cgroup, map, true); > > > + bpf_cgroup_storage_unlock(); > > > + cgroup_put(cgroup); > > > + return sdata ? 
sdata->data : NULL; > > > +} > > > > A lot of the above (free/lookup) seems to be copy-pasted from the task > > storage; > > any point in trying to generalize the common parts? > > > > > +static int bpf_cgroup_storage_update_elem(struct bpf_map *map, void *key, > > > + void *value, u64 map_flags) > > > +{ > > > + struct bpf_local_storage_data *sdata; > > > + struct cgroup *cgroup; > > > + int err, fd; > > > + > > > + fd = *(int *)key; > > > + cgroup = cgroup_get_from_fd(fd); > > > + if (IS_ERR(cgroup)) > > > + return PTR_ERR(cgroup); > > > + > > > + bpf_cgroup_storage_lock(); > > > + sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map > > > *)map, > > > + value, map_flags, GFP_ATOMIC); > > > + bpf_cgroup_storage_unlock(); > > > + err = PTR_ERR_OR_ZERO(sdata); > > > + cgroup_put(cgroup); > > > + return err; > > > +} > > > + > > > +static int cgroup_storage_delete(struct cgroup *cgroup, struct bpf_map > > > *map) > > > +{ > > > + struct bpf_local_storage_data *sdata; > > > + > > > + sdata = cgroup_storage_lookup(cgroup, map, false); > > > + if (!sdata) > > > + return -ENOENT; > > > + > > > + bpf_selem_unlink(SELEM(sdata), true); > > > + return 0; > > > +} > > > + > > > +static int bpf_cgroup_storage_delete_elem(struct bpf_map *map, void *key) > > > +{ > > > + struct cgroup *cgroup; > > > + int err, fd; > > > + > > > + fd = *(int *)key; > > > + cgroup = cgroup_get_from_fd(fd); > > > + if (IS_ERR(cgroup)) > > > + return PTR_ERR(cgroup); > > > + > > > + bpf_cgroup_storage_lock(); > > > + err = cgroup_storage_delete(cgroup, map); > > > + bpf_cgroup_storage_unlock(); > > > + if (err) > > > + return err; > > > + > > > + cgroup_put(cgroup); > > > + return 0; > > > +} > > > + > > > +static int notsupp_get_next_key(struct bpf_map *map, void *key, void > > > *next_key) > > > +{ > > > + return -ENOTSUPP; > > > +} > > > + > > > +static struct bpf_map *cgroup_storage_map_alloc(union bpf_attr *attr) > > > +{ > > > + struct bpf_local_storage_map *smap; > > > + > 
> > + smap = bpf_local_storage_map_alloc(attr); > > > + if (IS_ERR(smap)) > > > + return ERR_CAST(smap); > > > + > > > + smap->cache_idx = bpf_local_storage_cache_idx_get(&cgroup_cache); > > > + return &smap->map; > > > +} > > > + > > > +static void cgroup_storage_map_free(struct bpf_map *map) > > > +{ > > > + struct bpf_local_storage_map *smap; > > > + > > > + smap = (struct bpf_local_storage_map *)map; > > > + bpf_local_storage_cache_idx_free(&cgroup_cache, smap->cache_idx); > > > + bpf_local_storage_map_free(smap, NULL); > > > +} > > > + > > > +/* *gfp_flags* is a hidden argument provided by the verifier */ > > > +BPF_CALL_5(bpf_cgroup_storage_get, struct bpf_map *, map, struct cgroup > > > *, cgroup, > > > + void *, value, u64, flags, gfp_t, gfp_flags) > > > +{ > > > + struct bpf_local_storage_data *sdata; > > > + > > > + WARN_ON_ONCE(!bpf_rcu_lock_held()); > > > + if (flags & ~(BPF_LOCAL_STORAGE_GET_F_CREATE)) > > > + return (unsigned long)NULL; > > > + > > > + if (!cgroup) > > > + return (unsigned long)NULL; > > > + > > > + if (!bpf_cgroup_storage_trylock()) > > > + return (unsigned long)NULL; > > > + > > > + sdata = cgroup_storage_lookup(cgroup, map, true); > > > + if (sdata) > > > + goto unlock; > > > + > > > + /* only allocate new storage, when the cgroup is refcounted */ > > > + if (!percpu_ref_is_dying(&cgroup->self.refcnt) && > > > + (flags & BPF_LOCAL_STORAGE_GET_F_CREATE)) > > > + sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map > > > *)map, > > > + value, BPF_NOEXIST, gfp_flags); > > > + > > > +unlock: > > > + bpf_cgroup_storage_unlock(); > > > + return IS_ERR_OR_NULL(sdata) ? 
(unsigned long)NULL : (unsigned > > > long)sdata->data; > > > +} > > > + > > > +BPF_CALL_2(bpf_cgroup_storage_delete, struct bpf_map *, map, struct > > > cgroup *, cgroup) > > > +{ > > > + int ret; > > > + > > > + WARN_ON_ONCE(!bpf_rcu_lock_held()); > > > + if (!cgroup) > > > + return -EINVAL; > > > + > > > + if (!bpf_cgroup_storage_trylock()) > > > + return -EBUSY; > > > + > > > + ret = cgroup_storage_delete(cgroup, map); > > > + bpf_cgroup_storage_unlock(); > > > + return ret; > > > +} > > > + > > > +BTF_ID_LIST_SINGLE(cgroup_storage_map_btf_ids, struct, > > > bpf_local_storage_map) > > > +const struct bpf_map_ops cgroup_local_storage_map_ops = { > > > + .map_meta_equal = bpf_map_meta_equal, > > > + .map_alloc_check = bpf_local_storage_map_alloc_check, > > > + .map_alloc = cgroup_storage_map_alloc, > > > + .map_free = cgroup_storage_map_free, > > > + .map_get_next_key = notsupp_get_next_key, > > > + .map_lookup_elem = bpf_cgroup_storage_lookup_elem, > > > + .map_update_elem = bpf_cgroup_storage_update_elem, > > > + .map_delete_elem = bpf_cgroup_storage_delete_elem, > > > + .map_check_btf = bpf_local_storage_map_check_btf, > > > + .map_btf_id = &cgroup_storage_map_btf_ids[0], > > > + .map_owner_storage_ptr = cgroup_storage_ptr, > > > +}; > > > + > > > +const struct bpf_func_proto bpf_cgroup_storage_get_proto = { > > > + .func = bpf_cgroup_storage_get, > > > + .gpl_only = false, > > > + .ret_type = RET_PTR_TO_MAP_VALUE_OR_NULL, > > > + .arg1_type = ARG_CONST_MAP_PTR, > > > + .arg2_type = ARG_PTR_TO_BTF_ID, > > > + .arg2_btf_id = &bpf_cgroup_btf_id[0], > > > + .arg3_type = ARG_PTR_TO_MAP_VALUE_OR_NULL, > > > + .arg4_type = ARG_ANYTHING, > > > +}; > > > + > > > +const struct bpf_func_proto bpf_cgroup_storage_delete_proto = { > > > + .func = bpf_cgroup_storage_delete, > > > + .gpl_only = false, > > > + .ret_type = RET_INTEGER, > > > + .arg1_type = ARG_CONST_MAP_PTR, > > > + .arg2_type = ARG_PTR_TO_BTF_ID, > > > + .arg2_btf_id = &bpf_cgroup_btf_id[0], > > > +}; > > > 
> > > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> > > index a6b04faed282..5c5bb08832ec 100644
> > > --- a/kernel/bpf/helpers.c
> > > +++ b/kernel/bpf/helpers.c
> > > @@ -1663,6 +1663,12 @@ bpf_base_func_proto(enum bpf_func_id func_id)
> > > 		return &bpf_dynptr_write_proto;
> > > 	case BPF_FUNC_dynptr_data:
> > > 		return &bpf_dynptr_data_proto;
> > > +#ifdef CONFIG_CGROUPS
> > > +	case BPF_FUNC_cgroup_local_storage_get:
> > > +		return &bpf_cgroup_storage_get_proto;
> > > +	case BPF_FUNC_cgroup_local_storage_delete:
> > > +		return &bpf_cgroup_storage_delete_proto;
> > > +#endif
> > > 	default:
> > > 		break;
> > > 	}
> > > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > > index 7b373a5e861f..e53c7fae6e22 100644
> > > --- a/kernel/bpf/syscall.c
> > > +++ b/kernel/bpf/syscall.c
> > > @@ -1016,7 +1016,8 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
> > > 	    map->map_type != BPF_MAP_TYPE_CGROUP_STORAGE &&
> > > 	    map->map_type != BPF_MAP_TYPE_SK_STORAGE &&
> > > 	    map->map_type != BPF_MAP_TYPE_INODE_STORAGE &&
> > > -	    map->map_type != BPF_MAP_TYPE_TASK_STORAGE)
> > > +	    map->map_type != BPF_MAP_TYPE_TASK_STORAGE &&
> > > +	    map->map_type != BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE)
> > > 		return -ENOTSUPP;
> > > 	if (map->spin_lock_off + sizeof(struct bpf_spin_lock) > map->value_size) {
> > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > > index 6f6d2d511c06..f36f6a3c0d50 100644
> > > --- a/kernel/bpf/verifier.c
> > > +++ b/kernel/bpf/verifier.c
> > > @@ -6360,6 +6360,11 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
> > > 		    func_id != BPF_FUNC_task_storage_delete)
> > > 			goto error;
> > > 		break;
> > > +	case BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE:
> > > +		if (func_id != BPF_FUNC_cgroup_local_storage_get &&
> > > +		    func_id != BPF_FUNC_cgroup_local_storage_delete)
> > > +			goto error;
> > > +		break;
> > > 	case BPF_MAP_TYPE_BLOOM_FILTER:
> > > 		if (func_id != BPF_FUNC_map_peek_elem &&
> > > 		    func_id != BPF_FUNC_map_push_elem)
> > > @@ -6472,6 +6477,11 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
> > > 		if (map->map_type != BPF_MAP_TYPE_TASK_STORAGE)
> > > 			goto error;
> > > 		break;
> > > +	case BPF_FUNC_cgroup_local_storage_get:
> > > +	case BPF_FUNC_cgroup_local_storage_delete:
> > > +		if (map->map_type != BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE)
> > > +			goto error;
> > > +		break;
> > > 	default:
> > > 		break;
> > > 	}
> > > @@ -12713,6 +12723,7 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
> > > 	case BPF_MAP_TYPE_INODE_STORAGE:
> > > 	case BPF_MAP_TYPE_SK_STORAGE:
> > > 	case BPF_MAP_TYPE_TASK_STORAGE:
> > > +	case BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE:
> > > 		break;
> > > 	default:
> > > 		verbose(env,
> > > @@ -14149,7 +14160,8 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
> > >
> > > 		if (insn->imm == BPF_FUNC_task_storage_get ||
> > > 		    insn->imm == BPF_FUNC_sk_storage_get ||
> > > -		    insn->imm == BPF_FUNC_inode_storage_get) {
> > > +		    insn->imm == BPF_FUNC_inode_storage_get ||
> > > +		    insn->imm == BPF_FUNC_cgroup_local_storage_get) {
> > > 			if (env->prog->aux->sleepable)
> > > 				insn_buf[0] = BPF_MOV64_IMM(BPF_REG_5, (__force __s32)GFP_KERNEL);
> > > 			else
> > > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> > > index 8ad2c267ff47..2fa2c950c7fb 100644
> > > --- a/kernel/cgroup/cgroup.c
> > > +++ b/kernel/cgroup/cgroup.c
> > > @@ -985,6 +985,10 @@ void put_css_set_locked(struct css_set *cset)
> > > 		put_css_set_locked(cset->dom_cset);
> > > 	}
> > >
> > > +#ifdef CONFIG_BPF_SYSCALL
> > > +	bpf_local_cgroup_storage_free(cset->dfl_cgrp);
> > > +#endif
> > > +

I am confused about this freeing site. It seems like this path is for
freeing the css_sets of task_structs, not for freeing the cgroup itself.
Wouldn't we want to free the local storage when we free the cgroup
itself, somewhere like css_free_rwork_fn()? Or did I completely miss
the point here?
> > > > kfree_rcu(cset, rcu_head); > > > } > > > > > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c > > > index 688552df95ca..179adaae4a9f 100644 > > > --- a/kernel/trace/bpf_trace.c > > > +++ b/kernel/trace/bpf_trace.c > > > @@ -1454,6 +1454,10 @@ bpf_tracing_func_proto(enum bpf_func_id func_id, > > > const struct bpf_prog *prog) > > > return &bpf_get_current_cgroup_id_proto; > > > case BPF_FUNC_get_current_ancestor_cgroup_id: > > > return &bpf_get_current_ancestor_cgroup_id_proto; > > > + case BPF_FUNC_cgroup_local_storage_get: > > > + return &bpf_cgroup_storage_get_proto; > > > + case BPF_FUNC_cgroup_local_storage_delete: > > > + return &bpf_cgroup_storage_delete_proto; > > > #endif > > > case BPF_FUNC_send_signal: > > > return &bpf_send_signal_proto; > > > diff --git a/scripts/bpf_doc.py b/scripts/bpf_doc.py > > > index c0e6690be82a..fdb0aff8cb5a 100755 > > > --- a/scripts/bpf_doc.py > > > +++ b/scripts/bpf_doc.py > > > @@ -685,6 +685,7 @@ class PrinterHelpers(Printer): > > > 'struct udp6_sock', > > > 'struct unix_sock', > > > 'struct task_struct', > > > + 'struct cgroup', > > > > > 'struct __sk_buff', > > > 'struct sk_msg_md', > > > @@ -742,6 +743,7 @@ class PrinterHelpers(Printer): > > > 'struct udp6_sock', > > > 'struct unix_sock', > > > 'struct task_struct', > > > + 'struct cgroup', > > > 'struct path', > > > 'struct btf_ptr', > > > 'struct inode', > > > diff --git a/tools/include/uapi/linux/bpf.h > > > b/tools/include/uapi/linux/bpf.h > > > index 17f61338f8f8..d918b4054297 100644 > > > --- a/tools/include/uapi/linux/bpf.h > > > +++ b/tools/include/uapi/linux/bpf.h > > > @@ -935,6 +935,7 @@ enum bpf_map_type { > > > BPF_MAP_TYPE_TASK_STORAGE, > > > BPF_MAP_TYPE_BLOOM_FILTER, > > > BPF_MAP_TYPE_USER_RINGBUF, > > > + BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, > > > }; > > > > > /* Note that tracing related programs such as > > > @@ -5435,6 +5436,42 @@ union bpf_attr { > > > * **-E2BIG** if user-space has tried to publish a sample which is > > > * 
larger than the size of the ring buffer, or which cannot fit > > > * within a struct bpf_dynptr. > > > + * > > > + * void *bpf_cgroup_local_storage_get(struct bpf_map *map, struct cgroup > > > *cgroup, void *value, u64 flags) > > > + * Description > > > + * Get a bpf_local_storage from the *cgroup*. > > > + * > > > + * Logically, it could be thought of as getting the value from > > > + * a *map* with *cgroup* as the **key**. From this > > > + * perspective, the usage is not much different from > > > + * **bpf_map_lookup_elem**\ (*map*, **&**\ *cgroup*) except this > > > + * helper enforces the key must be a cgroup struct and the map must also > > > + * be a **BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE**. > > > + * > > > + * Underneath, the value is stored locally at *cgroup* instead of > > > + * the *map*. The *map* is used as the bpf-local-storage > > > + * "type". The bpf-local-storage "type" (i.e. the *map*) is > > > + * searched against all bpf_local_storage residing at *cgroup*. > > > + * > > > + * An optional *flags* (**BPF_LOCAL_STORAGE_GET_F_CREATE**) can be > > > + * used such that a new bpf_local_storage will be > > > + * created if one does not exist. *value* can be used > > > + * together with **BPF_LOCAL_STORAGE_GET_F_CREATE** to specify > > > + * the initial value of a bpf_local_storage. If *value* is > > > + * **NULL**, the new bpf_local_storage will be zero initialized. > > > + * Return > > > + * A bpf_local_storage pointer is returned on success. > > > + * > > > + * **NULL** if not found or there was an error in adding > > > + * a new bpf_local_storage. > > > + * > > > + * long bpf_cgroup_local_storage_delete(struct bpf_map *map, struct > > > cgroup *cgroup) > > > + * Description > > > + * Delete a bpf_local_storage from a *cgroup*. > > > + * Return > > > + * 0 on success. > > > + * > > > + * **-ENOENT** if the bpf_local_storage cannot be found. > > > */ > > > #define ___BPF_FUNC_MAPPER(FN, ctx...) 
\ > > > FN(unspec, 0, ##ctx) \ > > > @@ -5647,6 +5684,8 @@ union bpf_attr { > > > FN(tcp_raw_check_syncookie_ipv6, 207, ##ctx) \ > > > FN(ktime_get_tai_ns, 208, ##ctx) \ > > > FN(user_ringbuf_drain, 209, ##ctx) \ > > > + FN(cgroup_local_storage_get, 210, ##ctx) \ > > > + FN(cgroup_local_storage_delete, 211, ##ctx) \ > > > /* */ > > > > > /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that > > > don't > > > -- > > > 2.30.2 > > ^ permalink raw reply [flat|nested] 38+ messages in thread
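[Editor's note] For readers skimming the thread, the intended BPF-program-side usage of the two helpers above would look roughly like the sketch below. This is a pseudocode-level illustration against the UAPI *as proposed in this series* (the helper and map-type names here follow the patch and may not match what was eventually merged); the map declaration style follows the usual BTF-defined local-storage convention, and the `task->cgroups->dfl_cgrp` access path is an assumption about how a tracing program would reach the current cgroup.

```c
/* Sketch only: assumes the proposed BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE
 * and bpf_cgroup_local_storage_get()/_delete() from this series. */
struct {
	__uint(type, BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__type(key, int);
	__type(value, long);
} cgrp_map SEC(".maps");

SEC("tp_btf/sys_enter")
int count_per_cgroup(void *ctx)
{
	struct task_struct *task = bpf_get_current_task_btf();
	long *counter;

	counter = bpf_cgroup_local_storage_get(&cgrp_map,
					       task->cgroups->dfl_cgrp, 0,
					       BPF_LOCAL_STORAGE_GET_F_CREATE);
	if (!counter)	/* unlike bpf_get_local_storage(), NULL is possible */
		return 0;

	__sync_fetch_and_add(counter, 1);
	return 0;
}
```

From user space, the same storage would be read with bpf_map_lookup_elem() using a cgroup fd as the key, per the commit message.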
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs
  2022-10-17 18:43           ` Stanislav Fomichev
@ 2022-10-17 18:47             ` Yosry Ahmed
  2022-10-17 19:07               ` Stanislav Fomichev
  2022-10-17 20:15               ` Yonghong Song
  2022-10-17 20:13             ` Yonghong Song
  1 sibling, 2 replies; 38+ messages in thread
From: Yosry Ahmed @ 2022-10-17 18:47 UTC (permalink / raw)
To: Stanislav Fomichev
Cc: Yonghong Song, bpf, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo

On Mon, Oct 17, 2022 at 11:43 AM Stanislav Fomichev <sdf@google.com> wrote:
>
> On Mon, Oct 17, 2022 at 11:26 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Mon, Oct 17, 2022 at 11:02 AM <sdf@google.com> wrote:
> > >
> > > On 10/13, Yonghong Song wrote:
> > > > Similar to sk/inode/task storage, implement similar cgroup local storage.
> > > >
> > > > There already exists a local storage implementation for cgroup-attached
> > > > bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper
> > > > bpf_get_local_storage(). But there are use cases such that non-cgroup
> > > > attached bpf progs wants to access cgroup local storage data. For example,
> > > > tc egress prog has access to sk and cgroup. It is possible to use
> > > > sk local storage to emulate cgroup local storage by storing data in socket.
> > > > But this is a waste as it could be lots of sockets belonging to a particular
> > > > cgroup. Alternatively, a separate map can be created with cgroup id as the key.
> > > > But this will introduce additional overhead to manipulate the new map.
> > > > A cgroup local storage, similar to existing sk/inode/task storage,
> > > > should help for this use case.
> > > >
> > > > The life-cycle of storage is managed with the life-cycle of the
> > > > cgroup struct. i.e. the storage is destroyed along with the owning cgroup
> > > > with a callback to the bpf_cgroup_storage_free when cgroup itself
> > > > is deleted.
> > > >
> > > > The userspace map operations can be done by using a cgroup fd as a key
> > > > passed to the lookup, update and delete operations.
> > >
> > > [..]
> > >
> > > > Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old cgroup local
> > > > storage support, the new map name BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE is used
> > > > for cgroup storage available to non-cgroup-attached bpf programs. The two
> > > > helpers are named as bpf_cgroup_local_storage_get() and
> > > > bpf_cgroup_local_storage_delete().
> > >
> > > Have you considered doing something similar to 7d9c3427894f ("bpf: Make
> > > cgroup storages shared between programs on the same cgroup") where
> > > the map changes its behavior depending on the key size (see key_size checks
> > > in cgroup_storage_map_alloc)? Looks like sizeof(int) for fd still
> > > can be used so we can, in theory, reuse the name..
> > >
> > > Pros:
> > > - no need for a new map name
> > >
> > > Cons:
> > > - existing BPF_MAP_TYPE_CGROUP_STORAGE is already messy; might be not a
> > >   good idea to add more stuff to it?
> > >
> > > But, for the very least, should we also extend
> > > Documentation/bpf/map_cgroup_storage.rst to cover the new map? We've
> > > tried to keep some of the important details in there..
> >
> > This might be a long shot, but is it possible to switch completely to
> > this new generic cgroup storage, and for programs that attach to
> > cgroups we can still do lookups/allocations during attachment like we
> > do today? IOW, maintain the current API for cgroup progs but switch it
> > to use this new map type instead.
> >
> > It feels like this map type is more generic and can be a superset of
> > the existing cgroup storage, but I feel like I am missing something.
>
> I feel like the biggest issue is that the existing
> bpf_get_local_storage helper is guaranteed to always return non-null
> and the verifier doesn't require the programs to do null checks on it;
> the new helper might return NULL, making all existing programs fail the
> verifier.

What I meant is, keep the old bpf_get_local_storage helper only for
cgroup-attached programs like we have today, and add a new generic
bpf_cgroup_local_storage_get() helper. For cgroup-attached programs,
make sure a cgroup storage entry is allocated and hooked to the helper
at program attach time, to keep today's behavior constant. For other
programs, bpf_cgroup_local_storage_get() will do the normal lookup and
allocate if necessary. Does this make any sense to you?

> There might be something else I don't remember at this point (besides
> that weird per-prog_type storage that we'd have to emulate as well)..

Yeah, there are things that will need to be emulated, but I feel like we
may end up with less confusing code (and less code in general).
> > > > > > > > Signed-off-by: Yonghong Song <yhs@fb.com> > > > > --- > > > > include/linux/bpf.h | 3 + > > > > include/linux/bpf_types.h | 1 + > > > > include/linux/cgroup-defs.h | 4 + > > > > include/uapi/linux/bpf.h | 39 +++++ > > > > kernel/bpf/Makefile | 2 +- > > > > kernel/bpf/bpf_cgroup_storage.c | 280 ++++++++++++++++++++++++++++++++ > > > > kernel/bpf/helpers.c | 6 + > > > > kernel/bpf/syscall.c | 3 +- > > > > kernel/bpf/verifier.c | 14 +- > > > > kernel/cgroup/cgroup.c | 4 + > > > > kernel/trace/bpf_trace.c | 4 + > > > > scripts/bpf_doc.py | 2 + > > > > tools/include/uapi/linux/bpf.h | 39 +++++ > > > > 13 files changed, 398 insertions(+), 3 deletions(-) > > > > create mode 100644 kernel/bpf/bpf_cgroup_storage.c > > > > > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h > > > > index 9e7d46d16032..1395a01c7f18 100644 > > > > --- a/include/linux/bpf.h > > > > +++ b/include/linux/bpf.h > > > > @@ -2045,6 +2045,7 @@ struct bpf_link *bpf_link_by_id(u32 id); > > > > > > > const struct bpf_func_proto *bpf_base_func_proto(enum bpf_func_id > > > > func_id); > > > > void bpf_task_storage_free(struct task_struct *task); > > > > +void bpf_local_cgroup_storage_free(struct cgroup *cgroup); > > > > bool bpf_prog_has_kfunc_call(const struct bpf_prog *prog); > > > > const struct btf_func_model * > > > > bpf_jit_find_kfunc_model(const struct bpf_prog *prog, > > > > @@ -2537,6 +2538,8 @@ extern const struct bpf_func_proto > > > > bpf_copy_from_user_task_proto; > > > > extern const struct bpf_func_proto bpf_set_retval_proto; > > > > extern const struct bpf_func_proto bpf_get_retval_proto; > > > > extern const struct bpf_func_proto bpf_user_ringbuf_drain_proto; > > > > +extern const struct bpf_func_proto bpf_cgroup_storage_get_proto; > > > > +extern const struct bpf_func_proto bpf_cgroup_storage_delete_proto; > > > > > > > const struct bpf_func_proto *tracing_prog_func_proto( > > > > enum bpf_func_id func_id, const struct bpf_prog *prog); > > > > diff --git 
a/include/linux/bpf_types.h b/include/linux/bpf_types.h > > > > index 2c6a4f2562a7..7a0362d7a0aa 100644 > > > > --- a/include/linux/bpf_types.h > > > > +++ b/include/linux/bpf_types.h > > > > @@ -90,6 +90,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_ARRAY, > > > > cgroup_array_map_ops) > > > > #ifdef CONFIG_CGROUP_BPF > > > > BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_STORAGE, cgroup_storage_map_ops) > > > > BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE, cgroup_storage_map_ops) > > > > +BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, > > > > cgroup_local_storage_map_ops) > > > > #endif > > > > BPF_MAP_TYPE(BPF_MAP_TYPE_HASH, htab_map_ops) > > > > BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_HASH, htab_percpu_map_ops) > > > > diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h > > > > index 4bcf56b3491c..c6f4590dda68 100644 > > > > --- a/include/linux/cgroup-defs.h > > > > +++ b/include/linux/cgroup-defs.h > > > > @@ -504,6 +504,10 @@ struct cgroup { > > > > /* Used to store internal freezer state */ > > > > struct cgroup_freezer_state freezer; > > > > > > > +#ifdef CONFIG_BPF_SYSCALL > > > > + struct bpf_local_storage __rcu *bpf_cgroup_storage; > > > > +#endif > > > > + > > > > /* ids of the ancestors at each level including self */ > > > > u64 ancestor_ids[]; > > > > }; > > > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > > > > index 17f61338f8f8..d918b4054297 100644 > > > > --- a/include/uapi/linux/bpf.h > > > > +++ b/include/uapi/linux/bpf.h > > > > @@ -935,6 +935,7 @@ enum bpf_map_type { > > > > BPF_MAP_TYPE_TASK_STORAGE, > > > > BPF_MAP_TYPE_BLOOM_FILTER, > > > > BPF_MAP_TYPE_USER_RINGBUF, > > > > + BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, > > > > }; > > > > > > > /* Note that tracing related programs such as > > > > @@ -5435,6 +5436,42 @@ union bpf_attr { > > > > * **-E2BIG** if user-space has tried to publish a sample which is > > > > * larger than the size of the ring buffer, or which cannot fit > > > > * within a struct bpf_dynptr. 
> > > > + * > > > > + * void *bpf_cgroup_local_storage_get(struct bpf_map *map, struct cgroup > > > > *cgroup, void *value, u64 flags) > > > > + * Description > > > > + * Get a bpf_local_storage from the *cgroup*. > > > > + * > > > > + * Logically, it could be thought of as getting the value from > > > > + * a *map* with *cgroup* as the **key**. From this > > > > + * perspective, the usage is not much different from > > > > + * **bpf_map_lookup_elem**\ (*map*, **&**\ *cgroup*) except this > > > > + * helper enforces the key must be a cgroup struct and the map must also > > > > + * be a **BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE**. > > > > + * > > > > + * Underneath, the value is stored locally at *cgroup* instead of > > > > + * the *map*. The *map* is used as the bpf-local-storage > > > > + * "type". The bpf-local-storage "type" (i.e. the *map*) is > > > > + * searched against all bpf_local_storage residing at *cgroup*. > > > > + * > > > > + * An optional *flags* (**BPF_LOCAL_STORAGE_GET_F_CREATE**) can be > > > > + * used such that a new bpf_local_storage will be > > > > + * created if one does not exist. *value* can be used > > > > + * together with **BPF_LOCAL_STORAGE_GET_F_CREATE** to specify > > > > + * the initial value of a bpf_local_storage. If *value* is > > > > + * **NULL**, the new bpf_local_storage will be zero initialized. > > > > + * Return > > > > + * A bpf_local_storage pointer is returned on success. > > > > + * > > > > + * **NULL** if not found or there was an error in adding > > > > + * a new bpf_local_storage. > > > > + * > > > > + * long bpf_cgroup_local_storage_delete(struct bpf_map *map, struct > > > > cgroup *cgroup) > > > > + * Description > > > > + * Delete a bpf_local_storage from a *cgroup*. > > > > + * Return > > > > + * 0 on success. > > > > + * > > > > + * **-ENOENT** if the bpf_local_storage cannot be found. > > > > */ > > > > #define ___BPF_FUNC_MAPPER(FN, ctx...) 
\ > > > > FN(unspec, 0, ##ctx) \ > > > > @@ -5647,6 +5684,8 @@ union bpf_attr { > > > > FN(tcp_raw_check_syncookie_ipv6, 207, ##ctx) \ > > > > FN(ktime_get_tai_ns, 208, ##ctx) \ > > > > FN(user_ringbuf_drain, 209, ##ctx) \ > > > > + FN(cgroup_local_storage_get, 210, ##ctx) \ > > > > + FN(cgroup_local_storage_delete, 211, ##ctx) \ > > > > /* */ > > > > > > > /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that > > > > don't > > > > diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile > > > > index 341c94f208f4..b02693f51978 100644 > > > > --- a/kernel/bpf/Makefile > > > > +++ b/kernel/bpf/Makefile > > > > @@ -25,7 +25,7 @@ ifeq ($(CONFIG_PERF_EVENTS),y) > > > > obj-$(CONFIG_BPF_SYSCALL) += stackmap.o > > > > endif > > > > ifeq ($(CONFIG_CGROUPS),y) > > > > -obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o > > > > +obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o bpf_cgroup_storage.o > > > > endif > > > > obj-$(CONFIG_CGROUP_BPF) += cgroup.o > > > > ifeq ($(CONFIG_INET),y) > > > > diff --git a/kernel/bpf/bpf_cgroup_storage.c > > > > b/kernel/bpf/bpf_cgroup_storage.c > > > > new file mode 100644 > > > > index 000000000000..9974784822da > > > > --- /dev/null > > > > +++ b/kernel/bpf/bpf_cgroup_storage.c > > > > @@ -0,0 +1,280 @@ > > > > +// SPDX-License-Identifier: GPL-2.0 > > > > +/* > > > > + * Copyright (c) 2022 Meta Platforms, Inc. and affiliates. 
> > > > + */ > > > > + > > > > +#include <linux/types.h> > > > > +#include <linux/bpf.h> > > > > +#include <linux/bpf_local_storage.h> > > > > +#include <uapi/linux/btf.h> > > > > +#include <linux/btf_ids.h> > > > > + > > > > +DEFINE_BPF_STORAGE_CACHE(cgroup_cache); > > > > + > > > > +static DEFINE_PER_CPU(int, bpf_cgroup_storage_busy); > > > > + > > > > +static void bpf_cgroup_storage_lock(void) > > > > +{ > > > > + migrate_disable(); > > > > + this_cpu_inc(bpf_cgroup_storage_busy); > > > > +} > > > > + > > > > +static void bpf_cgroup_storage_unlock(void) > > > > +{ > > > > + this_cpu_dec(bpf_cgroup_storage_busy); > > > > + migrate_enable(); > > > > +} > > > > + > > > > +static bool bpf_cgroup_storage_trylock(void) > > > > +{ > > > > + migrate_disable(); > > > > + if (unlikely(this_cpu_inc_return(bpf_cgroup_storage_busy) != 1)) { > > > > + this_cpu_dec(bpf_cgroup_storage_busy); > > > > + migrate_enable(); > > > > + return false; > > > > + } > > > > + return true; > > > > +} > > > > > > Task storage has lock/unlock/trylock; inode storage doesn't; why does > > > cgroup need it as well? > > > > > > > +static struct bpf_local_storage __rcu **cgroup_storage_ptr(void *owner) > > > > +{ > > > > + struct cgroup *cg = owner; > > > > + > > > > + return &cg->bpf_cgroup_storage; > > > > +} > > > > + > > > > +void bpf_local_cgroup_storage_free(struct cgroup *cgroup) > > > > +{ > > > > + struct bpf_local_storage *local_storage; > > > > + struct bpf_local_storage_elem *selem; > > > > + bool free_cgroup_storage = false; > > > > + struct hlist_node *n; > > > > + unsigned long flags; > > > > + > > > > + rcu_read_lock(); > > > > + local_storage = rcu_dereference(cgroup->bpf_cgroup_storage); > > > > + if (!local_storage) { > > > > + rcu_read_unlock(); > > > > + return; > > > > + } > > > > + > > > > + /* Neither the bpf_prog nor the bpf-map's syscall > > > > + * could be modifying the local_storage->list now. 
> > > > + * Thus, no elem can be added-to or deleted-from the > > > > + * local_storage->list by the bpf_prog or by the bpf-map's syscall. > > > > + * > > > > + * It is racing with bpf_local_storage_map_free() alone > > > > + * when unlinking elem from the local_storage->list and > > > > + * the map's bucket->list. > > > > + */ > > > > + bpf_cgroup_storage_lock(); > > > > + raw_spin_lock_irqsave(&local_storage->lock, flags); > > > > + hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) { > > > > + bpf_selem_unlink_map(selem); > > > > + free_cgroup_storage = > > > > + bpf_selem_unlink_storage_nolock(local_storage, selem, false, false); > > > > + } > > > > + raw_spin_unlock_irqrestore(&local_storage->lock, flags); > > > > + bpf_cgroup_storage_unlock(); > > > > + rcu_read_unlock(); > > > > + > > > > + /* free_cgroup_storage should always be true as long as > > > > + * local_storage->list was non-empty. > > > > + */ > > > > + if (free_cgroup_storage) > > > > + kfree_rcu(local_storage, rcu); > > > > +} > > > > > > > +static struct bpf_local_storage_data * > > > > +cgroup_storage_lookup(struct cgroup *cgroup, struct bpf_map *map, bool > > > > cacheit_lockit) > > > > +{ > > > > + struct bpf_local_storage *cgroup_storage; > > > > + struct bpf_local_storage_map *smap; > > > > + > > > > + cgroup_storage = rcu_dereference_check(cgroup->bpf_cgroup_storage, > > > > + bpf_rcu_lock_held()); > > > > + if (!cgroup_storage) > > > > + return NULL; > > > > + > > > > + smap = (struct bpf_local_storage_map *)map; > > > > + return bpf_local_storage_lookup(cgroup_storage, smap, cacheit_lockit); > > > > +} > > > > + > > > > +static void *bpf_cgroup_storage_lookup_elem(struct bpf_map *map, void > > > > *key) > > > > +{ > > > > + struct bpf_local_storage_data *sdata; > > > > + struct cgroup *cgroup; > > > > + int fd; > > > > + > > > > + fd = *(int *)key; > > > > + cgroup = cgroup_get_from_fd(fd); > > > > + if (IS_ERR(cgroup)) > > > > + return ERR_CAST(cgroup); > > > > + > > > > 
+ bpf_cgroup_storage_lock(); > > > > + sdata = cgroup_storage_lookup(cgroup, map, true); > > > > + bpf_cgroup_storage_unlock(); > > > > + cgroup_put(cgroup); > > > > + return sdata ? sdata->data : NULL; > > > > +} > > > > > > A lot of the above (free/lookup) seems to be copy-pasted from the task > > > storage; > > > any point in trying to generalize the common parts? > > > > > > > +static int bpf_cgroup_storage_update_elem(struct bpf_map *map, void *key, > > > > + void *value, u64 map_flags) > > > > +{ > > > > + struct bpf_local_storage_data *sdata; > > > > + struct cgroup *cgroup; > > > > + int err, fd; > > > > + > > > > + fd = *(int *)key; > > > > + cgroup = cgroup_get_from_fd(fd); > > > > + if (IS_ERR(cgroup)) > > > > + return PTR_ERR(cgroup); > > > > + > > > > + bpf_cgroup_storage_lock(); > > > > + sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map > > > > *)map, > > > > + value, map_flags, GFP_ATOMIC); > > > > + bpf_cgroup_storage_unlock(); > > > > + err = PTR_ERR_OR_ZERO(sdata); > > > > + cgroup_put(cgroup); > > > > + return err; > > > > +} > > > > + > > > > +static int cgroup_storage_delete(struct cgroup *cgroup, struct bpf_map > > > > *map) > > > > +{ > > > > + struct bpf_local_storage_data *sdata; > > > > + > > > > + sdata = cgroup_storage_lookup(cgroup, map, false); > > > > + if (!sdata) > > > > + return -ENOENT; > > > > + > > > > + bpf_selem_unlink(SELEM(sdata), true); > > > > + return 0; > > > > +} > > > > + > > > > +static int bpf_cgroup_storage_delete_elem(struct bpf_map *map, void *key) > > > > +{ > > > > + struct cgroup *cgroup; > > > > + int err, fd; > > > > + > > > > + fd = *(int *)key; > > > > + cgroup = cgroup_get_from_fd(fd); > > > > + if (IS_ERR(cgroup)) > > > > + return PTR_ERR(cgroup); > > > > + > > > > + bpf_cgroup_storage_lock(); > > > > + err = cgroup_storage_delete(cgroup, map); > > > > + bpf_cgroup_storage_unlock(); > > > > + if (err) > > > > + return err; > > > > + > > > > + cgroup_put(cgroup); > > > > + return 0; > 
> > > +} > > > > + > > > > +static int notsupp_get_next_key(struct bpf_map *map, void *key, void > > > > *next_key) > > > > +{ > > > > + return -ENOTSUPP; > > > > +} > > > > + > > > > +static struct bpf_map *cgroup_storage_map_alloc(union bpf_attr *attr) > > > > +{ > > > > + struct bpf_local_storage_map *smap; > > > > + > > > > + smap = bpf_local_storage_map_alloc(attr); > > > > + if (IS_ERR(smap)) > > > > + return ERR_CAST(smap); > > > > + > > > > + smap->cache_idx = bpf_local_storage_cache_idx_get(&cgroup_cache); > > > > + return &smap->map; > > > > +} > > > > + > > > > +static void cgroup_storage_map_free(struct bpf_map *map) > > > > +{ > > > > + struct bpf_local_storage_map *smap; > > > > + > > > > + smap = (struct bpf_local_storage_map *)map; > > > > + bpf_local_storage_cache_idx_free(&cgroup_cache, smap->cache_idx); > > > > + bpf_local_storage_map_free(smap, NULL); > > > > +} > > > > + > > > > +/* *gfp_flags* is a hidden argument provided by the verifier */ > > > > +BPF_CALL_5(bpf_cgroup_storage_get, struct bpf_map *, map, struct cgroup > > > > *, cgroup, > > > > + void *, value, u64, flags, gfp_t, gfp_flags) > > > > +{ > > > > + struct bpf_local_storage_data *sdata; > > > > + > > > > + WARN_ON_ONCE(!bpf_rcu_lock_held()); > > > > + if (flags & ~(BPF_LOCAL_STORAGE_GET_F_CREATE)) > > > > + return (unsigned long)NULL; > > > > + > > > > + if (!cgroup) > > > > + return (unsigned long)NULL; > > > > + > > > > + if (!bpf_cgroup_storage_trylock()) > > > > + return (unsigned long)NULL; > > > > + > > > > + sdata = cgroup_storage_lookup(cgroup, map, true); > > > > + if (sdata) > > > > + goto unlock; > > > > + > > > > + /* only allocate new storage, when the cgroup is refcounted */ > > > > + if (!percpu_ref_is_dying(&cgroup->self.refcnt) && > > > > + (flags & BPF_LOCAL_STORAGE_GET_F_CREATE)) > > > > + sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map > > > > *)map, > > > > + value, BPF_NOEXIST, gfp_flags); > > > > + > > > > +unlock: > > > > + 
bpf_cgroup_storage_unlock(); > > > > + return IS_ERR_OR_NULL(sdata) ? (unsigned long)NULL : (unsigned > > > > long)sdata->data; > > > > +} > > > > + > > > > +BPF_CALL_2(bpf_cgroup_storage_delete, struct bpf_map *, map, struct > > > > cgroup *, cgroup) > > > > +{ > > > > + int ret; > > > > + > > > > + WARN_ON_ONCE(!bpf_rcu_lock_held()); > > > > + if (!cgroup) > > > > + return -EINVAL; > > > > + > > > > + if (!bpf_cgroup_storage_trylock()) > > > > + return -EBUSY; > > > > + > > > > + ret = cgroup_storage_delete(cgroup, map); > > > > + bpf_cgroup_storage_unlock(); > > > > + return ret; > > > > +} > > > > + > > > > +BTF_ID_LIST_SINGLE(cgroup_storage_map_btf_ids, struct, > > > > bpf_local_storage_map) > > > > +const struct bpf_map_ops cgroup_local_storage_map_ops = { > > > > + .map_meta_equal = bpf_map_meta_equal, > > > > + .map_alloc_check = bpf_local_storage_map_alloc_check, > > > > + .map_alloc = cgroup_storage_map_alloc, > > > > + .map_free = cgroup_storage_map_free, > > > > + .map_get_next_key = notsupp_get_next_key, > > > > + .map_lookup_elem = bpf_cgroup_storage_lookup_elem, > > > > + .map_update_elem = bpf_cgroup_storage_update_elem, > > > > + .map_delete_elem = bpf_cgroup_storage_delete_elem, > > > > + .map_check_btf = bpf_local_storage_map_check_btf, > > > > + .map_btf_id = &cgroup_storage_map_btf_ids[0], > > > > + .map_owner_storage_ptr = cgroup_storage_ptr, > > > > +}; > > > > + > > > > +const struct bpf_func_proto bpf_cgroup_storage_get_proto = { > > > > + .func = bpf_cgroup_storage_get, > > > > + .gpl_only = false, > > > > + .ret_type = RET_PTR_TO_MAP_VALUE_OR_NULL, > > > > + .arg1_type = ARG_CONST_MAP_PTR, > > > > + .arg2_type = ARG_PTR_TO_BTF_ID, > > > > + .arg2_btf_id = &bpf_cgroup_btf_id[0], > > > > + .arg3_type = ARG_PTR_TO_MAP_VALUE_OR_NULL, > > > > + .arg4_type = ARG_ANYTHING, > > > > +}; > > > > + > > > > +const struct bpf_func_proto bpf_cgroup_storage_delete_proto = { > > > > + .func = bpf_cgroup_storage_delete, > > > > + .gpl_only = false, > > > 
> + .ret_type = RET_INTEGER, > > > > + .arg1_type = ARG_CONST_MAP_PTR, > > > > + .arg2_type = ARG_PTR_TO_BTF_ID, > > > > + .arg2_btf_id = &bpf_cgroup_btf_id[0], > > > > +}; > > > > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c > > > > index a6b04faed282..5c5bb08832ec 100644 > > > > --- a/kernel/bpf/helpers.c > > > > +++ b/kernel/bpf/helpers.c > > > > @@ -1663,6 +1663,12 @@ bpf_base_func_proto(enum bpf_func_id func_id) > > > > return &bpf_dynptr_write_proto; > > > > case BPF_FUNC_dynptr_data: > > > > return &bpf_dynptr_data_proto; > > > > +#ifdef CONFIG_CGROUPS > > > > + case BPF_FUNC_cgroup_local_storage_get: > > > > + return &bpf_cgroup_storage_get_proto; > > > > + case BPF_FUNC_cgroup_local_storage_delete: > > > > + return &bpf_cgroup_storage_delete_proto; > > > > +#endif > > > > default: > > > > break; > > > > } > > > > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c > > > > index 7b373a5e861f..e53c7fae6e22 100644 > > > > --- a/kernel/bpf/syscall.c > > > > +++ b/kernel/bpf/syscall.c > > > > @@ -1016,7 +1016,8 @@ static int map_check_btf(struct bpf_map *map, const > > > > struct btf *btf, > > > > map->map_type != BPF_MAP_TYPE_CGROUP_STORAGE && > > > > map->map_type != BPF_MAP_TYPE_SK_STORAGE && > > > > map->map_type != BPF_MAP_TYPE_INODE_STORAGE && > > > > - map->map_type != BPF_MAP_TYPE_TASK_STORAGE) > > > > + map->map_type != BPF_MAP_TYPE_TASK_STORAGE && > > > > + map->map_type != BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE) > > > > return -ENOTSUPP; > > > > if (map->spin_lock_off + sizeof(struct bpf_spin_lock) > > > > > map->value_size) { > > > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c > > > > index 6f6d2d511c06..f36f6a3c0d50 100644 > > > > --- a/kernel/bpf/verifier.c > > > > +++ b/kernel/bpf/verifier.c > > > > @@ -6360,6 +6360,11 @@ static int check_map_func_compatibility(struct > > > > bpf_verifier_env *env, > > > > func_id != BPF_FUNC_task_storage_delete) > > > > goto error; > > > > break; > > > > + case 
BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE: > > > > + if (func_id != BPF_FUNC_cgroup_local_storage_get && > > > > + func_id != BPF_FUNC_cgroup_local_storage_delete) > > > > + goto error; > > > > + break; > > > > case BPF_MAP_TYPE_BLOOM_FILTER: > > > > if (func_id != BPF_FUNC_map_peek_elem && > > > > func_id != BPF_FUNC_map_push_elem) > > > > @@ -6472,6 +6477,11 @@ static int check_map_func_compatibility(struct > > > > bpf_verifier_env *env, > > > > if (map->map_type != BPF_MAP_TYPE_TASK_STORAGE) > > > > goto error; > > > > break; > > > > + case BPF_FUNC_cgroup_local_storage_get: > > > > + case BPF_FUNC_cgroup_local_storage_delete: > > > > + if (map->map_type != BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE) > > > > + goto error; > > > > + break; > > > > default: > > > > break; > > > > } > > > > @@ -12713,6 +12723,7 @@ static int check_map_prog_compatibility(struct > > > > bpf_verifier_env *env, > > > > case BPF_MAP_TYPE_INODE_STORAGE: > > > > case BPF_MAP_TYPE_SK_STORAGE: > > > > case BPF_MAP_TYPE_TASK_STORAGE: > > > > + case BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE: > > > > break; > > > > default: > > > > verbose(env, > > > > @@ -14149,7 +14160,8 @@ static int do_misc_fixups(struct bpf_verifier_env > > > > *env) > > > > > > > if (insn->imm == BPF_FUNC_task_storage_get || > > > > insn->imm == BPF_FUNC_sk_storage_get || > > > > - insn->imm == BPF_FUNC_inode_storage_get) { > > > > + insn->imm == BPF_FUNC_inode_storage_get || > > > > + insn->imm == BPF_FUNC_cgroup_local_storage_get) { > > > > if (env->prog->aux->sleepable) > > > > insn_buf[0] = BPF_MOV64_IMM(BPF_REG_5, (__force __s32)GFP_KERNEL); > > > > else > > > > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c > > > > index 8ad2c267ff47..2fa2c950c7fb 100644 > > > > --- a/kernel/cgroup/cgroup.c > > > > +++ b/kernel/cgroup/cgroup.c > > > > @@ -985,6 +985,10 @@ void put_css_set_locked(struct css_set *cset) > > > > put_css_set_locked(cset->dom_cset); > > > > } > > > > > > > +#ifdef CONFIG_BPF_SYSCALL > > > > + 
bpf_local_cgroup_storage_free(cset->dfl_cgrp); > > > > +#endif > > > > + > > > > I am confused about this freeing site. It seems like this path is for > > freeing css_set's of task_structs, not for freeing the cgroup itself. > > Wouldn't we want to free the local storage when we free the cgroup > > itself? Somewhere like css_free_rwork_fn()? or did I completely miss > > the point here? > > > > > > kfree_rcu(cset, rcu_head); > > > > } > > > > > > > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c > > > > index 688552df95ca..179adaae4a9f 100644 > > > > --- a/kernel/trace/bpf_trace.c > > > > +++ b/kernel/trace/bpf_trace.c > > > > @@ -1454,6 +1454,10 @@ bpf_tracing_func_proto(enum bpf_func_id func_id, > > > > const struct bpf_prog *prog) > > > > return &bpf_get_current_cgroup_id_proto; > > > > case BPF_FUNC_get_current_ancestor_cgroup_id: > > > > return &bpf_get_current_ancestor_cgroup_id_proto; > > > > + case BPF_FUNC_cgroup_local_storage_get: > > > > + return &bpf_cgroup_storage_get_proto; > > > > + case BPF_FUNC_cgroup_local_storage_delete: > > > > + return &bpf_cgroup_storage_delete_proto; > > > > #endif > > > > case BPF_FUNC_send_signal: > > > > return &bpf_send_signal_proto; > > > > diff --git a/scripts/bpf_doc.py b/scripts/bpf_doc.py > > > > index c0e6690be82a..fdb0aff8cb5a 100755 > > > > --- a/scripts/bpf_doc.py > > > > +++ b/scripts/bpf_doc.py > > > > @@ -685,6 +685,7 @@ class PrinterHelpers(Printer): > > > > 'struct udp6_sock', > > > > 'struct unix_sock', > > > > 'struct task_struct', > > > > + 'struct cgroup', > > > > > > > 'struct __sk_buff', > > > > 'struct sk_msg_md', > > > > @@ -742,6 +743,7 @@ class PrinterHelpers(Printer): > > > > 'struct udp6_sock', > > > > 'struct unix_sock', > > > > 'struct task_struct', > > > > + 'struct cgroup', > > > > 'struct path', > > > > 'struct btf_ptr', > > > > 'struct inode', > > > > diff --git a/tools/include/uapi/linux/bpf.h > > > > b/tools/include/uapi/linux/bpf.h > > > > index 17f61338f8f8..d918b4054297 
100644 > > > > --- a/tools/include/uapi/linux/bpf.h > > > > +++ b/tools/include/uapi/linux/bpf.h > > > > @@ -935,6 +935,7 @@ enum bpf_map_type { > > > > BPF_MAP_TYPE_TASK_STORAGE, > > > > BPF_MAP_TYPE_BLOOM_FILTER, > > > > BPF_MAP_TYPE_USER_RINGBUF, > > > > + BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, > > > > }; > > > > > > > /* Note that tracing related programs such as > > > > @@ -5435,6 +5436,42 @@ union bpf_attr { > > > > * **-E2BIG** if user-space has tried to publish a sample which is > > > > * larger than the size of the ring buffer, or which cannot fit > > > > * within a struct bpf_dynptr. > > > > + * > > > > + * void *bpf_cgroup_local_storage_get(struct bpf_map *map, struct cgroup > > > > *cgroup, void *value, u64 flags) > > > > + * Description > > > > + * Get a bpf_local_storage from the *cgroup*. > > > > + * > > > > + * Logically, it could be thought of as getting the value from > > > > + * a *map* with *cgroup* as the **key**. From this > > > > + * perspective, the usage is not much different from > > > > + * **bpf_map_lookup_elem**\ (*map*, **&**\ *cgroup*) except this > > > > + * helper enforces the key must be a cgroup struct and the map must also > > > > + * be a **BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE**. > > > > + * > > > > + * Underneath, the value is stored locally at *cgroup* instead of > > > > + * the *map*. The *map* is used as the bpf-local-storage > > > > + * "type". The bpf-local-storage "type" (i.e. the *map*) is > > > > + * searched against all bpf_local_storage residing at *cgroup*. > > > > + * > > > > + * An optional *flags* (**BPF_LOCAL_STORAGE_GET_F_CREATE**) can be > > > > + * used such that a new bpf_local_storage will be > > > > + * created if one does not exist. *value* can be used > > > > + * together with **BPF_LOCAL_STORAGE_GET_F_CREATE** to specify > > > > + * the initial value of a bpf_local_storage. If *value* is > > > > + * **NULL**, the new bpf_local_storage will be zero initialized. 
> > > > + * Return > > > > + * A bpf_local_storage pointer is returned on success. > > > > + * > > > > + * **NULL** if not found or there was an error in adding > > > > + * a new bpf_local_storage. > > > > + * > > > > + * long bpf_cgroup_local_storage_delete(struct bpf_map *map, struct > > > > cgroup *cgroup) > > > > + * Description > > > > + * Delete a bpf_local_storage from a *cgroup*. > > > > + * Return > > > > + * 0 on success. > > > > + * > > > > + * **-ENOENT** if the bpf_local_storage cannot be found. > > > > */ > > > > #define ___BPF_FUNC_MAPPER(FN, ctx...) \ > > > > FN(unspec, 0, ##ctx) \ > > > > @@ -5647,6 +5684,8 @@ union bpf_attr { > > > > FN(tcp_raw_check_syncookie_ipv6, 207, ##ctx) \ > > > > FN(ktime_get_tai_ns, 208, ##ctx) \ > > > > FN(user_ringbuf_drain, 209, ##ctx) \ > > > > + FN(cgroup_local_storage_get, 210, ##ctx) \ > > > > + FN(cgroup_local_storage_delete, 211, ##ctx) \ > > > > /* */ > > > > > > > /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that > > > > don't > > > > -- > > > > 2.30.2 > > > ^ permalink raw reply [flat|nested] 38+ messages in thread
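The get semantics documented in the uapi comment above (lookup keyed by *cgroup*, optional **BPF_LOCAL_STORAGE_GET_F_CREATE**, zero-initialized or *value*-initialized creation, **NULL** when absent) can be sketched as a small userspace model. The `toy_*` names, the fixed 8-byte value size, and the flag constant below are illustrative stand-ins only, not kernel code:

```c
#include <stddef.h>
#include <string.h>

#define TOY_VALUE_SZ 8
#define TOY_GET_F_CREATE 1ULL  /* stand-in for BPF_LOCAL_STORAGE_GET_F_CREATE */

/* Toy owner: one storage slot per "cgroup". */
struct toy_cgroup {
    int has_storage;
    unsigned char data[TOY_VALUE_SZ];
};

/* Mirrors the documented behavior: return existing storage if present;
 * if absent, create it only when F_CREATE is set, initialized from
 * *value* when given and zeroed otherwise; NULL in all other cases. */
static void *toy_storage_get(struct toy_cgroup *cg, const void *value,
                             unsigned long long flags)
{
    if (flags & ~TOY_GET_F_CREATE)
        return NULL;                 /* unknown flags fail, like the helper */
    if (cg->has_storage)
        return cg->data;             /* found existing storage */
    if (!(flags & TOY_GET_F_CREATE))
        return NULL;                 /* not found, creation not requested */
    if (value)
        memcpy(cg->data, value, TOY_VALUE_SZ); /* caller-supplied init */
    else
        memset(cg->data, 0, TOY_VALUE_SZ);     /* zero-initialized */
    cg->has_storage = 1;
    return cg->data;
}
```

Because the helper may return **NULL**, BPF-side callers must NULL-check the result before dereferencing, unlike the old attach-time bpf_get_local_storage().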
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs 2022-10-17 18:47 ` Yosry Ahmed @ 2022-10-17 19:07 ` Stanislav Fomichev 2022-10-17 19:11 ` Yosry Ahmed 2022-10-17 20:15 ` Yonghong Song 1 sibling, 1 reply; 38+ messages in thread From: Stanislav Fomichev @ 2022-10-17 19:07 UTC (permalink / raw) To: Yosry Ahmed Cc: Yonghong Song, bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo On Mon, Oct 17, 2022 at 11:47 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > On Mon, Oct 17, 2022 at 11:43 AM Stanislav Fomichev <sdf@google.com> wrote: > > > > On Mon, Oct 17, 2022 at 11:26 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > On Mon, Oct 17, 2022 at 11:02 AM <sdf@google.com> wrote: > > > > > > > > On 10/13, Yonghong Song wrote: > > > > > Similar to sk/inode/task storage, implement similar cgroup local storage. > > > > > > > > > There already exists a local storage implementation for cgroup-attached > > > > > bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper > > > > > bpf_get_local_storage(). But there are use cases such that non-cgroup > > > > > attached bpf progs wants to access cgroup local storage data. For example, > > > > > tc egress prog has access to sk and cgroup. It is possible to use > > > > > sk local storage to emulate cgroup local storage by storing data in > > > > > socket. > > > > > But this is a waste as it could be lots of sockets belonging to a > > > > > particular > > > > > cgroup. Alternatively, a separate map can be created with cgroup id as > > > > > the key. > > > > > But this will introduce additional overhead to manipulate the new map. > > > > > A cgroup local storage, similar to existing sk/inode/task storage, > > > > > should help for this use case. > > > > > > > > > The life-cycle of storage is managed with the life-cycle of the > > > > > cgroup struct. i.e. 
> > > > > the storage is destroyed along with the owning cgroup
> > > > > with a callback to the bpf_cgroup_storage_free when cgroup itself
> > > > > is deleted.
> > > > >
> > > > > The userspace map operations can be done by using a cgroup fd as a key
> > > > > passed to the lookup, update and delete operations.
> > > >
> > > > [..]
> > > >
> > > > > Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old cgroup
> > > > > local storage support, the new map name BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE
> > > > > is used for cgroup storage available to non-cgroup-attached bpf programs.
> > > > > The two helpers are named as bpf_cgroup_local_storage_get() and
> > > > > bpf_cgroup_local_storage_delete().
> > > >
> > > > Have you considered doing something similar to 7d9c3427894f ("bpf: Make
> > > > cgroup storages shared between programs on the same cgroup") where
> > > > the map changes its behavior depending on the key size (see key_size checks
> > > > in cgroup_storage_map_alloc)? Looks like sizeof(int) for fd still
> > > > can be used so we can, in theory, reuse the name..
> > > >
> > > > Pros:
> > > > - no need for a new map name
> > > >
> > > > Cons:
> > > > - existing BPF_MAP_TYPE_CGROUP_STORAGE is already messy; might be not a
> > > >   good idea to add more stuff to it?
> > > >
> > > > But, for the very least, should we also extend
> > > > Documentation/bpf/map_cgroup_storage.rst to cover the new map? We've
> > > > tried to keep some of the important details in there..
> > >
> > > This might be a long shot, but is it possible to switch completely to
> > > this new generic cgroup storage, and for programs that attach to
> > > cgroups we can still do lookups/allocations during attachment like we
> > > do today? IOW, maintain the current API for cgroup progs but switch it
> > > to use this new map type instead.
> > >
> > > It feels like this map type is more generic and can be a superset of
> > > the existing cgroup storage, but I feel like I am missing something.
> >
> > I feel like the biggest issue is that the existing
> > bpf_get_local_storage helper is guaranteed to always return non-null
> > and the verifier doesn't require the programs to do null checks on it;
> > the new helper might return NULL making all existing programs fail the
> > verifier.
>
> What I meant is, keep the old bpf_get_local_storage helper only for
> cgroup-attached programs like we have today, and add a new generic
> bpf_cgroup_local_storage_get() helper.
>
> For cgroup-attached programs, make sure a cgroup storage entry is
> allocated and hooked to the helper on program attach time, to keep
> today's behavior constant.
>
> For other programs, the bpf_cgroup_local_storage_get() will do the
> normal lookup and allocate if necessary.
>
> Does this make any sense to you?

But then you also need to somehow mark these to make sure it's not
possible to delete them as long as the program is loaded/attached? Not
saying it's impossible, but it's a bit of a departure from the existing
common local storage framework used by inode/task; not sure whether we
want to pull all this complexity in there? But we can definitely try if
there is a wider agreement..

> > There might be something else I don't remember at this point (besides
> > that weird per-prog_type that we'd have to emulate as well)..
>
> Yeah there are things that will need to be emulated, but I feel like
> we may end up with less confusing code (and less code in general).
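The split discussed above — preallocate storage at attach time so the legacy-style getter can stay non-NULL, while the generic getter does a plain lookup that may miss — can be modeled in a few lines of userspace C. Every name here (`toy_attach`, `toy_get_attached`, `toy_get_generic`, `struct toy_cgroup`) is an illustrative stand-in for the idea being discussed, not a kernel API:

```c
#include <stddef.h>

struct toy_cgroup { void *storage; };

static char toy_slot[64];  /* stands in for an allocated storage element */

/* Attach-time path: guarantee storage exists before the program runs,
 * which is what lets the old bpf_get_local_storage-style getter skip
 * NULL checks in the verifier. */
static int toy_attach(struct toy_cgroup *cg)
{
    if (!cg->storage)
        cg->storage = toy_slot;
    return 0;
}

/* Old-style getter: only reachable after attach, hence never NULL. */
static void *toy_get_attached(struct toy_cgroup *cg)
{
    return cg->storage;
}

/* New-style generic getter: a plain lookup that may return NULL when
 * nothing was ever created for this owner. */
static void *toy_get_generic(struct toy_cgroup *cg)
{
    return cg->storage;
}
```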
> > > > > > > > > > > > Signed-off-by: Yonghong Song <yhs@fb.com> > > > > > --- > > > > > include/linux/bpf.h | 3 + > > > > > include/linux/bpf_types.h | 1 + > > > > > include/linux/cgroup-defs.h | 4 + > > > > > include/uapi/linux/bpf.h | 39 +++++ > > > > > kernel/bpf/Makefile | 2 +- > > > > > kernel/bpf/bpf_cgroup_storage.c | 280 ++++++++++++++++++++++++++++++++ > > > > > kernel/bpf/helpers.c | 6 + > > > > > kernel/bpf/syscall.c | 3 +- > > > > > kernel/bpf/verifier.c | 14 +- > > > > > kernel/cgroup/cgroup.c | 4 + > > > > > kernel/trace/bpf_trace.c | 4 + > > > > > scripts/bpf_doc.py | 2 + > > > > > tools/include/uapi/linux/bpf.h | 39 +++++ > > > > > 13 files changed, 398 insertions(+), 3 deletions(-) > > > > > create mode 100644 kernel/bpf/bpf_cgroup_storage.c > > > > > > > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h > > > > > index 9e7d46d16032..1395a01c7f18 100644 > > > > > --- a/include/linux/bpf.h > > > > > +++ b/include/linux/bpf.h > > > > > @@ -2045,6 +2045,7 @@ struct bpf_link *bpf_link_by_id(u32 id); > > > > > > > > > const struct bpf_func_proto *bpf_base_func_proto(enum bpf_func_id > > > > > func_id); > > > > > void bpf_task_storage_free(struct task_struct *task); > > > > > +void bpf_local_cgroup_storage_free(struct cgroup *cgroup); > > > > > bool bpf_prog_has_kfunc_call(const struct bpf_prog *prog); > > > > > const struct btf_func_model * > > > > > bpf_jit_find_kfunc_model(const struct bpf_prog *prog, > > > > > @@ -2537,6 +2538,8 @@ extern const struct bpf_func_proto > > > > > bpf_copy_from_user_task_proto; > > > > > extern const struct bpf_func_proto bpf_set_retval_proto; > > > > > extern const struct bpf_func_proto bpf_get_retval_proto; > > > > > extern const struct bpf_func_proto bpf_user_ringbuf_drain_proto; > > > > > +extern const struct bpf_func_proto bpf_cgroup_storage_get_proto; > > > > > +extern const struct bpf_func_proto bpf_cgroup_storage_delete_proto; > > > > > > > > > const struct bpf_func_proto *tracing_prog_func_proto( > > > 
> > enum bpf_func_id func_id, const struct bpf_prog *prog); > > > > > diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h > > > > > index 2c6a4f2562a7..7a0362d7a0aa 100644 > > > > > --- a/include/linux/bpf_types.h > > > > > +++ b/include/linux/bpf_types.h > > > > > @@ -90,6 +90,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_ARRAY, > > > > > cgroup_array_map_ops) > > > > > #ifdef CONFIG_CGROUP_BPF > > > > > BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_STORAGE, cgroup_storage_map_ops) > > > > > BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE, cgroup_storage_map_ops) > > > > > +BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, > > > > > cgroup_local_storage_map_ops) > > > > > #endif > > > > > BPF_MAP_TYPE(BPF_MAP_TYPE_HASH, htab_map_ops) > > > > > BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_HASH, htab_percpu_map_ops) > > > > > diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h > > > > > index 4bcf56b3491c..c6f4590dda68 100644 > > > > > --- a/include/linux/cgroup-defs.h > > > > > +++ b/include/linux/cgroup-defs.h > > > > > @@ -504,6 +504,10 @@ struct cgroup { > > > > > /* Used to store internal freezer state */ > > > > > struct cgroup_freezer_state freezer; > > > > > > > > > +#ifdef CONFIG_BPF_SYSCALL > > > > > + struct bpf_local_storage __rcu *bpf_cgroup_storage; > > > > > +#endif > > > > > + > > > > > /* ids of the ancestors at each level including self */ > > > > > u64 ancestor_ids[]; > > > > > }; > > > > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > > > > > index 17f61338f8f8..d918b4054297 100644 > > > > > --- a/include/uapi/linux/bpf.h > > > > > +++ b/include/uapi/linux/bpf.h > > > > > @@ -935,6 +935,7 @@ enum bpf_map_type { > > > > > BPF_MAP_TYPE_TASK_STORAGE, > > > > > BPF_MAP_TYPE_BLOOM_FILTER, > > > > > BPF_MAP_TYPE_USER_RINGBUF, > > > > > + BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, > > > > > }; > > > > > > > > > /* Note that tracing related programs such as > > > > > @@ -5435,6 +5436,42 @@ union bpf_attr { > > > > > * **-E2BIG** if user-space 
has tried to publish a sample which is > > > > > * larger than the size of the ring buffer, or which cannot fit > > > > > * within a struct bpf_dynptr. > > > > > + * > > > > > + * void *bpf_cgroup_local_storage_get(struct bpf_map *map, struct cgroup > > > > > *cgroup, void *value, u64 flags) > > > > > + * Description > > > > > + * Get a bpf_local_storage from the *cgroup*. > > > > > + * > > > > > + * Logically, it could be thought of as getting the value from > > > > > + * a *map* with *cgroup* as the **key**. From this > > > > > + * perspective, the usage is not much different from > > > > > + * **bpf_map_lookup_elem**\ (*map*, **&**\ *cgroup*) except this > > > > > + * helper enforces the key must be a cgroup struct and the map must also > > > > > + * be a **BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE**. > > > > > + * > > > > > + * Underneath, the value is stored locally at *cgroup* instead of > > > > > + * the *map*. The *map* is used as the bpf-local-storage > > > > > + * "type". The bpf-local-storage "type" (i.e. the *map*) is > > > > > + * searched against all bpf_local_storage residing at *cgroup*. > > > > > + * > > > > > + * An optional *flags* (**BPF_LOCAL_STORAGE_GET_F_CREATE**) can be > > > > > + * used such that a new bpf_local_storage will be > > > > > + * created if one does not exist. *value* can be used > > > > > + * together with **BPF_LOCAL_STORAGE_GET_F_CREATE** to specify > > > > > + * the initial value of a bpf_local_storage. If *value* is > > > > > + * **NULL**, the new bpf_local_storage will be zero initialized. > > > > > + * Return > > > > > + * A bpf_local_storage pointer is returned on success. > > > > > + * > > > > > + * **NULL** if not found or there was an error in adding > > > > > + * a new bpf_local_storage. > > > > > + * > > > > > + * long bpf_cgroup_local_storage_delete(struct bpf_map *map, struct > > > > > cgroup *cgroup) > > > > > + * Description > > > > > + * Delete a bpf_local_storage from a *cgroup*. 
> > > > > + * Return > > > > > + * 0 on success. > > > > > + * > > > > > + * **-ENOENT** if the bpf_local_storage cannot be found. > > > > > */ > > > > > #define ___BPF_FUNC_MAPPER(FN, ctx...) \ > > > > > FN(unspec, 0, ##ctx) \ > > > > > @@ -5647,6 +5684,8 @@ union bpf_attr { > > > > > FN(tcp_raw_check_syncookie_ipv6, 207, ##ctx) \ > > > > > FN(ktime_get_tai_ns, 208, ##ctx) \ > > > > > FN(user_ringbuf_drain, 209, ##ctx) \ > > > > > + FN(cgroup_local_storage_get, 210, ##ctx) \ > > > > > + FN(cgroup_local_storage_delete, 211, ##ctx) \ > > > > > /* */ > > > > > > > > > /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that > > > > > don't > > > > > diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile > > > > > index 341c94f208f4..b02693f51978 100644 > > > > > --- a/kernel/bpf/Makefile > > > > > +++ b/kernel/bpf/Makefile > > > > > @@ -25,7 +25,7 @@ ifeq ($(CONFIG_PERF_EVENTS),y) > > > > > obj-$(CONFIG_BPF_SYSCALL) += stackmap.o > > > > > endif > > > > > ifeq ($(CONFIG_CGROUPS),y) > > > > > -obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o > > > > > +obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o bpf_cgroup_storage.o > > > > > endif > > > > > obj-$(CONFIG_CGROUP_BPF) += cgroup.o > > > > > ifeq ($(CONFIG_INET),y) > > > > > diff --git a/kernel/bpf/bpf_cgroup_storage.c > > > > > b/kernel/bpf/bpf_cgroup_storage.c > > > > > new file mode 100644 > > > > > index 000000000000..9974784822da > > > > > --- /dev/null > > > > > +++ b/kernel/bpf/bpf_cgroup_storage.c > > > > > @@ -0,0 +1,280 @@ > > > > > +// SPDX-License-Identifier: GPL-2.0 > > > > > +/* > > > > > + * Copyright (c) 2022 Meta Platforms, Inc. and affiliates. 
> > > > > + */ > > > > > + > > > > > +#include <linux/types.h> > > > > > +#include <linux/bpf.h> > > > > > +#include <linux/bpf_local_storage.h> > > > > > +#include <uapi/linux/btf.h> > > > > > +#include <linux/btf_ids.h> > > > > > + > > > > > +DEFINE_BPF_STORAGE_CACHE(cgroup_cache); > > > > > + > > > > > +static DEFINE_PER_CPU(int, bpf_cgroup_storage_busy); > > > > > + > > > > > +static void bpf_cgroup_storage_lock(void) > > > > > +{ > > > > > + migrate_disable(); > > > > > + this_cpu_inc(bpf_cgroup_storage_busy); > > > > > +} > > > > > + > > > > > +static void bpf_cgroup_storage_unlock(void) > > > > > +{ > > > > > + this_cpu_dec(bpf_cgroup_storage_busy); > > > > > + migrate_enable(); > > > > > +} > > > > > + > > > > > +static bool bpf_cgroup_storage_trylock(void) > > > > > +{ > > > > > + migrate_disable(); > > > > > + if (unlikely(this_cpu_inc_return(bpf_cgroup_storage_busy) != 1)) { > > > > > + this_cpu_dec(bpf_cgroup_storage_busy); > > > > > + migrate_enable(); > > > > > + return false; > > > > > + } > > > > > + return true; > > > > > +} > > > > > > > > Task storage has lock/unlock/trylock; inode storage doesn't; why does > > > > cgroup need it as well? 
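On what the trylock above buys: bpf_cgroup_storage_busy acts as a per-CPU recursion guard, so a storage operation that re-enters on the same CPU (e.g. a tracing program firing while another storage operation already holds the guard) fails fast instead of recursing into local_storage->lock — presumably needed here, as with task storage, because these helpers are exposed to tracing programs. A single-CPU userspace sketch of that guard, with migrate_disable() and the this_cpu_* accessors replaced by a plain counter (names illustrative):

```c
#include <stdbool.h>

/* Single-CPU analogue of the per-CPU bpf_cgroup_storage_busy counter. */
static int toy_busy;

/* Succeeds only for the outermost caller on this "CPU"; a nested
 * attempt sees the counter already raised and backs off. */
static bool toy_storage_trylock(void)
{
    if (++toy_busy != 1) {   /* already inside a storage op */
        --toy_busy;
        return false;        /* caller returns -EBUSY / NULL instead */
    }
    return true;
}

static void toy_storage_unlock(void)
{
    --toy_busy;
}
```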
> > > > > > > > > +static struct bpf_local_storage __rcu **cgroup_storage_ptr(void *owner) > > > > > +{ > > > > > + struct cgroup *cg = owner; > > > > > + > > > > > + return &cg->bpf_cgroup_storage; > > > > > +} > > > > > + > > > > > +void bpf_local_cgroup_storage_free(struct cgroup *cgroup) > > > > > +{ > > > > > + struct bpf_local_storage *local_storage; > > > > > + struct bpf_local_storage_elem *selem; > > > > > + bool free_cgroup_storage = false; > > > > > + struct hlist_node *n; > > > > > + unsigned long flags; > > > > > + > > > > > + rcu_read_lock(); > > > > > + local_storage = rcu_dereference(cgroup->bpf_cgroup_storage); > > > > > + if (!local_storage) { > > > > > + rcu_read_unlock(); > > > > > + return; > > > > > + } > > > > > + > > > > > + /* Neither the bpf_prog nor the bpf-map's syscall > > > > > + * could be modifying the local_storage->list now. > > > > > + * Thus, no elem can be added-to or deleted-from the > > > > > + * local_storage->list by the bpf_prog or by the bpf-map's syscall. > > > > > + * > > > > > + * It is racing with bpf_local_storage_map_free() alone > > > > > + * when unlinking elem from the local_storage->list and > > > > > + * the map's bucket->list. > > > > > + */ > > > > > + bpf_cgroup_storage_lock(); > > > > > + raw_spin_lock_irqsave(&local_storage->lock, flags); > > > > > + hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) { > > > > > + bpf_selem_unlink_map(selem); > > > > > + free_cgroup_storage = > > > > > + bpf_selem_unlink_storage_nolock(local_storage, selem, false, false); > > > > > + } > > > > > + raw_spin_unlock_irqrestore(&local_storage->lock, flags); > > > > > + bpf_cgroup_storage_unlock(); > > > > > + rcu_read_unlock(); > > > > > + > > > > > + /* free_cgroup_storage should always be true as long as > > > > > + * local_storage->list was non-empty. 
> > > > > + */ > > > > > + if (free_cgroup_storage) > > > > > + kfree_rcu(local_storage, rcu); > > > > > +} > > > > > > > > > +static struct bpf_local_storage_data * > > > > > +cgroup_storage_lookup(struct cgroup *cgroup, struct bpf_map *map, bool > > > > > cacheit_lockit) > > > > > +{ > > > > > + struct bpf_local_storage *cgroup_storage; > > > > > + struct bpf_local_storage_map *smap; > > > > > + > > > > > + cgroup_storage = rcu_dereference_check(cgroup->bpf_cgroup_storage, > > > > > + bpf_rcu_lock_held()); > > > > > + if (!cgroup_storage) > > > > > + return NULL; > > > > > + > > > > > + smap = (struct bpf_local_storage_map *)map; > > > > > + return bpf_local_storage_lookup(cgroup_storage, smap, cacheit_lockit); > > > > > +} > > > > > + > > > > > +static void *bpf_cgroup_storage_lookup_elem(struct bpf_map *map, void > > > > > *key) > > > > > +{ > > > > > + struct bpf_local_storage_data *sdata; > > > > > + struct cgroup *cgroup; > > > > > + int fd; > > > > > + > > > > > + fd = *(int *)key; > > > > > + cgroup = cgroup_get_from_fd(fd); > > > > > + if (IS_ERR(cgroup)) > > > > > + return ERR_CAST(cgroup); > > > > > + > > > > > + bpf_cgroup_storage_lock(); > > > > > + sdata = cgroup_storage_lookup(cgroup, map, true); > > > > > + bpf_cgroup_storage_unlock(); > > > > > + cgroup_put(cgroup); > > > > > + return sdata ? sdata->data : NULL; > > > > > +} > > > > > > > > A lot of the above (free/lookup) seems to be copy-pasted from the task > > > > storage; > > > > any point in trying to generalize the common parts? 
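On the generalization question: the per-owner free/lookup wrappers differ mainly in how they reach the owner's bpf_local_storage pointer, and the patch's own map_owner_storage_ptr callback already abstracts exactly that. A minimal userspace sketch of the factoring being suggested — all `toy_*` names are illustrative stand-ins, not proposed kernel symbols:

```c
#include <stddef.h>

struct toy_storage { int dummy; };

/* Analogue of map_owner_storage_ptr in bpf_map_ops: given an opaque
 * owner, return the address of its storage pointer. */
struct toy_map_ops {
    struct toy_storage **(*owner_storage_ptr)(void *owner);
};

/* One shared lookup, instead of near-identical copies per owner type. */
static struct toy_storage *toy_generic_lookup(const struct toy_map_ops *ops,
                                              void *owner)
{
    struct toy_storage **slot = ops->owner_storage_ptr(owner);

    return slot ? *slot : NULL;
}

/* cgroup-flavored owner, mirroring cgroup_storage_ptr() above. */
struct toy_cgroup { struct toy_storage *bpf_cgroup_storage; };

static struct toy_storage **toy_cgroup_owner_ptr(void *owner)
{
    return &((struct toy_cgroup *)owner)->bpf_cgroup_storage;
}
```

task and inode storage could plug in their own `owner_storage_ptr` equivalents against the same shared core.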
> > > > > > > > > +static int bpf_cgroup_storage_update_elem(struct bpf_map *map, void *key, > > > > > + void *value, u64 map_flags) > > > > > +{ > > > > > + struct bpf_local_storage_data *sdata; > > > > > + struct cgroup *cgroup; > > > > > + int err, fd; > > > > > + > > > > > + fd = *(int *)key; > > > > > + cgroup = cgroup_get_from_fd(fd); > > > > > + if (IS_ERR(cgroup)) > > > > > + return PTR_ERR(cgroup); > > > > > + > > > > > + bpf_cgroup_storage_lock(); > > > > > + sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map > > > > > *)map, > > > > > + value, map_flags, GFP_ATOMIC); > > > > > + bpf_cgroup_storage_unlock(); > > > > > + err = PTR_ERR_OR_ZERO(sdata); > > > > > + cgroup_put(cgroup); > > > > > + return err; > > > > > +} > > > > > + > > > > > +static int cgroup_storage_delete(struct cgroup *cgroup, struct bpf_map > > > > > *map) > > > > > +{ > > > > > + struct bpf_local_storage_data *sdata; > > > > > + > > > > > + sdata = cgroup_storage_lookup(cgroup, map, false); > > > > > + if (!sdata) > > > > > + return -ENOENT; > > > > > + > > > > > + bpf_selem_unlink(SELEM(sdata), true); > > > > > + return 0; > > > > > +} > > > > > + > > > > > +static int bpf_cgroup_storage_delete_elem(struct bpf_map *map, void *key) > > > > > +{ > > > > > + struct cgroup *cgroup; > > > > > + int err, fd; > > > > > + > > > > > + fd = *(int *)key; > > > > > + cgroup = cgroup_get_from_fd(fd); > > > > > + if (IS_ERR(cgroup)) > > > > > + return PTR_ERR(cgroup); > > > > > + > > > > > + bpf_cgroup_storage_lock(); > > > > > + err = cgroup_storage_delete(cgroup, map); > > > > > + bpf_cgroup_storage_unlock(); > > > > > + if (err) > > > > > + return err; > > > > > + > > > > > + cgroup_put(cgroup); > > > > > + return 0; > > > > > +} > > > > > + > > > > > +static int notsupp_get_next_key(struct bpf_map *map, void *key, void > > > > > *next_key) > > > > > +{ > > > > > + return -ENOTSUPP; > > > > > +} > > > > > + > > > > > +static struct bpf_map *cgroup_storage_map_alloc(union 
bpf_attr *attr) > > > > > +{ > > > > > + struct bpf_local_storage_map *smap; > > > > > + > > > > > + smap = bpf_local_storage_map_alloc(attr); > > > > > + if (IS_ERR(smap)) > > > > > + return ERR_CAST(smap); > > > > > + > > > > > + smap->cache_idx = bpf_local_storage_cache_idx_get(&cgroup_cache); > > > > > + return &smap->map; > > > > > +} > > > > > + > > > > > +static void cgroup_storage_map_free(struct bpf_map *map) > > > > > +{ > > > > > + struct bpf_local_storage_map *smap; > > > > > + > > > > > + smap = (struct bpf_local_storage_map *)map; > > > > > + bpf_local_storage_cache_idx_free(&cgroup_cache, smap->cache_idx); > > > > > + bpf_local_storage_map_free(smap, NULL); > > > > > +} > > > > > + > > > > > +/* *gfp_flags* is a hidden argument provided by the verifier */ > > > > > +BPF_CALL_5(bpf_cgroup_storage_get, struct bpf_map *, map, struct cgroup > > > > > *, cgroup, > > > > > + void *, value, u64, flags, gfp_t, gfp_flags) > > > > > +{ > > > > > + struct bpf_local_storage_data *sdata; > > > > > + > > > > > + WARN_ON_ONCE(!bpf_rcu_lock_held()); > > > > > + if (flags & ~(BPF_LOCAL_STORAGE_GET_F_CREATE)) > > > > > + return (unsigned long)NULL; > > > > > + > > > > > + if (!cgroup) > > > > > + return (unsigned long)NULL; > > > > > + > > > > > + if (!bpf_cgroup_storage_trylock()) > > > > > + return (unsigned long)NULL; > > > > > + > > > > > + sdata = cgroup_storage_lookup(cgroup, map, true); > > > > > + if (sdata) > > > > > + goto unlock; > > > > > + > > > > > + /* only allocate new storage, when the cgroup is refcounted */ > > > > > + if (!percpu_ref_is_dying(&cgroup->self.refcnt) && > > > > > + (flags & BPF_LOCAL_STORAGE_GET_F_CREATE)) > > > > > + sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map > > > > > *)map, > > > > > + value, BPF_NOEXIST, gfp_flags); > > > > > + > > > > > +unlock: > > > > > + bpf_cgroup_storage_unlock(); > > > > > + return IS_ERR_OR_NULL(sdata) ? 
(unsigned long)NULL : (unsigned > > > > > long)sdata->data; > > > > > +} > > > > > + > > > > > +BPF_CALL_2(bpf_cgroup_storage_delete, struct bpf_map *, map, struct > > > > > cgroup *, cgroup) > > > > > +{ > > > > > + int ret; > > > > > + > > > > > + WARN_ON_ONCE(!bpf_rcu_lock_held()); > > > > > + if (!cgroup) > > > > > + return -EINVAL; > > > > > + > > > > > + if (!bpf_cgroup_storage_trylock()) > > > > > + return -EBUSY; > > > > > + > > > > > + ret = cgroup_storage_delete(cgroup, map); > > > > > + bpf_cgroup_storage_unlock(); > > > > > + return ret; > > > > > +} > > > > > + > > > > > +BTF_ID_LIST_SINGLE(cgroup_storage_map_btf_ids, struct, > > > > > bpf_local_storage_map) > > > > > +const struct bpf_map_ops cgroup_local_storage_map_ops = { > > > > > + .map_meta_equal = bpf_map_meta_equal, > > > > > + .map_alloc_check = bpf_local_storage_map_alloc_check, > > > > > + .map_alloc = cgroup_storage_map_alloc, > > > > > + .map_free = cgroup_storage_map_free, > > > > > + .map_get_next_key = notsupp_get_next_key, > > > > > + .map_lookup_elem = bpf_cgroup_storage_lookup_elem, > > > > > + .map_update_elem = bpf_cgroup_storage_update_elem, > > > > > + .map_delete_elem = bpf_cgroup_storage_delete_elem, > > > > > + .map_check_btf = bpf_local_storage_map_check_btf, > > > > > + .map_btf_id = &cgroup_storage_map_btf_ids[0], > > > > > + .map_owner_storage_ptr = cgroup_storage_ptr, > > > > > +}; > > > > > + > > > > > +const struct bpf_func_proto bpf_cgroup_storage_get_proto = { > > > > > + .func = bpf_cgroup_storage_get, > > > > > + .gpl_only = false, > > > > > + .ret_type = RET_PTR_TO_MAP_VALUE_OR_NULL, > > > > > + .arg1_type = ARG_CONST_MAP_PTR, > > > > > + .arg2_type = ARG_PTR_TO_BTF_ID, > > > > > + .arg2_btf_id = &bpf_cgroup_btf_id[0], > > > > > + .arg3_type = ARG_PTR_TO_MAP_VALUE_OR_NULL, > > > > > + .arg4_type = ARG_ANYTHING, > > > > > +}; > > > > > + > > > > > +const struct bpf_func_proto bpf_cgroup_storage_delete_proto = { > > > > > + .func = bpf_cgroup_storage_delete, > > > > 
> > > > > + .gpl_only = false,
> > > > > + .ret_type = RET_INTEGER,
> > > > > + .arg1_type = ARG_CONST_MAP_PTR,
> > > > > + .arg2_type = ARG_PTR_TO_BTF_ID,
> > > > > + .arg2_btf_id = &bpf_cgroup_btf_id[0],
> > > > > +};
> > > > > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> > > > > index a6b04faed282..5c5bb08832ec 100644
> > > > > --- a/kernel/bpf/helpers.c
> > > > > +++ b/kernel/bpf/helpers.c
> > > > > @@ -1663,6 +1663,12 @@ bpf_base_func_proto(enum bpf_func_id func_id)
> > > > > return &bpf_dynptr_write_proto;
> > > > > case BPF_FUNC_dynptr_data:
> > > > > return &bpf_dynptr_data_proto;
> > > > > +#ifdef CONFIG_CGROUPS
> > > > > + case BPF_FUNC_cgroup_local_storage_get:
> > > > > + return &bpf_cgroup_storage_get_proto;
> > > > > + case BPF_FUNC_cgroup_local_storage_delete:
> > > > > + return &bpf_cgroup_storage_delete_proto;
> > > > > +#endif
> > > > > default:
> > > > > break;
> > > > > }
> > > > > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > > > > index 7b373a5e861f..e53c7fae6e22 100644
> > > > > --- a/kernel/bpf/syscall.c
> > > > > +++ b/kernel/bpf/syscall.c
> > > > > @@ -1016,7 +1016,8 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
> > > > > map->map_type != BPF_MAP_TYPE_CGROUP_STORAGE &&
> > > > > map->map_type != BPF_MAP_TYPE_SK_STORAGE &&
> > > > > map->map_type != BPF_MAP_TYPE_INODE_STORAGE &&
> > > > > - map->map_type != BPF_MAP_TYPE_TASK_STORAGE)
> > > > > + map->map_type != BPF_MAP_TYPE_TASK_STORAGE &&
> > > > > + map->map_type != BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE)
> > > > > return -ENOTSUPP;
> > > > > if (map->spin_lock_off + sizeof(struct bpf_spin_lock) > map->value_size) {
> > > > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > > > > index 6f6d2d511c06..f36f6a3c0d50 100644
> > > > > --- a/kernel/bpf/verifier.c
> > > > > +++ b/kernel/bpf/verifier.c
> > > > > @@ -6360,6 +6360,11 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
> > > > > func_id != BPF_FUNC_task_storage_delete)
> > > > > goto error;
> > > > > break;
> > > > > + case BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE:
> > > > > + if (func_id != BPF_FUNC_cgroup_local_storage_get &&
> > > > > + func_id != BPF_FUNC_cgroup_local_storage_delete)
> > > > > + goto error;
> > > > > + break;
> > > > > case BPF_MAP_TYPE_BLOOM_FILTER:
> > > > > if (func_id != BPF_FUNC_map_peek_elem &&
> > > > > func_id != BPF_FUNC_map_push_elem)
> > > > > @@ -6472,6 +6477,11 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
> > > > > if (map->map_type != BPF_MAP_TYPE_TASK_STORAGE)
> > > > > goto error;
> > > > > break;
> > > > > + case BPF_FUNC_cgroup_local_storage_get:
> > > > > + case BPF_FUNC_cgroup_local_storage_delete:
> > > > > + if (map->map_type != BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE)
> > > > > + goto error;
> > > > > + break;
> > > > > default:
> > > > > break;
> > > > > }
> > > > > @@ -12713,6 +12723,7 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
> > > > > case BPF_MAP_TYPE_INODE_STORAGE:
> > > > > case BPF_MAP_TYPE_SK_STORAGE:
> > > > > case BPF_MAP_TYPE_TASK_STORAGE:
> > > > > + case BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE:
> > > > > break;
> > > > > default:
> > > > > verbose(env,
> > > > > @@ -14149,7 +14160,8 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
> > > > >
> > > > > if (insn->imm == BPF_FUNC_task_storage_get ||
> > > > > insn->imm == BPF_FUNC_sk_storage_get ||
> > > > > - insn->imm == BPF_FUNC_inode_storage_get) {
> > > > > + insn->imm == BPF_FUNC_inode_storage_get ||
> > > > > + insn->imm == BPF_FUNC_cgroup_local_storage_get) {
> > > > > if (env->prog->aux->sleepable)
> > > > > insn_buf[0] = BPF_MOV64_IMM(BPF_REG_5, (__force __s32)GFP_KERNEL);
> > > > > else
> > > > > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> > > > > index 8ad2c267ff47..2fa2c950c7fb 100644
> > > > > --- a/kernel/cgroup/cgroup.c
> > > > > +++ b/kernel/cgroup/cgroup.c
> > > > > @@ -985,6 +985,10 @@ void put_css_set_locked(struct css_set *cset)
> > > > > put_css_set_locked(cset->dom_cset);
> > > > > }
> > > > >
> > > > > +#ifdef CONFIG_BPF_SYSCALL
> > > > > + bpf_local_cgroup_storage_free(cset->dfl_cgrp);
> > > > > +#endif
> > > > > +
> > >
> > > I am confused about this freeing site. It seems like this path is for
> > > freeing css_set's of task_structs, not for freeing the cgroup itself.
> > > Wouldn't we want to free the local storage when we free the cgroup
> > > itself? Somewhere like css_free_rwork_fn()? or did I completely miss
> > > the point here?
> > >
> > > > > kfree_rcu(cset, rcu_head);
> > > > > }
> > > > >
> > > > > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> > > > > index 688552df95ca..179adaae4a9f 100644
> > > > > --- a/kernel/trace/bpf_trace.c
> > > > > +++ b/kernel/trace/bpf_trace.c
> > > > > @@ -1454,6 +1454,10 @@ bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> > > > > return &bpf_get_current_cgroup_id_proto;
> > > > > case BPF_FUNC_get_current_ancestor_cgroup_id:
> > > > > return &bpf_get_current_ancestor_cgroup_id_proto;
> > > > > + case BPF_FUNC_cgroup_local_storage_get:
> > > > > + return &bpf_cgroup_storage_get_proto;
> > > > > + case BPF_FUNC_cgroup_local_storage_delete:
> > > > > + return &bpf_cgroup_storage_delete_proto;
> > > > > #endif
> > > > > case BPF_FUNC_send_signal:
> > > > > return &bpf_send_signal_proto;
> > > > > diff --git a/scripts/bpf_doc.py b/scripts/bpf_doc.py
> > > > > index c0e6690be82a..fdb0aff8cb5a 100755
> > > > > --- a/scripts/bpf_doc.py
> > > > > +++ b/scripts/bpf_doc.py
> > > > > @@ -685,6 +685,7 @@ class PrinterHelpers(Printer):
> > > > > 'struct udp6_sock',
> > > > > 'struct unix_sock',
> > > > > 'struct task_struct',
> > > > > + 'struct cgroup',
> > > > >
> > > > > 'struct __sk_buff',
> > > > > 'struct sk_msg_md',
> > > > > @@ -742,6 +743,7 @@ class PrinterHelpers(Printer):
> > > > > 'struct udp6_sock',
> > > > > 'struct unix_sock',
> > > > > 'struct task_struct',
> > > > > + 'struct cgroup',
> > > > > 'struct path',
> > > > > 'struct btf_ptr',
> > > > > 'struct inode',
> > > > > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> > > > > index 17f61338f8f8..d918b4054297 100644
> > > > > --- a/tools/include/uapi/linux/bpf.h
> > > > > +++ b/tools/include/uapi/linux/bpf.h
> > > > > @@ -935,6 +935,7 @@ enum bpf_map_type {
> > > > > BPF_MAP_TYPE_TASK_STORAGE,
> > > > > BPF_MAP_TYPE_BLOOM_FILTER,
> > > > > BPF_MAP_TYPE_USER_RINGBUF,
> > > > > + BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE,
> > > > > };
> > > > >
> > > > > /* Note that tracing related programs such as
> > > > > @@ -5435,6 +5436,42 @@ union bpf_attr {
> > > > > * **-E2BIG** if user-space has tried to publish a sample which is
> > > > > * larger than the size of the ring buffer, or which cannot fit
> > > > > * within a struct bpf_dynptr.
> > > > > + *
> > > > > + * void *bpf_cgroup_local_storage_get(struct bpf_map *map, struct cgroup *cgroup, void *value, u64 flags)
> > > > > + * Description
> > > > > + * Get a bpf_local_storage from the *cgroup*.
> > > > > + *
> > > > > + * Logically, it could be thought of as getting the value from
> > > > > + * a *map* with *cgroup* as the **key**. From this
> > > > > + * perspective, the usage is not much different from
> > > > > + * **bpf_map_lookup_elem**\ (*map*, **&**\ *cgroup*) except this
> > > > > + * helper enforces the key must be a cgroup struct and the map must also
> > > > > + * be a **BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE**.
> > > > > + *
> > > > > + * Underneath, the value is stored locally at *cgroup* instead of
> > > > > + * the *map*. The *map* is used as the bpf-local-storage
> > > > > + * "type". The bpf-local-storage "type" (i.e. the *map*) is
> > > > > + * searched against all bpf_local_storage residing at *cgroup*.
> > > > > + *
> > > > > + * An optional *flags* (**BPF_LOCAL_STORAGE_GET_F_CREATE**) can be
> > > > > + * used such that a new bpf_local_storage will be
> > > > > + * created if one does not exist. *value* can be used
> > > > > + * together with **BPF_LOCAL_STORAGE_GET_F_CREATE** to specify
> > > > > + * the initial value of a bpf_local_storage. If *value* is
> > > > > + * **NULL**, the new bpf_local_storage will be zero initialized.
> > > > > + * Return
> > > > > + * A bpf_local_storage pointer is returned on success.
> > > > > + *
> > > > > + * **NULL** if not found or there was an error in adding
> > > > > + * a new bpf_local_storage.
> > > > > + *
> > > > > + * long bpf_cgroup_local_storage_delete(struct bpf_map *map, struct cgroup *cgroup)
> > > > > + * Description
> > > > > + * Delete a bpf_local_storage from a *cgroup*.
> > > > > + * Return
> > > > > + * 0 on success.
> > > > > + *
> > > > > + * **-ENOENT** if the bpf_local_storage cannot be found.
> > > > > */
> > > > > #define ___BPF_FUNC_MAPPER(FN, ctx...) \
> > > > > FN(unspec, 0, ##ctx) \
> > > > > @@ -5647,6 +5684,8 @@ union bpf_attr {
> > > > > FN(tcp_raw_check_syncookie_ipv6, 207, ##ctx) \
> > > > > FN(ktime_get_tai_ns, 208, ##ctx) \
> > > > > FN(user_ringbuf_drain, 209, ##ctx) \
> > > > > + FN(cgroup_local_storage_get, 210, ##ctx) \
> > > > > + FN(cgroup_local_storage_delete, 211, ##ctx) \
> > > > > /* */
> > > > >
> > > > > /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that don't
> > > > > --
> > > > > 2.30.2

^ permalink raw reply	[flat|nested] 38+ messages in thread
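Taken together, the uapi documentation quoted above suggests BPF-side usage along these lines. This is an illustrative sketch only, not code from the series: it assumes the (unmerged) map type and helpers from this patch set plus a vmlinux.h/libbpf build environment, and the `struct cgroup_counter` value type and tracepoint attach point are invented for the example:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* Hypothetical per-cgroup value type for this example. */
struct cgroup_counter {
	__u64 events;
};

struct {
	__uint(type, BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__type(key, int);
	__type(value, struct cgroup_counter);
} cgrp_map SEC(".maps");

SEC("tp_btf/sys_enter")
int BPF_PROG(count_per_cgroup)
{
	struct task_struct *task = bpf_get_current_task_btf();
	struct cgroup_counter *val;

	/* dfl_cgrp is the task's cgroup on the default hierarchy. */
	val = bpf_cgroup_local_storage_get(&cgrp_map, task->cgroups->dfl_cgrp,
					   0, BPF_LOCAL_STORAGE_GET_F_CREATE);
	if (!val)	/* unlike bpf_get_local_storage(), NULL is possible */
		return 0;
	__sync_fetch_and_add(&val->events, 1);
	return 0;
}

char _license[] SEC("license") = "GPL";
```

Note the mandatory NULL check: as discussed later in this thread, the new helper can return **NULL** (e.g. when allocation fails or the busy trylock is held), unlike the old `bpf_get_local_storage()` for cgroup-attached programs.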
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs
  2022-10-17 19:07 ` Stanislav Fomichev
@ 2022-10-17 19:11 ` Yosry Ahmed
  2022-10-17 19:26 ` Tejun Heo
  2022-10-17 21:07 ` Martin KaFai Lau
  0 siblings, 2 replies; 38+ messages in thread
From: Yosry Ahmed @ 2022-10-17 19:11 UTC (permalink / raw)
To: Stanislav Fomichev
Cc: Yonghong Song, bpf, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo

On Mon, Oct 17, 2022 at 12:07 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> On Mon, Oct 17, 2022 at 11:47 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Mon, Oct 17, 2022 at 11:43 AM Stanislav Fomichev <sdf@google.com> wrote:
> > >
> > > On Mon, Oct 17, 2022 at 11:26 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > >
> > > > On Mon, Oct 17, 2022 at 11:02 AM <sdf@google.com> wrote:
> > > > >
> > > > > On 10/13, Yonghong Song wrote:
> > > > > > Similar to sk/inode/task storage, implement similar cgroup local storage.
> > > > > >
> > > > > > There already exists a local storage implementation for cgroup-attached
> > > > > > bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper
> > > > > > bpf_get_local_storage(). But there are use cases such that non-cgroup
> > > > > > attached bpf progs wants to access cgroup local storage data. For example,
> > > > > > tc egress prog has access to sk and cgroup. It is possible to use
> > > > > > sk local storage to emulate cgroup local storage by storing data in socket.
> > > > > > But this is a waste as it could be lots of sockets belonging to a particular
> > > > > > cgroup. Alternatively, a separate map can be created with cgroup id as the key.
> > > > > > But this will introduce additional overhead to manipulate the new map.
> > > > > > A cgroup local storage, similar to existing sk/inode/task storage,
> > > > > > should help for this use case.
> > > > > >
> > > > > > The life-cycle of storage is managed with the life-cycle of the
> > > > > > cgroup struct. i.e. the storage is destroyed along with the owning cgroup
> > > > > > with a callback to the bpf_cgroup_storage_free when cgroup itself
> > > > > > is deleted.
> > > > > >
> > > > > > The userspace map operations can be done by using a cgroup fd as a key
> > > > > > passed to the lookup, update and delete operations.
> > > > >
> > > > > [..]
> > > > >
> > > > > > Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old cgroup local
> > > > > > storage support, the new map name BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE is used
> > > > > > for cgroup storage available to non-cgroup-attached bpf programs. The two
> > > > > > helpers are named as bpf_cgroup_local_storage_get() and
> > > > > > bpf_cgroup_local_storage_delete().
> > > > >
> > > > > Have you considered doing something similar to 7d9c3427894f ("bpf: Make
> > > > > cgroup storages shared between programs on the same cgroup") where
> > > > > the map changes its behavior depending on the key size (see key_size checks
> > > > > in cgroup_storage_map_alloc)? Looks like sizeof(int) for fd still
> > > > > can be used so we can, in theory, reuse the name..
> > > > >
> > > > > Pros:
> > > > > - no need for a new map name
> > > > >
> > > > > Cons:
> > > > > - existing BPF_MAP_TYPE_CGROUP_STORAGE is already messy; might be not a
> > > > > good idea to add more stuff to it?
> > > > >
> > > > > But, for the very least, should we also extend
> > > > > Documentation/bpf/map_cgroup_storage.rst to cover the new map? We've
> > > > > tried to keep some of the important details in there..
> > > >
> > > > This might be a long shot, but is it possible to switch completely to
> > > > this new generic cgroup storage, and for programs that attach to
> > > > cgroups we can still do lookups/allocations during attachment like we
> > > > do today? IOW, maintain the current API for cgroup progs but switch it
> > > > to use this new map type instead.
> > > >
> > > > It feels like this map type is more generic and can be a superset of
> > > > the existing cgroup storage, but I feel like I am missing something.
> > >
> > > I feel like the biggest issue is that the existing
> > > bpf_get_local_storage helper is guaranteed to always return non-null
> > > and the verifier doesn't require the programs to do null checks on it;
> > > the new helper might return NULL making all existing programs fail the
> > > verifier.
> >
> > What I meant is, keep the old bpf_get_local_storage helper only for
> > cgroup-attached programs like we have today, and add a new generic
> > bpf_cgroup_local_storage_get() helper.
> >
> > For cgroup-attached programs, make sure a cgroup storage entry is
> > allocated and hooked to the helper on program attach time, to keep
> > today's behavior constant.
> >
> > For other programs, the bpf_cgroup_local_storage_get() will do the
> > normal lookup and allocate if necessary.
> >
> > Does this make any sense to you?
>
> But then you also need to somehow mark these to make sure it's not
> possible to delete them as long as the program is loaded/attached? Not
> saying it's impossible, but it's a bit of a departure from the
> existing common local storage framework used by inode/task; not sure
> whether we want to pull all this complexity in there? But we can
> definitely try if there is a wider agreement..

I agree that it's not ideal, but it feels like we are comparing two
non-ideal options anyway, I am just throwing ideas around :)

> > > There might be something else I don't remember at this point (besides
> > > that weird per-prog_type that we'd have to emulate as well)..
> >
> > Yeah there are things that will need to be emulated, but I feel like
> > we may end up with less confusing code (and less code in general).
> >
> > > > > >
> > > > > > Signed-off-by: Yonghong Song <yhs@fb.com>
> > > > > > ---
> > > > > > include/linux/bpf.h | 3 +
> > > > > > include/linux/bpf_types.h | 1 +
> > > > > > include/linux/cgroup-defs.h | 4 +
> > > > > > include/uapi/linux/bpf.h | 39 +++++
> > > > > > kernel/bpf/Makefile | 2 +-
> > > > > > kernel/bpf/bpf_cgroup_storage.c | 280 ++++++++++++++++++++++++++++++++
> > > > > > kernel/bpf/helpers.c | 6 +
> > > > > > kernel/bpf/syscall.c | 3 +-
> > > > > > kernel/bpf/verifier.c | 14 +-
> > > > > > kernel/cgroup/cgroup.c | 4 +
> > > > > > kernel/trace/bpf_trace.c | 4 +
> > > > > > scripts/bpf_doc.py | 2 +
> > > > > > tools/include/uapi/linux/bpf.h | 39 +++++
> > > > > > 13 files changed, 398 insertions(+), 3 deletions(-)
> > > > > > create mode 100644 kernel/bpf/bpf_cgroup_storage.c
> > > > > >
> > > > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > > > > index 9e7d46d16032..1395a01c7f18 100644
> > > > > > --- a/include/linux/bpf.h
> > > > > > +++ b/include/linux/bpf.h
> > > > > > @@ -2045,6 +2045,7 @@ struct bpf_link *bpf_link_by_id(u32 id);
> > > > > >
> > > > > > const struct bpf_func_proto *bpf_base_func_proto(enum bpf_func_id func_id);
> > > > > > void bpf_task_storage_free(struct task_struct *task);
> > > > > > +void bpf_local_cgroup_storage_free(struct cgroup *cgroup);
> > > > > > bool bpf_prog_has_kfunc_call(const struct bpf_prog *prog);
> > > > > > const struct btf_func_model *
> > > > > > bpf_jit_find_kfunc_model(const struct bpf_prog *prog,
> > > > > > @@ -2537,6 +2538,8 @@ extern const struct bpf_func_proto bpf_copy_from_user_task_proto;
> > > > > > extern const struct bpf_func_proto bpf_set_retval_proto;
> > > > > > extern const struct bpf_func_proto bpf_get_retval_proto;
> > > > > > extern const struct bpf_func_proto bpf_user_ringbuf_drain_proto;
> > > > > > +extern const struct bpf_func_proto bpf_cgroup_storage_get_proto;
> > > > > > +extern const struct bpf_func_proto bpf_cgroup_storage_delete_proto;
> > > > > >
> > > > > > const struct bpf_func_proto *tracing_prog_func_proto(
> > > > > > enum bpf_func_id func_id, const struct bpf_prog *prog);
> > > > > > diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> > > > > > index 2c6a4f2562a7..7a0362d7a0aa 100644
> > > > > > --- a/include/linux/bpf_types.h
> > > > > > +++ b/include/linux/bpf_types.h
> > > > > > @@ -90,6 +90,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_ARRAY, cgroup_array_map_ops)
> > > > > > #ifdef CONFIG_CGROUP_BPF
> > > > > > BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_STORAGE, cgroup_storage_map_ops)
> > > > > > BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE, cgroup_storage_map_ops)
> > > > > > +BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, cgroup_local_storage_map_ops)
> > > > > > #endif
> > > > > > BPF_MAP_TYPE(BPF_MAP_TYPE_HASH, htab_map_ops)
> > > > > > BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_HASH, htab_percpu_map_ops)
> > > > > > diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> > > > > > index 4bcf56b3491c..c6f4590dda68 100644
> > > > > > --- a/include/linux/cgroup-defs.h
> > > > > > +++ b/include/linux/cgroup-defs.h
> > > > > > @@ -504,6 +504,10 @@ struct cgroup {
> > > > > > /* Used to store internal freezer state */
> > > > > > struct cgroup_freezer_state freezer;
> > > > > >
> > > > > > +#ifdef CONFIG_BPF_SYSCALL
> > > > > > + struct bpf_local_storage __rcu *bpf_cgroup_storage;
> > > > > > +#endif
> > > > > > +
> > > > > > /* ids of the ancestors at each level including self */
> > > > > > u64 ancestor_ids[];
> > > > > > };
> > > > > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > > > > index 17f61338f8f8..d918b4054297 100644
> > > > > > --- a/include/uapi/linux/bpf.h
> > > > > > +++ b/include/uapi/linux/bpf.h
> > > > > > @@ -935,6 +935,7 @@ enum bpf_map_type {
> > > > > > BPF_MAP_TYPE_TASK_STORAGE,
> > > > > > BPF_MAP_TYPE_BLOOM_FILTER,
> > > > > > BPF_MAP_TYPE_USER_RINGBUF,
> > > > > > + BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE,
> > > > > > };
> > > > > >
> > > > > > /* Note that tracing related programs such as
> > > > > > @@ -5435,6 +5436,42 @@ union bpf_attr {
> > > > > > * **-E2BIG** if user-space has tried to publish a sample which is
> > > > > > * larger than the size of the ring buffer, or which cannot fit
> > > > > > * within a struct bpf_dynptr.
> > > > > > + *
> > > > > > + * void *bpf_cgroup_local_storage_get(struct bpf_map *map, struct cgroup *cgroup, void *value, u64 flags)
> > > > > > + * Description
> > > > > > + * Get a bpf_local_storage from the *cgroup*.
> > > > > > + *
> > > > > > + * Logically, it could be thought of as getting the value from
> > > > > > + * a *map* with *cgroup* as the **key**. From this
> > > > > > + * perspective, the usage is not much different from
> > > > > > + * **bpf_map_lookup_elem**\ (*map*, **&**\ *cgroup*) except this
> > > > > > + * helper enforces the key must be a cgroup struct and the map must also
> > > > > > + * be a **BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE**.
> > > > > > + *
> > > > > > + * Underneath, the value is stored locally at *cgroup* instead of
> > > > > > + * the *map*. The *map* is used as the bpf-local-storage
> > > > > > + * "type". The bpf-local-storage "type" (i.e. the *map*) is
> > > > > > + * searched against all bpf_local_storage residing at *cgroup*.
> > > > > > + *
> > > > > > + * An optional *flags* (**BPF_LOCAL_STORAGE_GET_F_CREATE**) can be
> > > > > > + * used such that a new bpf_local_storage will be
> > > > > > + * created if one does not exist. *value* can be used
> > > > > > + * together with **BPF_LOCAL_STORAGE_GET_F_CREATE** to specify
> > > > > > + * the initial value of a bpf_local_storage. If *value* is
> > > > > > + * **NULL**, the new bpf_local_storage will be zero initialized.
> > > > > > + * Return
> > > > > > + * A bpf_local_storage pointer is returned on success.
> > > > > > + *
> > > > > > + * **NULL** if not found or there was an error in adding
> > > > > > + * a new bpf_local_storage.
> > > > > > + *
> > > > > > + * long bpf_cgroup_local_storage_delete(struct bpf_map *map, struct cgroup *cgroup)
> > > > > > + * Description
> > > > > > + * Delete a bpf_local_storage from a *cgroup*.
> > > > > > + * Return
> > > > > > + * 0 on success.
> > > > > > + *
> > > > > > + * **-ENOENT** if the bpf_local_storage cannot be found.
> > > > > > */
> > > > > > #define ___BPF_FUNC_MAPPER(FN, ctx...) \
> > > > > > FN(unspec, 0, ##ctx) \
> > > > > > @@ -5647,6 +5684,8 @@ union bpf_attr {
> > > > > > FN(tcp_raw_check_syncookie_ipv6, 207, ##ctx) \
> > > > > > FN(ktime_get_tai_ns, 208, ##ctx) \
> > > > > > FN(user_ringbuf_drain, 209, ##ctx) \
> > > > > > + FN(cgroup_local_storage_get, 210, ##ctx) \
> > > > > > + FN(cgroup_local_storage_delete, 211, ##ctx) \
> > > > > > /* */
> > > > > >
> > > > > > /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that don't
> > > > > > diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> > > > > > index 341c94f208f4..b02693f51978 100644
> > > > > > --- a/kernel/bpf/Makefile
> > > > > > +++ b/kernel/bpf/Makefile
> > > > > > @@ -25,7 +25,7 @@ ifeq ($(CONFIG_PERF_EVENTS),y)
> > > > > > obj-$(CONFIG_BPF_SYSCALL) += stackmap.o
> > > > > > endif
> > > > > > ifeq ($(CONFIG_CGROUPS),y)
> > > > > > -obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o
> > > > > > +obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o bpf_cgroup_storage.o
> > > > > > endif
> > > > > > obj-$(CONFIG_CGROUP_BPF) += cgroup.o
> > > > > > ifeq ($(CONFIG_INET),y)
> > > > > > diff --git a/kernel/bpf/bpf_cgroup_storage.c b/kernel/bpf/bpf_cgroup_storage.c
> > > > > > new file mode 100644
> > > > > > index 000000000000..9974784822da
> > > > > > --- /dev/null
> > > > > > +++ b/kernel/bpf/bpf_cgroup_storage.c
> > > > > > @@ -0,0 +1,280 @@
> > > > > > +// SPDX-License-Identifier: GPL-2.0
> > > > > > +/*
> > > > > > + * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
> > > > > > + */
> > > > > > +
> > > > > > +#include <linux/types.h>
> > > > > > +#include <linux/bpf.h>
> > > > > > +#include <linux/bpf_local_storage.h>
> > > > > > +#include <uapi/linux/btf.h>
> > > > > > +#include <linux/btf_ids.h>
> > > > > > +
> > > > > > +DEFINE_BPF_STORAGE_CACHE(cgroup_cache);
> > > > > > +
> > > > > > +static DEFINE_PER_CPU(int, bpf_cgroup_storage_busy);
> > > > > > +
> > > > > > +static void bpf_cgroup_storage_lock(void)
> > > > > > +{
> > > > > > + migrate_disable();
> > > > > > + this_cpu_inc(bpf_cgroup_storage_busy);
> > > > > > +}
> > > > > > +
> > > > > > +static void bpf_cgroup_storage_unlock(void)
> > > > > > +{
> > > > > > + this_cpu_dec(bpf_cgroup_storage_busy);
> > > > > > + migrate_enable();
> > > > > > +}
> > > > > > +
> > > > > > +static bool bpf_cgroup_storage_trylock(void)
> > > > > > +{
> > > > > > + migrate_disable();
> > > > > > + if (unlikely(this_cpu_inc_return(bpf_cgroup_storage_busy) != 1)) {
> > > > > > + this_cpu_dec(bpf_cgroup_storage_busy);
> > > > > > + migrate_enable();
> > > > > > + return false;
> > > > > > + }
> > > > > > + return true;
> > > > > > +}
> > > > >
> > > > > Task storage has lock/unlock/trylock; inode storage doesn't; why does
> > > > > cgroup need it as well?
> > > > >
> > > > > > +static struct bpf_local_storage __rcu **cgroup_storage_ptr(void *owner)
> > > > > > +{
> > > > > > + struct cgroup *cg = owner;
> > > > > > +
> > > > > > + return &cg->bpf_cgroup_storage;
> > > > > > +}
> > > > > > +
> > > > > > +void bpf_local_cgroup_storage_free(struct cgroup *cgroup)
> > > > > > +{
> > > > > > + struct bpf_local_storage *local_storage;
> > > > > > + struct bpf_local_storage_elem *selem;
> > > > > > + bool free_cgroup_storage = false;
> > > > > > + struct hlist_node *n;
> > > > > > + unsigned long flags;
> > > > > > +
> > > > > > + rcu_read_lock();
> > > > > > + local_storage = rcu_dereference(cgroup->bpf_cgroup_storage);
> > > > > > + if (!local_storage) {
> > > > > > + rcu_read_unlock();
> > > > > > + return;
> > > > > > + }
> > > > > > +
> > > > > > + /* Neither the bpf_prog nor the bpf-map's syscall
> > > > > > + * could be modifying the local_storage->list now.
> > > > > > + * Thus, no elem can be added-to or deleted-from the
> > > > > > + * local_storage->list by the bpf_prog or by the bpf-map's syscall.
> > > > > > + *
> > > > > > + * It is racing with bpf_local_storage_map_free() alone
> > > > > > + * when unlinking elem from the local_storage->list and
> > > > > > + * the map's bucket->list.
> > > > > > + */
> > > > > > + bpf_cgroup_storage_lock();
> > > > > > + raw_spin_lock_irqsave(&local_storage->lock, flags);
> > > > > > + hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) {
> > > > > > + bpf_selem_unlink_map(selem);
> > > > > > + free_cgroup_storage =
> > > > > > + bpf_selem_unlink_storage_nolock(local_storage, selem, false, false);
> > > > > > + }
> > > > > > + raw_spin_unlock_irqrestore(&local_storage->lock, flags);
> > > > > > + bpf_cgroup_storage_unlock();
> > > > > > + rcu_read_unlock();
> > > > > > +
> > > > > > + /* free_cgroup_storage should always be true as long as
> > > > > > + * local_storage->list was non-empty.
> > > > > > + */
> > > > > > + if (free_cgroup_storage)
> > > > > > + kfree_rcu(local_storage, rcu);
> > > > > > +}
> > > > > >
> > > > > > +static struct bpf_local_storage_data *
> > > > > > +cgroup_storage_lookup(struct cgroup *cgroup, struct bpf_map *map, bool cacheit_lockit)
> > > > > > +{
> > > > > > + struct bpf_local_storage *cgroup_storage;
> > > > > > + struct bpf_local_storage_map *smap;
> > > > > > +
> > > > > > + cgroup_storage = rcu_dereference_check(cgroup->bpf_cgroup_storage,
> > > > > > + bpf_rcu_lock_held());
> > > > > > + if (!cgroup_storage)
> > > > > > + return NULL;
> > > > > > +
> > > > > > + smap = (struct bpf_local_storage_map *)map;
> > > > > > + return bpf_local_storage_lookup(cgroup_storage, smap, cacheit_lockit);
> > > > > > +}
> > > > > > +
> > > > > > +static void *bpf_cgroup_storage_lookup_elem(struct bpf_map *map, void *key)
> > > > > > +{
> > > > > > + struct bpf_local_storage_data *sdata;
> > > > > > + struct cgroup *cgroup;
> > > > > > + int fd;
> > > > > > +
> > > > > > + fd = *(int *)key;
> > > > > > + cgroup = cgroup_get_from_fd(fd);
> > > > > > + if (IS_ERR(cgroup))
> > > > > > + return ERR_CAST(cgroup);
> > > > > > +
> > > > > > + bpf_cgroup_storage_lock();
> > > > > > + sdata = cgroup_storage_lookup(cgroup, map, true);
> > > > > > + bpf_cgroup_storage_unlock();
> > > > > > + cgroup_put(cgroup);
> > > > > > + return sdata ? sdata->data : NULL;
> > > > > > +}
> > > > >
> > > > > A lot of the above (free/lookup) seems to be copy-pasted from the task storage;
> > > > > any point in trying to generalize the common parts?
> > > > > > > > > > > +static int bpf_cgroup_storage_update_elem(struct bpf_map *map, void *key, > > > > > > + void *value, u64 map_flags) > > > > > > +{ > > > > > > + struct bpf_local_storage_data *sdata; > > > > > > + struct cgroup *cgroup; > > > > > > + int err, fd; > > > > > > + > > > > > > + fd = *(int *)key; > > > > > > + cgroup = cgroup_get_from_fd(fd); > > > > > > + if (IS_ERR(cgroup)) > > > > > > + return PTR_ERR(cgroup); > > > > > > + > > > > > > + bpf_cgroup_storage_lock(); > > > > > > + sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map > > > > > > *)map, > > > > > > + value, map_flags, GFP_ATOMIC); > > > > > > + bpf_cgroup_storage_unlock(); > > > > > > + err = PTR_ERR_OR_ZERO(sdata); > > > > > > + cgroup_put(cgroup); > > > > > > + return err; > > > > > > +} > > > > > > + > > > > > > +static int cgroup_storage_delete(struct cgroup *cgroup, struct bpf_map > > > > > > *map) > > > > > > +{ > > > > > > + struct bpf_local_storage_data *sdata; > > > > > > + > > > > > > + sdata = cgroup_storage_lookup(cgroup, map, false); > > > > > > + if (!sdata) > > > > > > + return -ENOENT; > > > > > > + > > > > > > + bpf_selem_unlink(SELEM(sdata), true); > > > > > > + return 0; > > > > > > +} > > > > > > + > > > > > > +static int bpf_cgroup_storage_delete_elem(struct bpf_map *map, void *key) > > > > > > +{ > > > > > > + struct cgroup *cgroup; > > > > > > + int err, fd; > > > > > > + > > > > > > + fd = *(int *)key; > > > > > > + cgroup = cgroup_get_from_fd(fd); > > > > > > + if (IS_ERR(cgroup)) > > > > > > + return PTR_ERR(cgroup); > > > > > > + > > > > > > + bpf_cgroup_storage_lock(); > > > > > > + err = cgroup_storage_delete(cgroup, map); > > > > > > + bpf_cgroup_storage_unlock(); > > > > > > + if (err) > > > > > > + return err; > > > > > > + > > > > > > + cgroup_put(cgroup); > > > > > > + return 0; > > > > > > +} > > > > > > + > > > > > > +static int notsupp_get_next_key(struct bpf_map *map, void *key, void > > > > > > *next_key) > > > > > > +{ > > > 
> > > + return -ENOTSUPP; > > > > > > +} > > > > > > + > > > > > > +static struct bpf_map *cgroup_storage_map_alloc(union bpf_attr *attr) > > > > > > +{ > > > > > > + struct bpf_local_storage_map *smap; > > > > > > + > > > > > > + smap = bpf_local_storage_map_alloc(attr); > > > > > > + if (IS_ERR(smap)) > > > > > > + return ERR_CAST(smap); > > > > > > + > > > > > > + smap->cache_idx = bpf_local_storage_cache_idx_get(&cgroup_cache); > > > > > > + return &smap->map; > > > > > > +} > > > > > > + > > > > > > +static void cgroup_storage_map_free(struct bpf_map *map) > > > > > > +{ > > > > > > + struct bpf_local_storage_map *smap; > > > > > > + > > > > > > + smap = (struct bpf_local_storage_map *)map; > > > > > > + bpf_local_storage_cache_idx_free(&cgroup_cache, smap->cache_idx); > > > > > > + bpf_local_storage_map_free(smap, NULL); > > > > > > +} > > > > > > + > > > > > > +/* *gfp_flags* is a hidden argument provided by the verifier */ > > > > > > +BPF_CALL_5(bpf_cgroup_storage_get, struct bpf_map *, map, struct cgroup > > > > > > *, cgroup, > > > > > > + void *, value, u64, flags, gfp_t, gfp_flags) > > > > > > +{ > > > > > > + struct bpf_local_storage_data *sdata; > > > > > > + > > > > > > + WARN_ON_ONCE(!bpf_rcu_lock_held()); > > > > > > + if (flags & ~(BPF_LOCAL_STORAGE_GET_F_CREATE)) > > > > > > + return (unsigned long)NULL; > > > > > > + > > > > > > + if (!cgroup) > > > > > > + return (unsigned long)NULL; > > > > > > + > > > > > > + if (!bpf_cgroup_storage_trylock()) > > > > > > + return (unsigned long)NULL; > > > > > > + > > > > > > + sdata = cgroup_storage_lookup(cgroup, map, true); > > > > > > + if (sdata) > > > > > > + goto unlock; > > > > > > + > > > > > > + /* only allocate new storage, when the cgroup is refcounted */ > > > > > > + if (!percpu_ref_is_dying(&cgroup->self.refcnt) && > > > > > > + (flags & BPF_LOCAL_STORAGE_GET_F_CREATE)) > > > > > > + sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map > > > > > > *)map, > > > > > > + value, 
BPF_NOEXIST, gfp_flags); > > > > > > + > > > > > > +unlock: > > > > > > + bpf_cgroup_storage_unlock(); > > > > > > + return IS_ERR_OR_NULL(sdata) ? (unsigned long)NULL : (unsigned > > > > > > long)sdata->data; > > > > > > +} > > > > > > + > > > > > > +BPF_CALL_2(bpf_cgroup_storage_delete, struct bpf_map *, map, struct > > > > > > cgroup *, cgroup) > > > > > > +{ > > > > > > + int ret; > > > > > > + > > > > > > + WARN_ON_ONCE(!bpf_rcu_lock_held()); > > > > > > + if (!cgroup) > > > > > > + return -EINVAL; > > > > > > + > > > > > > + if (!bpf_cgroup_storage_trylock()) > > > > > > + return -EBUSY; > > > > > > + > > > > > > + ret = cgroup_storage_delete(cgroup, map); > > > > > > + bpf_cgroup_storage_unlock(); > > > > > > + return ret; > > > > > > +} > > > > > > + > > > > > > +BTF_ID_LIST_SINGLE(cgroup_storage_map_btf_ids, struct, > > > > > > bpf_local_storage_map) > > > > > > +const struct bpf_map_ops cgroup_local_storage_map_ops = { > > > > > > + .map_meta_equal = bpf_map_meta_equal, > > > > > > + .map_alloc_check = bpf_local_storage_map_alloc_check, > > > > > > + .map_alloc = cgroup_storage_map_alloc, > > > > > > + .map_free = cgroup_storage_map_free, > > > > > > + .map_get_next_key = notsupp_get_next_key, > > > > > > + .map_lookup_elem = bpf_cgroup_storage_lookup_elem, > > > > > > + .map_update_elem = bpf_cgroup_storage_update_elem, > > > > > > + .map_delete_elem = bpf_cgroup_storage_delete_elem, > > > > > > + .map_check_btf = bpf_local_storage_map_check_btf, > > > > > > + .map_btf_id = &cgroup_storage_map_btf_ids[0], > > > > > > + .map_owner_storage_ptr = cgroup_storage_ptr, > > > > > > +}; > > > > > > + > > > > > > +const struct bpf_func_proto bpf_cgroup_storage_get_proto = { > > > > > > + .func = bpf_cgroup_storage_get, > > > > > > + .gpl_only = false, > > > > > > + .ret_type = RET_PTR_TO_MAP_VALUE_OR_NULL, > > > > > > + .arg1_type = ARG_CONST_MAP_PTR, > > > > > > + .arg2_type = ARG_PTR_TO_BTF_ID, > > > > > > + .arg2_btf_id = &bpf_cgroup_btf_id[0], > > > > > > + 
> > > > > > +	.arg3_type = ARG_PTR_TO_MAP_VALUE_OR_NULL,
> > > > > > +	.arg4_type = ARG_ANYTHING,
> > > > > > +};
> > > > > > +
> > > > > > +const struct bpf_func_proto bpf_cgroup_storage_delete_proto = {
> > > > > > +	.func = bpf_cgroup_storage_delete,
> > > > > > +	.gpl_only = false,
> > > > > > +	.ret_type = RET_INTEGER,
> > > > > > +	.arg1_type = ARG_CONST_MAP_PTR,
> > > > > > +	.arg2_type = ARG_PTR_TO_BTF_ID,
> > > > > > +	.arg2_btf_id = &bpf_cgroup_btf_id[0],
> > > > > > +};
> > > > > > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> > > > > > index a6b04faed282..5c5bb08832ec 100644
> > > > > > --- a/kernel/bpf/helpers.c
> > > > > > +++ b/kernel/bpf/helpers.c
> > > > > > @@ -1663,6 +1663,12 @@ bpf_base_func_proto(enum bpf_func_id func_id)
> > > > > >  		return &bpf_dynptr_write_proto;
> > > > > >  	case BPF_FUNC_dynptr_data:
> > > > > >  		return &bpf_dynptr_data_proto;
> > > > > > +#ifdef CONFIG_CGROUPS
> > > > > > +	case BPF_FUNC_cgroup_local_storage_get:
> > > > > > +		return &bpf_cgroup_storage_get_proto;
> > > > > > +	case BPF_FUNC_cgroup_local_storage_delete:
> > > > > > +		return &bpf_cgroup_storage_delete_proto;
> > > > > > +#endif
> > > > > >  	default:
> > > > > >  		break;
> > > > > >  	}
> > > > > > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > > > > > index 7b373a5e861f..e53c7fae6e22 100644
> > > > > > --- a/kernel/bpf/syscall.c
> > > > > > +++ b/kernel/bpf/syscall.c
> > > > > > @@ -1016,7 +1016,8 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
> > > > > >  		    map->map_type != BPF_MAP_TYPE_CGROUP_STORAGE &&
> > > > > >  		    map->map_type != BPF_MAP_TYPE_SK_STORAGE &&
> > > > > >  		    map->map_type != BPF_MAP_TYPE_INODE_STORAGE &&
> > > > > > -		    map->map_type != BPF_MAP_TYPE_TASK_STORAGE)
> > > > > > +		    map->map_type != BPF_MAP_TYPE_TASK_STORAGE &&
> > > > > > +		    map->map_type != BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE)
> > > > > >  			return -ENOTSUPP;
> > > > > >  		if (map->spin_lock_off + sizeof(struct bpf_spin_lock) > map->value_size) {
> > > > > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > > > > > index 6f6d2d511c06..f36f6a3c0d50 100644
> > > > > > --- a/kernel/bpf/verifier.c
> > > > > > +++ b/kernel/bpf/verifier.c
> > > > > > @@ -6360,6 +6360,11 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
> > > > > >  		    func_id != BPF_FUNC_task_storage_delete)
> > > > > >  			goto error;
> > > > > >  		break;
> > > > > > +	case BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE:
> > > > > > +		if (func_id != BPF_FUNC_cgroup_local_storage_get &&
> > > > > > +		    func_id != BPF_FUNC_cgroup_local_storage_delete)
> > > > > > +			goto error;
> > > > > > +		break;
> > > > > >  	case BPF_MAP_TYPE_BLOOM_FILTER:
> > > > > >  		if (func_id != BPF_FUNC_map_peek_elem &&
> > > > > >  		    func_id != BPF_FUNC_map_push_elem)
> > > > > > @@ -6472,6 +6477,11 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
> > > > > >  		if (map->map_type != BPF_MAP_TYPE_TASK_STORAGE)
> > > > > >  			goto error;
> > > > > >  		break;
> > > > > > +	case BPF_FUNC_cgroup_local_storage_get:
> > > > > > +	case BPF_FUNC_cgroup_local_storage_delete:
> > > > > > +		if (map->map_type != BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE)
> > > > > > +			goto error;
> > > > > > +		break;
> > > > > >  	default:
> > > > > >  		break;
> > > > > >  	}
> > > > > > @@ -12713,6 +12723,7 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
> > > > > >  	case BPF_MAP_TYPE_INODE_STORAGE:
> > > > > >  	case BPF_MAP_TYPE_SK_STORAGE:
> > > > > >  	case BPF_MAP_TYPE_TASK_STORAGE:
> > > > > > +	case BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE:
> > > > > >  		break;
> > > > > >  	default:
> > > > > >  		verbose(env,
> > > > > > @@ -14149,7 +14160,8 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
> > > > > >  		if (insn->imm == BPF_FUNC_task_storage_get ||
> > > > > >  		    insn->imm == BPF_FUNC_sk_storage_get ||
> > > > > > -		    insn->imm == BPF_FUNC_inode_storage_get) {
> > > > > > +		    insn->imm == BPF_FUNC_inode_storage_get ||
> > > > > > +		    insn->imm == BPF_FUNC_cgroup_local_storage_get) {
> > > > > >  			if (env->prog->aux->sleepable)
> > > > > >  				insn_buf[0] = BPF_MOV64_IMM(BPF_REG_5, (__force __s32)GFP_KERNEL);
> > > > > >  			else
> > > > > > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> > > > > > index 8ad2c267ff47..2fa2c950c7fb 100644
> > > > > > --- a/kernel/cgroup/cgroup.c
> > > > > > +++ b/kernel/cgroup/cgroup.c
> > > > > > @@ -985,6 +985,10 @@ void put_css_set_locked(struct css_set *cset)
> > > > > >  		put_css_set_locked(cset->dom_cset);
> > > > > >  	}
> > > > > > 
> > > > > > +#ifdef CONFIG_BPF_SYSCALL
> > > > > > +	bpf_local_cgroup_storage_free(cset->dfl_cgrp);
> > > > > > +#endif
> > > > > > +
> > > > 
> > > > I am confused about this freeing site. It seems like this path is for
> > > > freeing css_set's of task_structs, not for freeing the cgroup itself.
> > > > Wouldn't we want to free the local storage when we free the cgroup
> > > > itself? Somewhere like css_free_rwork_fn()? or did I completely miss
> > > > the point here?
> > > > > >  	kfree_rcu(cset, rcu_head);
> > > > > >  }
> > > > > > 
> > > > > > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> > > > > > index 688552df95ca..179adaae4a9f 100644
> > > > > > --- a/kernel/trace/bpf_trace.c
> > > > > > +++ b/kernel/trace/bpf_trace.c
> > > > > > @@ -1454,6 +1454,10 @@ bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> > > > > >  		return &bpf_get_current_cgroup_id_proto;
> > > > > >  	case BPF_FUNC_get_current_ancestor_cgroup_id:
> > > > > >  		return &bpf_get_current_ancestor_cgroup_id_proto;
> > > > > > +	case BPF_FUNC_cgroup_local_storage_get:
> > > > > > +		return &bpf_cgroup_storage_get_proto;
> > > > > > +	case BPF_FUNC_cgroup_local_storage_delete:
> > > > > > +		return &bpf_cgroup_storage_delete_proto;
> > > > > >  #endif
> > > > > >  	case BPF_FUNC_send_signal:
> > > > > >  		return &bpf_send_signal_proto;
> > > > > > diff --git a/scripts/bpf_doc.py b/scripts/bpf_doc.py
> > > > > > index c0e6690be82a..fdb0aff8cb5a 100755
> > > > > > --- a/scripts/bpf_doc.py
> > > > > > +++ b/scripts/bpf_doc.py
> > > > > > @@ -685,6 +685,7 @@ class PrinterHelpers(Printer):
> > > > > >              'struct udp6_sock',
> > > > > >              'struct unix_sock',
> > > > > >              'struct task_struct',
> > > > > > +            'struct cgroup',
> > > > > > 
> > > > > >              'struct __sk_buff',
> > > > > >              'struct sk_msg_md',
> > > > > > @@ -742,6 +743,7 @@ class PrinterHelpers(Printer):
> > > > > >              'struct udp6_sock',
> > > > > >              'struct unix_sock',
> > > > > >              'struct task_struct',
> > > > > > +            'struct cgroup',
> > > > > >              'struct path',
> > > > > >              'struct btf_ptr',
> > > > > >              'struct inode',
> > > > > > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> > > > > > index 17f61338f8f8..d918b4054297 100644
> > > > > > --- a/tools/include/uapi/linux/bpf.h
> > > > > > +++ b/tools/include/uapi/linux/bpf.h
> > > > > > @@ -935,6 +935,7 @@ enum bpf_map_type {
> > > > > >  	BPF_MAP_TYPE_TASK_STORAGE,
> > > > > >  	BPF_MAP_TYPE_BLOOM_FILTER,
> > > > > >  	BPF_MAP_TYPE_USER_RINGBUF,
> > > > > > +	BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE,
> > > > > >  };
> > > > > > 
> > > > > >  /* Note that tracing related programs such as
> > > > > > @@ -5435,6 +5436,42 @@ union bpf_attr {
> > > > > >   *		**-E2BIG** if user-space has tried to publish a sample which is
> > > > > >   *		larger than the size of the ring buffer, or which cannot fit
> > > > > >   *		within a struct bpf_dynptr.
> > > > > > + *
> > > > > > + * void *bpf_cgroup_local_storage_get(struct bpf_map *map, struct cgroup *cgroup, void *value, u64 flags)
> > > > > > + *	Description
> > > > > > + *		Get a bpf_local_storage from the *cgroup*.
> > > > > > + *
> > > > > > + *		Logically, it could be thought of as getting the value from
> > > > > > + *		a *map* with *cgroup* as the **key**. From this
> > > > > > + *		perspective, the usage is not much different from
> > > > > > + *		**bpf_map_lookup_elem**\ (*map*, **&**\ *cgroup*) except this
> > > > > > + *		helper enforces the key must be a cgroup struct and the map must also
> > > > > > + *		be a **BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE**.
> > > > > > + *
> > > > > > + *		Underneath, the value is stored locally at *cgroup* instead of
> > > > > > + *		the *map*. The *map* is used as the bpf-local-storage
> > > > > > + *		"type". The bpf-local-storage "type" (i.e. the *map*) is
> > > > > > + *		searched against all bpf_local_storage residing at *cgroup*.
> > > > > > + *
> > > > > > + *		An optional *flags* (**BPF_LOCAL_STORAGE_GET_F_CREATE**) can be
> > > > > > + *		used such that a new bpf_local_storage will be
> > > > > > + *		created if one does not exist. *value* can be used
> > > > > > + *		together with **BPF_LOCAL_STORAGE_GET_F_CREATE** to specify
> > > > > > + *		the initial value of a bpf_local_storage. If *value* is
> > > > > > + *		**NULL**, the new bpf_local_storage will be zero initialized.
> > > > > > + *	Return
> > > > > > + *		A bpf_local_storage pointer is returned on success.
> > > > > > + *
> > > > > > + *		**NULL** if not found or there was an error in adding
> > > > > > + *		a new bpf_local_storage.
> > > > > > + *
> > > > > > + * long bpf_cgroup_local_storage_delete(struct bpf_map *map, struct cgroup *cgroup)
> > > > > > + *	Description
> > > > > > + *		Delete a bpf_local_storage from a *cgroup*.
> > > > > > + *	Return
> > > > > > + *		0 on success.
> > > > > > + *
> > > > > > + *		**-ENOENT** if the bpf_local_storage cannot be found.
> > > > > >   */
> > > > > >  #define ___BPF_FUNC_MAPPER(FN, ctx...)			\
> > > > > >  	FN(unspec, 0, ##ctx)				\
> > > > > > @@ -5647,6 +5684,8 @@ union bpf_attr {
> > > > > >  	FN(tcp_raw_check_syncookie_ipv6, 207, ##ctx)	\
> > > > > >  	FN(ktime_get_tai_ns, 208, ##ctx)		\
> > > > > >  	FN(user_ringbuf_drain, 209, ##ctx)		\
> > > > > > +	FN(cgroup_local_storage_get, 210, ##ctx)	\
> > > > > > +	FN(cgroup_local_storage_delete, 211, ##ctx)	\
> > > > > >  /* */
> > > > > > 
> > > > > >  /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that don't
> > > > > > -- 
> > > > > > 2.30.2

^ permalink raw reply	[flat|nested] 38+ messages in thread
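[Editor's note: a minimal sketch of how a non-cgroup-attached program might use the new map type and helpers, based on the helper documentation quoted above. The map name, section name, and the way the cgroup pointer is obtained are invented for illustration and are not taken from the patch's selftests; unlike the old bpf_get_local_storage(), the returned pointer must be NULL-checked because the helper can fail.]

```c
/* Illustrative only: assumes the BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE map
 * type and the bpf_cgroup_local_storage_get() helper from this series.
 */
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char _license[] SEC("license") = "GPL";

struct {
	__uint(type, BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__type(key, int);
	__type(value, long);
} cgrp_counter SEC(".maps");

SEC("tp_btf/sys_enter")
int BPF_PROG(count_sys_enter)
{
	struct task_struct *task = bpf_get_current_task_btf();
	long *val;

	/* dfl_cgrp is the task's cgroup on the default hierarchy; the
	 * helper takes a struct cgroup * (a BTF-typed kernel pointer),
	 * not a cgroup id or fd. */
	val = bpf_cgroup_local_storage_get(&cgrp_counter,
					   task->cgroups->dfl_cgrp, 0,
					   BPF_LOCAL_STORAGE_GET_F_CREATE);
	if (!val)	/* can be NULL, e.g. when the cgroup is dying */
		return 0;

	__sync_fetch_and_add(val, 1);
	return 0;
}
```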
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs
  2022-10-17 19:11             ` Yosry Ahmed
@ 2022-10-17 19:26               ` Tejun Heo
  2022-10-17 21:07               ` Martin KaFai Lau
  1 sibling, 0 replies; 38+ messages in thread
From: Tejun Heo @ 2022-10-17 19:26 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Stanislav Fomichev, Yonghong Song, bpf, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, kernel-team, KP Singh,
	Martin KaFai Lau

Hello,

On Mon, Oct 17, 2022 at 12:11:55PM -0700, Yosry Ahmed wrote:
> I agree that it's not ideal, but it feels like we are comparing two
> non-ideal options anyway, I am just throwing ideas around :)

In the spirit of throwing ideas around, I wonder whether the better way to
go about it is keeping them separate with clear documentation, and figuring
out a way to deprecate the old one, as AFAICS the new one should be able to
do everything the old one was doing. Would it be an option to, say, make
the verifier warn users towards converting to the new one and eventually
remove the old one down the line?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 38+ messages in thread
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs
  2022-10-17 19:11             ` Yosry Ahmed
  2022-10-17 19:26             ` Tejun Heo
@ 2022-10-17 21:07             ` Martin KaFai Lau
  2022-10-17 21:23               ` Yosry Ahmed
  2022-10-17 22:16               ` sdf
  1 sibling, 2 replies; 38+ messages in thread
From: Martin KaFai Lau @ 2022-10-17 21:07 UTC (permalink / raw)
To: Yosry Ahmed, Yonghong Song, Stanislav Fomichev
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo

On 10/17/22 12:11 PM, Yosry Ahmed wrote:
> On Mon, Oct 17, 2022 at 12:07 PM Stanislav Fomichev <sdf@google.com> wrote:
>>
>> On Mon, Oct 17, 2022 at 11:47 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>>>
>>> On Mon, Oct 17, 2022 at 11:43 AM Stanislav Fomichev <sdf@google.com> wrote:
>>>>
>>>> On Mon, Oct 17, 2022 at 11:26 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>>>>>
>>>>> On Mon, Oct 17, 2022 at 11:02 AM <sdf@google.com> wrote:
>>>>>>
>>>>>> On 10/13, Yonghong Song wrote:
>>>>>>> Similar to sk/inode/task storage, implement similar cgroup local storage.
>>>>>>
>>>>>>> There already exists a local storage implementation for cgroup-attached
>>>>>>> bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper
>>>>>>> bpf_get_local_storage(). But there are use cases such that non-cgroup
>>>>>>> attached bpf progs wants to access cgroup local storage data. For example,
>>>>>>> tc egress prog has access to sk and cgroup. It is possible to use
>>>>>>> sk local storage to emulate cgroup local storage by storing data in socket.
>>>>>>> But this is a waste as it could be lots of sockets belonging to a particular
>>>>>>> cgroup. Alternatively, a separate map can be created with cgroup id as the key.
>>>>>>> But this will introduce additional overhead to manipulate the new map.
>>>>>>> A cgroup local storage, similar to existing sk/inode/task storage,
>>>>>>> should help for this use case.
>>>>>>
>>>>>>> The life-cycle of storage is managed with the life-cycle of the
>>>>>>> cgroup struct. i.e. the storage is destroyed along with the owning cgroup
>>>>>>> with a callback to the bpf_cgroup_storage_free when cgroup itself
>>>>>>> is deleted.
>>>>>>
>>>>>>> The userspace map operations can be done by using a cgroup fd as a key
>>>>>>> passed to the lookup, update and delete operations.
>>>>>>
>>>>>> [..]
>>>>>>
>>>>>>> Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old cgroup local
>>>>>>> storage support, the new map name BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE is used
>>>>>>> for cgroup storage available to non-cgroup-attached bpf programs. The two
>>>>>>> helpers are named as bpf_cgroup_local_storage_get() and
>>>>>>> bpf_cgroup_local_storage_delete().
>>>>>>
>>>>>> Have you considered doing something similar to 7d9c3427894f ("bpf: Make
>>>>>> cgroup storages shared between programs on the same cgroup") where
>>>>>> the map changes its behavior depending on the key size (see key_size checks
>>>>>> in cgroup_storage_map_alloc)? Looks like sizeof(int) for fd still
>>>>>> can be used so we can, in theory, reuse the name..
>>>>>>
>>>>>> Pros:
>>>>>> - no need for a new map name
>>>>>>
>>>>>> Cons:
>>>>>> - existing BPF_MAP_TYPE_CGROUP_STORAGE is already messy; might be not a
>>>>>> good idea to add more stuff to it?
>>>>>>
>>>>>> But, for the very least, should we also extend
>>>>>> Documentation/bpf/map_cgroup_storage.rst to cover the new map? We've
>>>>>> tried to keep some of the important details in there..
>>>>>
>>>>> This might be a long shot, but is it possible to switch completely to
>>>>> this new generic cgroup storage, and for programs that attach to
>>>>> cgroups we can still do lookups/allocations during attachment like we
>>>>> do today? IOW, maintain the current API for cgroup progs but switch it
>>>>> to use this new map type instead.
>>>>>
>>>>> It feels like this map type is more generic and can be a superset of
>>>>> the existing cgroup storage, but I feel like I am missing something.
>>>>
>>>> I feel like the biggest issue is that the existing
>>>> bpf_get_local_storage helper is guaranteed to always return non-null
>>>> and the verifier doesn't require the programs to do null checks on it;
>>>> the new helper might return NULL making all existing programs fail the
>>>> verifier.
>>>
>>> What I meant is, keep the old bpf_get_local_storage helper only for
>>> cgroup-attached programs like we have today, and add a new generic
>>> bpf_cgroup_local_storage_get() helper.
>>>
>>> For cgroup-attached programs, make sure a cgroup storage entry is
>>> allocated and hooked to the helper on program attach time, to keep
>>> today's behavior constant.
>>>
>>> For other programs, the bpf_cgroup_local_storage_get() will do the
>>> normal lookup and allocate if necessary.
>>>
>>> Does this make any sense to you?
>>
>> But then you also need to somehow mark these to make sure it's not
>> possible to delete them as long as the program is loaded/attached? Not
>> saying it's impossible, but it's a bit of a departure from the
>> existing common local storage framework used by inode/task; not sure
>> whether we want to pull all this complexity in there? But we can
>> definitely try if there is a wider agreement..
> 
> I agree that it's not ideal, but it feels like we are comparing two
> non-ideal options anyway, I am just throwing ideas around :)

I don't think it is a good idea to marry the new
BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE and the existing BPF_MAP_TYPE_CGROUP_STORAGE
in any way. The API is very different; a few differences have already been
mentioned here: delete is one, storage creation time is another, and the map
key is also different. Yes, maybe we can reuse the different key-size concept
from bpf_cgroup_storage_key in some way, but it still feels like too many
unnecessary quirks for the existing sk/inode/task storage users to remember.

imo, it is better to keep them separate and have a different map-type. Adding
a map flag or using map_extra would make it sound like an extension, which it
is not.

>>
>>>> There might be something else I don't remember at this point (besides
>>>> that weird per-prog_type that we'd have to emulate as well)..
>>>
>>> Yeah there are things that will need to be emulated, but I feel like
>>> we may end up with less confusing code (and less code in general).

^ permalink raw reply	[flat|nested] 38+ messages in thread
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs
  2022-10-17 21:07             ` Martin KaFai Lau
@ 2022-10-17 21:23               ` Yosry Ahmed
  2022-10-17 23:55                 ` Martin KaFai Lau
  2022-10-17 22:16               ` sdf
  1 sibling, 1 reply; 38+ messages in thread
From: Yosry Ahmed @ 2022-10-17 21:23 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: Yonghong Song, Stanislav Fomichev, bpf, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, kernel-team, KP Singh,
	Martin KaFai Lau, Tejun Heo

On Mon, Oct 17, 2022 at 2:07 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 10/17/22 12:11 PM, Yosry Ahmed wrote:
> > On Mon, Oct 17, 2022 at 12:07 PM Stanislav Fomichev <sdf@google.com> wrote:
> >>
> >> [..]
> >>
> >> But then you also need to somehow mark these to make sure it's not
> >> possible to delete them as long as the program is loaded/attached? Not
> >> saying it's impossible, but it's a bit of a departure from the
> >> existing common local storage framework used by inode/task; not sure
> >> whether we want to pull all this complexity in there? But we can
> >> definitely try if there is a wider agreement..
> >
> > I agree that it's not ideal, but it feels like we are comparing two
> > non-ideal options anyway, I am just throwing ideas around :)
>
> I don't think it is a good idea to marry the new
> BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE and the existing BPF_MAP_TYPE_CGROUP_STORAGE
> in any way. The API is very different. A few have already been mentioned here.
> Delete is one. Storage creation time is another one. The map key is also
> different. Yes, maybe we can reuse the different key size concept in
> bpf_cgroup_storage_key in some way but still feel too much unnecessary quirks
> for the existing sk/inode/task storage users to remember.
>
> imo, it is better to keep them separate and have a different map-type. Adding a
> map flag or using map extra will make it sounds like an extension which it is not.

I was actually proposing considering the existing cgroup storage as an
extension to the new cgroup local storage. Basically the new cgroup
local storage is a generic cgroup-indexed map, and for cgroup-attached
programs they get some nice extensions, such as preallocation (create
local storage on attachment) and fast lookups (stash a pointer to the
attached cgroup storage for direct access). There are, of course, some
quirks, but it felt to me like something that is easier to reason
about, and less code to maintain.

For the helpers, we can maintain the existing one and generalize it
(get the local storage for my cgroup), and add a new one that we pass
the cgroup into (as in this patch).

My idea is not to have a different flag or key size, but just
basically rework the existing cgroup storage as an extension to the
new one for cgroup-attached programs.

Anyway, like I said I was just throwing ideas around, you have a lot
more background here than me :)

> >>
> >>>> There might be something else I don't remember at this point (besides
> >>>> that weird per-prog_type that we'd have to emulate as well)..
> >>>
> >>> Yeah there are things that will need to be emulated, but I feel like
> >>> we may end up with less confusing code (and less code in general).

^ permalink raw reply	[flat|nested] 38+ messages in thread
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs 2022-10-17 21:23 ` Yosry Ahmed @ 2022-10-17 23:55 ` Martin KaFai Lau 2022-10-18 0:47 ` Yosry Ahmed 0 siblings, 1 reply; 38+ messages in thread From: Martin KaFai Lau @ 2022-10-17 23:55 UTC (permalink / raw) To: Yosry Ahmed Cc: Yonghong Song, Stanislav Fomichev, bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo On 10/17/22 2:23 PM, Yosry Ahmed wrote: > On Mon, Oct 17, 2022 at 2:07 PM Martin KaFai Lau <martin.lau@linux.dev> wrote: >> >> On 10/17/22 12:11 PM, Yosry Ahmed wrote: >>> On Mon, Oct 17, 2022 at 12:07 PM Stanislav Fomichev <sdf@google.com> wrote: >>>> >>>> On Mon, Oct 17, 2022 at 11:47 AM Yosry Ahmed <yosryahmed@google.com> wrote: >>>>> >>>>> On Mon, Oct 17, 2022 at 11:43 AM Stanislav Fomichev <sdf@google.com> wrote: >>>>>> >>>>>> On Mon, Oct 17, 2022 at 11:26 AM Yosry Ahmed <yosryahmed@google.com> wrote: >>>>>>> >>>>>>> On Mon, Oct 17, 2022 at 11:02 AM <sdf@google.com> wrote: >>>>>>>> >>>>>>>> On 10/13, Yonghong Song wrote: >>>>>>>>> Similar to sk/inode/task storage, implement similar cgroup local storage. >>>>>>>> >>>>>>>>> There already exists a local storage implementation for cgroup-attached >>>>>>>>> bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper >>>>>>>>> bpf_get_local_storage(). But there are use cases such that non-cgroup >>>>>>>>> attached bpf progs wants to access cgroup local storage data. For example, >>>>>>>>> tc egress prog has access to sk and cgroup. It is possible to use >>>>>>>>> sk local storage to emulate cgroup local storage by storing data in >>>>>>>>> socket. >>>>>>>>> But this is a waste as it could be lots of sockets belonging to a >>>>>>>>> particular >>>>>>>>> cgroup. Alternatively, a separate map can be created with cgroup id as >>>>>>>>> the key. >>>>>>>>> But this will introduce additional overhead to manipulate the new map. 
>>>>>>>>> A cgroup local storage, similar to existing sk/inode/task storage, >>>>>>>>> should help for this use case. >>>>>>>> >>>>>>>>> The life-cycle of storage is managed with the life-cycle of the >>>>>>>>> cgroup struct. i.e. the storage is destroyed along with the owning cgroup >>>>>>>>> with a callback to the bpf_cgroup_storage_free when cgroup itself >>>>>>>>> is deleted. >>>>>>>> >>>>>>>>> The userspace map operations can be done by using a cgroup fd as a key >>>>>>>>> passed to the lookup, update and delete operations. >>>>>>>> >>>>>>>> >>>>>>>> [..] >>>>>>>> >>>>>>>>> Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old cgroup >>>>>>>>> local >>>>>>>>> storage support, the new map name BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE is >>>>>>>>> used >>>>>>>>> for cgroup storage available to non-cgroup-attached bpf programs. The two >>>>>>>>> helpers are named as bpf_cgroup_local_storage_get() and >>>>>>>>> bpf_cgroup_local_storage_delete(). >>>>>>>> >>>>>>>> Have you considered doing something similar to 7d9c3427894f ("bpf: Make >>>>>>>> cgroup storages shared between programs on the same cgroup") where >>>>>>>> the map changes its behavior depending on the key size (see key_size checks >>>>>>>> in cgroup_storage_map_alloc)? Looks like sizeof(int) for fd still >>>>>>>> can be used so we can, in theory, reuse the name.. >>>>>>>> >>>>>>>> Pros: >>>>>>>> - no need for a new map name >>>>>>>> >>>>>>>> Cons: >>>>>>>> - existing BPF_MAP_TYPE_CGROUP_STORAGE is already messy; might be not a >>>>>>>> good idea to add more stuff to it? >>>>>>>> >>>>>>>> But, for the very least, should we also extend >>>>>>>> Documentation/bpf/map_cgroup_storage.rst to cover the new map? We've >>>>>>>> tried to keep some of the important details in there.. 
>>>>>>> >>>>>>> This might be a long shot, but is it possible to switch completely to >>>>>>> this new generic cgroup storage, and for programs that attach to >>>>>>> cgroups we can still do lookups/allocations during attachment like we >>>>>>> do today? IOW, maintain the current API for cgroup progs but switch it >>>>>>> to use this new map type instead. >>>>>>> >>>>>>> It feels like this map type is more generic and can be a superset of >>>>>>> the existing cgroup storage, but I feel like I am missing something. >>>>>> >>>>>> I feel like the biggest issue is that the existing >>>>>> bpf_get_local_storage helper is guaranteed to always return non-null >>>>>> and the verifier doesn't require the programs to do null checks on it; >>>>>> the new helper might return NULL making all existing programs fail the >>>>>> verifier. >>>>> >>>>> What I meant is, keep the old bpf_get_local_storage helper only for >>>>> cgroup-attached programs like we have today, and add a new generic >>>>> bpf_cgroup_local_storage_get() helper. >>>>> >>>>> For cgroup-attached programs, make sure a cgroup storage entry is >>>>> allocated and hooked to the helper on program attach time, to keep >>>>> today's behavior constant. >>>>> >>>>> For other programs, the bpf_cgroup_local_storage_get() will do the >>>>> normal lookup and allocate if necessary. >>>>> >>>>> Does this make any sense to you? >>>> >>>> But then you also need to somehow mark these to make sure it's not >>>> possible to delete them as long as the program is loaded/attached? Not >>>> saying it's impossible, but it's a bit of a departure from the >>>> existing common local storage framework used by inode/task; not sure >>>> whether we want to pull all this complexity in there? But we can >>>> definitely try if there is a wider agreement.. 
>>>
>>> I agree that it's not ideal, but it feels like we are comparing two
>>> non-ideal options anyway, I am just throwing ideas around :)
>>
>> I don't think it is a good idea to marry the new
>> BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE and the existing BPF_MAP_TYPE_CGROUP_STORAGE
>> in any way. The API is very different. A few differences have already been
>> mentioned here. Delete is one. Storage creation time is another. The map key
>> is also different. Yes, maybe we can reuse the different key size concept in
>> bpf_cgroup_storage_key in some way, but it still feels like too many
>> unnecessary quirks for the existing sk/inode/task storage users to remember.
>>
>> imo, it is better to keep them separate and have a different map-type. Adding
>> a map flag or using map_extra would make it sound like an extension, which it
>> is not.
>
> I was actually proposing considering the existing cgroup storage as an
> extension to the new cgroup local storage. Basically the new cgroup
> local storage is a generic cgroup-indexed map, and for cgroup-attached
> programs they get some nice extensions, such as preallocation (create
> local storage on attachment) and fast lookups (stash a pointer to the
> attached cgroup storage for direct access). There are, of course, some
> quirks, but it felt to me like something that is easier to reason
> about, and less code to maintain.

Like extending the new BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE implementation and
adding code to make it work like the existing BPF_MAP_TYPE_CGROUP_STORAGE, so
that the existing code can go away?

hmm..... A quick thought is that it is probably not worth it for the code
removal alone. If all use cases can be satisfied by
BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, retiring the existing one eventually may be
a cleaner answer than refactoring it.

Pre-allocation could be useful. User space can do it by using the
bpf_map_update_elem syscall with the new BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE
before attaching the program.

For fast-lookup/stash-pointer, yes, the current limitation that a bpf prog can
use only one BPF_MAP_TYPE_CGROUP_STORAGE makes this easier. However, afaik,
the existing bpf_get_local_storage() is also doing
current->bpf_ctx->prog_item->cgroup_storage. It is not clear to me which one
is faster though. It needs a micro benchmark to tell.

Also, there is quite a lot of code in local_storage.c, and I am not sure all
of it makes sense for the new BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE to support,
e.g. ".map_get_next_key = cgroup_storage_get_next_key". The new
BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE does not support iteration from user space
because it has bpf_iter, which supports iteration by a bpf prog that can get
directly to the kernel ptr (task/sk/...) instead of an fd.

In the future, we will add features to bpf_local_storage.c that will work for
all kernel objects whenever possible, e.g. adding map-in-map to the
sk/inode/task/cgroup local storage and storing a ring-buf map in the sk (eg)
storage. The inner map may not always make sense to create during
cgroup-attach time, and it would be another exception to make for the
alloc-during-cgroup-attach behavior.

>
> For the helpers, we can maintain the existing one and generalize it
> (get the local storage for my cgroup), and add a new one that we pass
> the cgroup into (as in this patch).
>
> My idea is not to have a different flag or key size, but just
> basically rework the existing cgroup storage as an extension to the
> new one for cgroup-attached programs.

^ permalink raw reply	[flat|nested] 38+ messages in thread
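The API contrast being debated above (the old helper is guaranteed non-NULL, the new one may fail) can be sketched on the BPF side roughly as follows. This is a hypothetical sketch based on the helper and map names proposed in this series, not code from the patch; it assumes the cgroup pointer is reachable via the current task, as in tracing programs:

```c
/*
 * Hypothetical sketch only. Assumes the names proposed in this series
 * (BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, bpf_cgroup_local_storage_get) and
 * that task->cgroups->dfl_cgrp is a valid way to reach the cgroup here.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__type(key, int);
	__type(value, long);
} cgrp_map SEC(".maps");

SEC("tp_btf/sys_enter")
int count_per_cgroup(void *ctx)
{
	struct task_struct *task = bpf_get_current_task_btf();
	long *val;

	/* May allocate on first access, but may also return NULL. */
	val = bpf_cgroup_local_storage_get(&cgrp_map,
					   task->cgroups->dfl_cgrp, 0,
					   BPF_LOCAL_STORAGE_GET_F_CREATE);
	if (!val)	/* mandatory check, unlike bpf_get_local_storage() */
		return 0;
	__sync_fetch_and_add(val, 1);
	return 0;
}

char _license[] SEC("license") = "GPL";
```

The NULL check is the crux of the incompatibility discussed above: the verifier rejects a program that dereferences the returned pointer without it, so existing bpf_get_local_storage() call sites cannot be silently switched to the new helper.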
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs 2022-10-17 23:55 ` Martin KaFai Lau @ 2022-10-18 0:47 ` Yosry Ahmed 0 siblings, 0 replies; 38+ messages in thread From: Yosry Ahmed @ 2022-10-18 0:47 UTC (permalink / raw) To: Martin KaFai Lau Cc: Yonghong Song, Stanislav Fomichev, bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo On Mon, Oct 17, 2022 at 4:55 PM Martin KaFai Lau <martin.lau@linux.dev> wrote: > > On 10/17/22 2:23 PM, Yosry Ahmed wrote: > > On Mon, Oct 17, 2022 at 2:07 PM Martin KaFai Lau <martin.lau@linux.dev> wrote: > >> > >> On 10/17/22 12:11 PM, Yosry Ahmed wrote: > >>> On Mon, Oct 17, 2022 at 12:07 PM Stanislav Fomichev <sdf@google.com> wrote: > >>>> > >>>> On Mon, Oct 17, 2022 at 11:47 AM Yosry Ahmed <yosryahmed@google.com> wrote: > >>>>> > >>>>> On Mon, Oct 17, 2022 at 11:43 AM Stanislav Fomichev <sdf@google.com> wrote: > >>>>>> > >>>>>> On Mon, Oct 17, 2022 at 11:26 AM Yosry Ahmed <yosryahmed@google.com> wrote: > >>>>>>> > >>>>>>> On Mon, Oct 17, 2022 at 11:02 AM <sdf@google.com> wrote: > >>>>>>>> > >>>>>>>> On 10/13, Yonghong Song wrote: > >>>>>>>>> Similar to sk/inode/task storage, implement similar cgroup local storage. > >>>>>>>> > >>>>>>>>> There already exists a local storage implementation for cgroup-attached > >>>>>>>>> bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper > >>>>>>>>> bpf_get_local_storage(). But there are use cases such that non-cgroup > >>>>>>>>> attached bpf progs wants to access cgroup local storage data. For example, > >>>>>>>>> tc egress prog has access to sk and cgroup. It is possible to use > >>>>>>>>> sk local storage to emulate cgroup local storage by storing data in > >>>>>>>>> socket. > >>>>>>>>> But this is a waste as it could be lots of sockets belonging to a > >>>>>>>>> particular > >>>>>>>>> cgroup. Alternatively, a separate map can be created with cgroup id as > >>>>>>>>> the key. 
> >>>>>>>>> But this will introduce additional overhead to manipulate the new map. > >>>>>>>>> A cgroup local storage, similar to existing sk/inode/task storage, > >>>>>>>>> should help for this use case. > >>>>>>>> > >>>>>>>>> The life-cycle of storage is managed with the life-cycle of the > >>>>>>>>> cgroup struct. i.e. the storage is destroyed along with the owning cgroup > >>>>>>>>> with a callback to the bpf_cgroup_storage_free when cgroup itself > >>>>>>>>> is deleted. > >>>>>>>> > >>>>>>>>> The userspace map operations can be done by using a cgroup fd as a key > >>>>>>>>> passed to the lookup, update and delete operations. > >>>>>>>> > >>>>>>>> > >>>>>>>> [..] > >>>>>>>> > >>>>>>>>> Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old cgroup > >>>>>>>>> local > >>>>>>>>> storage support, the new map name BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE is > >>>>>>>>> used > >>>>>>>>> for cgroup storage available to non-cgroup-attached bpf programs. The two > >>>>>>>>> helpers are named as bpf_cgroup_local_storage_get() and > >>>>>>>>> bpf_cgroup_local_storage_delete(). > >>>>>>>> > >>>>>>>> Have you considered doing something similar to 7d9c3427894f ("bpf: Make > >>>>>>>> cgroup storages shared between programs on the same cgroup") where > >>>>>>>> the map changes its behavior depending on the key size (see key_size checks > >>>>>>>> in cgroup_storage_map_alloc)? Looks like sizeof(int) for fd still > >>>>>>>> can be used so we can, in theory, reuse the name.. > >>>>>>>> > >>>>>>>> Pros: > >>>>>>>> - no need for a new map name > >>>>>>>> > >>>>>>>> Cons: > >>>>>>>> - existing BPF_MAP_TYPE_CGROUP_STORAGE is already messy; might be not a > >>>>>>>> good idea to add more stuff to it? > >>>>>>>> > >>>>>>>> But, for the very least, should we also extend > >>>>>>>> Documentation/bpf/map_cgroup_storage.rst to cover the new map? We've > >>>>>>>> tried to keep some of the important details in there.. 
> >>>>>>> > >>>>>>> This might be a long shot, but is it possible to switch completely to > >>>>>>> this new generic cgroup storage, and for programs that attach to > >>>>>>> cgroups we can still do lookups/allocations during attachment like we > >>>>>>> do today? IOW, maintain the current API for cgroup progs but switch it > >>>>>>> to use this new map type instead. > >>>>>>> > >>>>>>> It feels like this map type is more generic and can be a superset of > >>>>>>> the existing cgroup storage, but I feel like I am missing something. > >>>>>> > >>>>>> I feel like the biggest issue is that the existing > >>>>>> bpf_get_local_storage helper is guaranteed to always return non-null > >>>>>> and the verifier doesn't require the programs to do null checks on it; > >>>>>> the new helper might return NULL making all existing programs fail the > >>>>>> verifier. > >>>>> > >>>>> What I meant is, keep the old bpf_get_local_storage helper only for > >>>>> cgroup-attached programs like we have today, and add a new generic > >>>>> bpf_cgroup_local_storage_get() helper. > >>>>> > >>>>> For cgroup-attached programs, make sure a cgroup storage entry is > >>>>> allocated and hooked to the helper on program attach time, to keep > >>>>> today's behavior constant. > >>>>> > >>>>> For other programs, the bpf_cgroup_local_storage_get() will do the > >>>>> normal lookup and allocate if necessary. > >>>>> > >>>>> Does this make any sense to you? > >>>> > >>>> But then you also need to somehow mark these to make sure it's not > >>>> possible to delete them as long as the program is loaded/attached? Not > >>>> saying it's impossible, but it's a bit of a departure from the > >>>> existing common local storage framework used by inode/task; not sure > >>>> whether we want to pull all this complexity in there? But we can > >>>> definitely try if there is a wider agreement.. 
> >>> > >>> I agree that it's not ideal, but it feels like we are comparing two > >>> non-ideal options anyway, I am just throwing ideas around :) > >> > >> I don't think it is a good idea to marry the new > >> BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE and the existing BPF_MAP_TYPE_CGROUP_STORAGE > >> in any way. The API is very different. A few have already been mentioned here. > >> Delete is one. Storage creation time is another one. The map key is also > >> different. Yes, maybe we can reuse the different key size concept in > >> bpf_cgroup_storage_key in some way but still feel too much unnecessary quirks > >> for the existing sk/inode/task storage users to remember. > >> > >> imo, it is better to keep them separate and have a different map-type. Adding a > >> map flag or using map extra will make it sounds like an extension which it is not. > > > > I was actually proposing considering the existing cgroup storage as an > > extension to the new cgroup local storage. Basically the new cgroup > > local storage is a generic cgroup-indexed map, and for cgroup-attached > > programs they get some nice extensions, such as preallocation (create > > local storage on attachment) and fast lookups (stash a pointer to the > > attached cgroup storage for direct access). There are, of course, some > > quirks, but it felt to me like something that is easier to reason > > about, and less code to maintain > Like extending the new BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE implementation and add > codes to make it work like the existing BPF_MAP_TYPE_CGROUP_STORAGE such that > those existing code can go away? > > hmm..... A quick thought is it probably does not worth it for the code removal > purpose alone. If all use cases can be satisfied by the > BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, retiring the existing one eventually may be a > cleaner answer instead of re-factoring it. > > Pre-allocation could be useful. 
The user space can do it by using
> bpf_map_update_elem syscall with the new BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE
> before attaching the program.
>
> For fast-lookup/stash-pointer, yes, the current limitation that a bpf prog can
> use only one BPF_MAP_TYPE_CGROUP_STORAGE makes this easier. However, afaik,
> the existing bpf_get_local_storage() is also doing
> current->bpf_ctx->prog_item->cgroup_storage. It is not clear to me which one
> is faster though. It needs a micro benchmark to tell.
>
> Also, there is quite a lot of code in local_storage.c, and I am not sure all
> of it makes sense for the new BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE to support,
> e.g. ".map_get_next_key = cgroup_storage_get_next_key". The new
> BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE does not support iteration from user space
> because it has bpf_iter, which supports iteration by a bpf prog that can get
> directly to the kernel ptr (task/sk/...) instead of an fd.
>
> In the future, we will add features to bpf_local_storage.c that will work for
> all kernel objects whenever possible, e.g. adding map-in-map to the
> sk/inode/task/cgroup local storage and storing a ring-buf map in the sk (eg)
> storage. The inner map may not always make sense to create during
> cgroup-attach time, and it would be another exception to make for the
> alloc-during-cgroup-attach behavior.
>
> >
> > For the helpers, we can maintain the existing one and generalize it
> > (get the local storage for my cgroup), and add a new one that we pass
> > the cgroup into (as in this patch).
> >
> > My idea is not to have a different flag or key size, but just
> > basically rework the existing cgroup storage as an extension to the
> > new one for cgroup-attached programs.

I see what you mean, thanks for clarifying your thoughts. I think retiring the
old cgroup storage at some point, if all its use cases become covered by the
new cgroup local storage, is the cleaner answer. Meanwhile, we will need clear
docs in the code and for users to draw a clear distinction between the two map
types.

^ permalink raw reply	[flat|nested] 38+ messages in thread
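The pre-allocation idea discussed above (create the storage from user space before any program is attached) can be sketched with libbpf's syscall wrappers. This is a hypothetical sketch assuming the semantics proposed in this series, where the new map type takes a cgroup fd as its user-space key; error handling is trimmed and the path is illustrative:

```c
/*
 * Hypothetical sketch: pre-allocate the cgroup local storage entry for
 * one cgroup before attaching the program, so the first in-kernel
 * lookup never has to allocate.
 */
#include <fcntl.h>
#include <unistd.h>
#include <bpf/bpf.h>

int preallocate_cgroup_storage(int map_fd, const char *cgroup_path)
{
	long init_val = 0;
	int err;
	int cgroup_fd = open(cgroup_path, O_RDONLY);

	if (cgroup_fd < 0)
		return -1;
	/* Creates the storage for this cgroup if it does not exist yet. */
	err = bpf_map_update_elem(map_fd, &cgroup_fd, &init_val, BPF_ANY);
	close(cgroup_fd);
	return err;
}
```

This keeps the preallocation policy entirely in user space, which is what makes it possible without adding attach-time hooks to the common bpf_local_storage framework.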
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs 2022-10-17 21:07 ` Martin KaFai Lau 2022-10-17 21:23 ` Yosry Ahmed @ 2022-10-17 22:16 ` sdf 2022-10-18 0:52 ` Martin KaFai Lau 1 sibling, 1 reply; 38+ messages in thread From: sdf @ 2022-10-17 22:16 UTC (permalink / raw) To: Martin KaFai Lau Cc: Yosry Ahmed, Yonghong Song, bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo On 10/17, Martin KaFai Lau wrote: > On 10/17/22 12:11 PM, Yosry Ahmed wrote: > > On Mon, Oct 17, 2022 at 12:07 PM Stanislav Fomichev <sdf@google.com> > wrote: > > > > > > On Mon, Oct 17, 2022 at 11:47 AM Yosry Ahmed <yosryahmed@google.com> > wrote: > > > > > > > > On Mon, Oct 17, 2022 at 11:43 AM Stanislav Fomichev > <sdf@google.com> wrote: > > > > > > > > > > On Mon, Oct 17, 2022 at 11:26 AM Yosry Ahmed > <yosryahmed@google.com> wrote: > > > > > > > > > > > > On Mon, Oct 17, 2022 at 11:02 AM <sdf@google.com> wrote: > > > > > > > > > > > > > > On 10/13, Yonghong Song wrote: > > > > > > > > Similar to sk/inode/task storage, implement similar cgroup > local storage. > > > > > > > > > > > > > > > There already exists a local storage implementation for > cgroup-attached > > > > > > > > bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and > helper > > > > > > > > bpf_get_local_storage(). But there are use cases such that > non-cgroup > > > > > > > > attached bpf progs wants to access cgroup local storage > data. For example, > > > > > > > > tc egress prog has access to sk and cgroup. It is possible > to use > > > > > > > > sk local storage to emulate cgroup local storage by storing > data in > > > > > > > > socket. > > > > > > > > But this is a waste as it could be lots of sockets > belonging to a > > > > > > > > particular > > > > > > > > cgroup. Alternatively, a separate map can be created with > cgroup id as > > > > > > > > the key. 
> > > > > > > > But this will introduce additional overhead to manipulate > the new map. > > > > > > > > A cgroup local storage, similar to existing sk/inode/task > storage, > > > > > > > > should help for this use case. > > > > > > > > > > > > > > > The life-cycle of storage is managed with the life-cycle of > the > > > > > > > > cgroup struct. i.e. the storage is destroyed along with > the owning cgroup > > > > > > > > with a callback to the bpf_cgroup_storage_free when cgroup > itself > > > > > > > > is deleted. > > > > > > > > > > > > > > > The userspace map operations can be done by using a cgroup > fd as a key > > > > > > > > passed to the lookup, update and delete operations. > > > > > > > > > > > > > > > > > > > > > [..] > > > > > > > > > > > > > > > Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used > for old cgroup > > > > > > > > local > > > > > > > > storage support, the new map name > BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE is > > > > > > > > used > > > > > > > > for cgroup storage available to non-cgroup-attached bpf > programs. The two > > > > > > > > helpers are named as bpf_cgroup_local_storage_get() and > > > > > > > > bpf_cgroup_local_storage_delete(). > > > > > > > > > > > > > > Have you considered doing something similar to 7d9c3427894f > ("bpf: Make > > > > > > > cgroup storages shared between programs on the same cgroup") > where > > > > > > > the map changes its behavior depending on the key size (see > key_size checks > > > > > > > in cgroup_storage_map_alloc)? Looks like sizeof(int) for fd > still > > > > > > > can be used so we can, in theory, reuse the name.. > > > > > > > > > > > > > > Pros: > > > > > > > - no need for a new map name > > > > > > > > > > > > > > Cons: > > > > > > > - existing BPF_MAP_TYPE_CGROUP_STORAGE is already messy; > might be not a > > > > > > > good idea to add more stuff to it? 
> > > > > > > > > > > > > > But, for the very least, should we also extend > > > > > > > Documentation/bpf/map_cgroup_storage.rst to cover the new > map? We've > > > > > > > tried to keep some of the important details in there.. > > > > > > > > > > > > This might be a long shot, but is it possible to switch > completely to > > > > > > this new generic cgroup storage, and for programs that attach to > > > > > > cgroups we can still do lookups/allocations during attachment > like we > > > > > > do today? IOW, maintain the current API for cgroup progs but > switch it > > > > > > to use this new map type instead. > > > > > > > > > > > > It feels like this map type is more generic and can be a > superset of > > > > > > the existing cgroup storage, but I feel like I am missing > something. > > > > > > > > > > I feel like the biggest issue is that the existing > > > > > bpf_get_local_storage helper is guaranteed to always return > non-null > > > > > and the verifier doesn't require the programs to do null checks > on it; > > > > > the new helper might return NULL making all existing programs > fail the > > > > > verifier. > > > > > > > > What I meant is, keep the old bpf_get_local_storage helper only for > > > > cgroup-attached programs like we have today, and add a new generic > > > > bpf_cgroup_local_storage_get() helper. > > > > > > > > For cgroup-attached programs, make sure a cgroup storage entry is > > > > allocated and hooked to the helper on program attach time, to keep > > > > today's behavior constant. > > > > > > > > For other programs, the bpf_cgroup_local_storage_get() will do the > > > > normal lookup and allocate if necessary. > > > > > > > > Does this make any sense to you? > > > > > > But then you also need to somehow mark these to make sure it's not > > > possible to delete them as long as the program is loaded/attached? 
Not > > > saying it's impossible, but it's a bit of a departure from the > > > existing common local storage framework used by inode/task; not sure > > > whether we want to pull all this complexity in there? But we can > > > definitely try if there is a wider agreement.. > > > > I agree that it's not ideal, but it feels like we are comparing two > > non-ideal options anyway, I am just throwing ideas around :) > I don't think it is a good idea to marry the new > BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE and the existing > BPF_MAP_TYPE_CGROUP_STORAGE in any way. The API is very different. A few > have already been mentioned here. Delete is one. Storage creation time > is > another one. The map key is also different. Yes, maybe we can reuse the > different key size concept in bpf_cgroup_storage_key in some way but still > feel too much unnecessary quirks for the existing sk/inode/task storage > users to remember. > imo, it is better to keep them separate and have a different map-type. > Adding a map flag or using map extra will make it sounds like an extension > which it is not. This part is the most confusing to me: BPF_MAP_TYPE_CGROUP_STORAGE bpf_get_local_storage BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE bpf_cgroup_local_storage_get The new helpers should probably drop 'local' name to match the task/inode ([0])? And we're left with: BPF_MAP_TYPE_CGROUP_STORAGE bpf_get_local_storage BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE bpf_cgroup_storage_get You read CGROUP_STORAGE via get_local_storage and you read CGROUP_LOCAL_STORAGE via cgroup_storage_get :-/ That's why I'm slightly tilting towards reusing the name. At least we can add a big DEPRECATED message for bpf_get_local_storage and that seems to be it? All those extra key sizes can also be deprecated, but I'm honestly not sure if anybody is using them. But having a separate map also seems fine, as long as we have a patch to update the existing header documentation. 
(and mention in Documentation/bpf/map_cgroup_storage.rst that there is a replacement?) Current bpf_get_local_storage description is too vague; let's at least mention that it works only with BPF_MAP_TYPE_CGROUP_STORAGE. 0: https://lore.kernel.org/bpf/6ce7d490-f015-531f-3dbb-b6f7717f0590@meta.com/T/#mb2107250caa19a8d9ec3549a52f4a9698be99e33 > > > > > > > > There might be something else I don't remember at this point > (besides > > > > > that weird per-prog_type that we'd have to emulate as well).. > > > > > > > > Yeah there are things that will need to be emulated, but I feel like > > > > we may end up with less confusing code (and less code in general). ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs 2022-10-17 22:16 ` sdf @ 2022-10-18 0:52 ` Martin KaFai Lau 2022-10-18 5:59 ` Yonghong Song 0 siblings, 1 reply; 38+ messages in thread From: Martin KaFai Lau @ 2022-10-18 0:52 UTC (permalink / raw) To: sdf Cc: Yosry Ahmed, Yonghong Song, bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo On 10/17/22 3:16 PM, sdf@google.com wrote: > On 10/17, Martin KaFai Lau wrote: >> On 10/17/22 12:11 PM, Yosry Ahmed wrote: >> > On Mon, Oct 17, 2022 at 12:07 PM Stanislav Fomichev <sdf@google.com> wrote: >> > > >> > > On Mon, Oct 17, 2022 at 11:47 AM Yosry Ahmed <yosryahmed@google.com> wrote: >> > > > >> > > > On Mon, Oct 17, 2022 at 11:43 AM Stanislav Fomichev <sdf@google.com> wrote: >> > > > > >> > > > > On Mon, Oct 17, 2022 at 11:26 AM Yosry Ahmed <yosryahmed@google.com> >> wrote: >> > > > > > >> > > > > > On Mon, Oct 17, 2022 at 11:02 AM <sdf@google.com> wrote: >> > > > > > > >> > > > > > > On 10/13, Yonghong Song wrote: >> > > > > > > > Similar to sk/inode/task storage, implement similar cgroup local >> storage. >> > > > > > > >> > > > > > > > There already exists a local storage implementation for >> cgroup-attached >> > > > > > > > bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper >> > > > > > > > bpf_get_local_storage(). But there are use cases such that >> non-cgroup >> > > > > > > > attached bpf progs wants to access cgroup local storage data. >> For example, >> > > > > > > > tc egress prog has access to sk and cgroup. It is possible to use >> > > > > > > > sk local storage to emulate cgroup local storage by storing data in >> > > > > > > > socket. >> > > > > > > > But this is a waste as it could be lots of sockets belonging to a >> > > > > > > > particular >> > > > > > > > cgroup. Alternatively, a separate map can be created with cgroup >> id as >> > > > > > > > the key. 
>> > > > > > > > But this will introduce additional overhead to manipulate the >> new map. >> > > > > > > > A cgroup local storage, similar to existing sk/inode/task storage, >> > > > > > > > should help for this use case. >> > > > > > > >> > > > > > > > The life-cycle of storage is managed with the life-cycle of the >> > > > > > > > cgroup struct. i.e. the storage is destroyed along with the >> owning cgroup >> > > > > > > > with a callback to the bpf_cgroup_storage_free when cgroup itself >> > > > > > > > is deleted. >> > > > > > > >> > > > > > > > The userspace map operations can be done by using a cgroup fd as >> a key >> > > > > > > > passed to the lookup, update and delete operations. >> > > > > > > >> > > > > > > >> > > > > > > [..] >> > > > > > > >> > > > > > > > Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old >> cgroup >> > > > > > > > local >> > > > > > > > storage support, the new map name >> BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE is >> > > > > > > > used >> > > > > > > > for cgroup storage available to non-cgroup-attached bpf >> programs. The two >> > > > > > > > helpers are named as bpf_cgroup_local_storage_get() and >> > > > > > > > bpf_cgroup_local_storage_delete(). >> > > > > > > >> > > > > > > Have you considered doing something similar to 7d9c3427894f ("bpf: >> Make >> > > > > > > cgroup storages shared between programs on the same cgroup") where >> > > > > > > the map changes its behavior depending on the key size (see >> key_size checks >> > > > > > > in cgroup_storage_map_alloc)? Looks like sizeof(int) for fd still >> > > > > > > can be used so we can, in theory, reuse the name.. >> > > > > > > >> > > > > > > Pros: >> > > > > > > - no need for a new map name >> > > > > > > >> > > > > > > Cons: >> > > > > > > - existing BPF_MAP_TYPE_CGROUP_STORAGE is already messy; might be >> not a >> > > > > > > good idea to add more stuff to it? 
>> > > > > > > >> > > > > > > But, for the very least, should we also extend >> > > > > > > Documentation/bpf/map_cgroup_storage.rst to cover the new map? We've >> > > > > > > tried to keep some of the important details in there.. >> > > > > > >> > > > > > This might be a long shot, but is it possible to switch completely to >> > > > > > this new generic cgroup storage, and for programs that attach to >> > > > > > cgroups we can still do lookups/allocations during attachment like we >> > > > > > do today? IOW, maintain the current API for cgroup progs but switch it >> > > > > > to use this new map type instead. >> > > > > > >> > > > > > It feels like this map type is more generic and can be a superset of >> > > > > > the existing cgroup storage, but I feel like I am missing something. >> > > > > >> > > > > I feel like the biggest issue is that the existing >> > > > > bpf_get_local_storage helper is guaranteed to always return non-null >> > > > > and the verifier doesn't require the programs to do null checks on it; >> > > > > the new helper might return NULL making all existing programs fail the >> > > > > verifier. >> > > > >> > > > What I meant is, keep the old bpf_get_local_storage helper only for >> > > > cgroup-attached programs like we have today, and add a new generic >> > > > bpf_cgroup_local_storage_get() helper. >> > > > >> > > > For cgroup-attached programs, make sure a cgroup storage entry is >> > > > allocated and hooked to the helper on program attach time, to keep >> > > > today's behavior constant. >> > > > >> > > > For other programs, the bpf_cgroup_local_storage_get() will do the >> > > > normal lookup and allocate if necessary. >> > > > >> > > > Does this make any sense to you? >> > > >> > > But then you also need to somehow mark these to make sure it's not >> > > possible to delete them as long as the program is loaded/attached? 
Not >> > > saying it's impossible, but it's a bit of a departure from the >> > > existing common local storage framework used by inode/task; not sure >> > > whether we want to pull all this complexity in there? But we can >> > > definitely try if there is a wider agreement.. >> > >> > I agree that it's not ideal, but it feels like we are comparing two >> > non-ideal options anyway, I am just throwing ideas around :) > >> I don't think it is a good idea to marry the new >> BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE and the existing >> BPF_MAP_TYPE_CGROUP_STORAGE in any way. The API is very different. A few >> have already been mentioned here. Delete is one. Storage creation time is >> another one. The map key is also different. Yes, maybe we can reuse the >> different key size concept in bpf_cgroup_storage_key in some way but still >> feel too much unnecessary quirks for the existing sk/inode/task storage >> users to remember. > >> imo, it is better to keep them separate and have a different map-type. >> Adding a map flag or using map extra will make it sounds like an extension >> which it is not. > > This part is the most confusing to me: > > BPF_MAP_TYPE_CGROUP_STORAGE bpf_get_local_storage > BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE bpf_cgroup_local_storage_get > > The new helpers should probably drop 'local' name to match the task/inode ([0])? > And we're left with: > > BPF_MAP_TYPE_CGROUP_STORAGE bpf_get_local_storage > BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE bpf_cgroup_storage_get > > You read CGROUP_STORAGE via get_local_storage and > you read CGROUP_LOCAL_STORAGE via cgroup_storage_get :-/ Yep, agree that it is not ideal :( > > That's why I'm slightly tilting towards reusing the name. At least we can > add a big DEPRECATED message for bpf_get_local_storage and that seems to be > it? All those extra key sizes can also be deprecated, but I'm honestly > not sure if anybody is using them. Reusing 'key_size == sizeof(int)' to mean new map type...hmm... 
I have been thinking about it after your suggestion in another reply, since it
could keep using the BPF_MAP_TYPE_CGROUP_STORAGE name. I wish
BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE had been given to bpf_get_local_storage()
instead, because it is a better name to describe what it is doing. hmm....
However, this feels like it works as a map_flags or map_extra but in a more
hidden way. I am worried it will actually be more confusing and also have
usage surprises, given the many behavior differences this thread has already
mentioned. It will be hard for the user to reason about those API differences
just from a different key_size.

Maybe go back and revisit the naming a little bit. How about giving a new and
likely more correct 'BPF_MAP_TYPE_CGRP_LOCAL_STORAGE' name to the existing
bpf_get_local_storage() use, and then adding '#define
BPF_MAP_TYPE_CGROUP_STORAGE BPF_MAP_TYPE_CGRP_LOCAL_STORAGE /* deprecated by
BPF_MAP_TYPE_CGRP_STORAGE */' in the uapi? The new cgroup storage would use
the shorter name "cgrp", like BPF_MAP_TYPE_CGRP_STORAGE and
bpf_cgrp_storage_get().

>
> But having a separate map also seems fine, as long as we have a patch to
> update the existing header documentation. (and mention in
> Documentation/bpf/map_cgroup_storage.rst that there is a replacement?)
> Current bpf_get_local_storage description is too vague; let's at least
> mention that it works only with BPF_MAP_TYPE_CGROUP_STORAGE.
>
> 0:
> https://lore.kernel.org/bpf/6ce7d490-f015-531f-3dbb-b6f7717f0590@meta.com/T/#mb2107250caa19a8d9ec3549a52f4a9698be99e33
>
>> > >
>> > > > > There might be something else I don't remember at this point (besides
>> > > > > that weird per-prog_type that we'd have to emulate as well)..
>> > > >
>> > > > Yeah there are things that will need to be emulated, but I feel like
>> > > > we may end up with less confusing code (and less code in general).

^ permalink raw reply	[flat|nested] 38+ messages in thread
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs 2022-10-18 0:52 ` Martin KaFai Lau @ 2022-10-18 5:59 ` Yonghong Song 2022-10-18 17:08 ` sdf 0 siblings, 1 reply; 38+ messages in thread From: Yonghong Song @ 2022-10-18 5:59 UTC (permalink / raw) To: Martin KaFai Lau, sdf Cc: Yosry Ahmed, Yonghong Song, bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo On 10/17/22 5:52 PM, Martin KaFai Lau wrote: > On 10/17/22 3:16 PM, sdf@google.com wrote: >> On 10/17, Martin KaFai Lau wrote: >>> On 10/17/22 12:11 PM, Yosry Ahmed wrote: >>> > On Mon, Oct 17, 2022 at 12:07 PM Stanislav Fomichev >>> <sdf@google.com> wrote: >>> > > >>> > > On Mon, Oct 17, 2022 at 11:47 AM Yosry Ahmed >>> <yosryahmed@google.com> wrote: >>> > > > >>> > > > On Mon, Oct 17, 2022 at 11:43 AM Stanislav Fomichev >>> <sdf@google.com> wrote: >>> > > > > >>> > > > > On Mon, Oct 17, 2022 at 11:26 AM Yosry Ahmed >>> <yosryahmed@google.com> wrote: >>> > > > > > >>> > > > > > On Mon, Oct 17, 2022 at 11:02 AM <sdf@google.com> wrote: >>> > > > > > > >>> > > > > > > On 10/13, Yonghong Song wrote: >>> > > > > > > > Similar to sk/inode/task storage, implement similar >>> cgroup local storage. >>> > > > > > > >>> > > > > > > > There already exists a local storage implementation for >>> cgroup-attached >>> > > > > > > > bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE >>> and helper >>> > > > > > > > bpf_get_local_storage(). But there are use cases such >>> that non-cgroup >>> > > > > > > > attached bpf progs wants to access cgroup local storage >>> data. For example, >>> > > > > > > > tc egress prog has access to sk and cgroup. It is >>> possible to use >>> > > > > > > > sk local storage to emulate cgroup local storage by >>> storing data in >>> > > > > > > > socket. >>> > > > > > > > But this is a waste as it could be lots of sockets >>> belonging to a >>> > > > > > > > particular >>> > > > > > > > cgroup. 
Alternatively, a separate map can be created >>> with cgroup id as >>> > > > > > > > the key. >>> > > > > > > > But this will introduce additional overhead to >>> manipulate the new map. >>> > > > > > > > A cgroup local storage, similar to existing >>> sk/inode/task storage, >>> > > > > > > > should help for this use case. >>> > > > > > > >>> > > > > > > > The life-cycle of storage is managed with the >>> life-cycle of the >>> > > > > > > > cgroup struct. i.e. the storage is destroyed along >>> with the owning cgroup >>> > > > > > > > with a callback to the bpf_cgroup_storage_free when >>> cgroup itself >>> > > > > > > > is deleted. >>> > > > > > > >>> > > > > > > > The userspace map operations can be done by using a >>> cgroup fd as a key >>> > > > > > > > passed to the lookup, update and delete operations. >>> > > > > > > >>> > > > > > > >>> > > > > > > [..] >>> > > > > > > >>> > > > > > > > Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been >>> used for old cgroup >>> > > > > > > > local >>> > > > > > > > storage support, the new map name >>> BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE is >>> > > > > > > > used >>> > > > > > > > for cgroup storage available to non-cgroup-attached bpf >>> programs. The two >>> > > > > > > > helpers are named as bpf_cgroup_local_storage_get() and >>> > > > > > > > bpf_cgroup_local_storage_delete(). >>> > > > > > > >>> > > > > > > Have you considered doing something similar to >>> 7d9c3427894f ("bpf: Make >>> > > > > > > cgroup storages shared between programs on the same >>> cgroup") where >>> > > > > > > the map changes its behavior depending on the key size >>> (see key_size checks >>> > > > > > > in cgroup_storage_map_alloc)? Looks like sizeof(int) for >>> fd still >>> > > > > > > can be used so we can, in theory, reuse the name.. 
>>> > > > > > > >>> > > > > > > Pros: >>> > > > > > > - no need for a new map name >>> > > > > > > >>> > > > > > > Cons: >>> > > > > > > - existing BPF_MAP_TYPE_CGROUP_STORAGE is already messy; >>> might be not a >>> > > > > > > good idea to add more stuff to it? >>> > > > > > > >>> > > > > > > But, for the very least, should we also extend >>> > > > > > > Documentation/bpf/map_cgroup_storage.rst to cover the new >>> map? We've >>> > > > > > > tried to keep some of the important details in there.. >>> > > > > > >>> > > > > > This might be a long shot, but is it possible to switch >>> completely to >>> > > > > > this new generic cgroup storage, and for programs that >>> attach to >>> > > > > > cgroups we can still do lookups/allocations during >>> attachment like we >>> > > > > > do today? IOW, maintain the current API for cgroup progs >>> but switch it >>> > > > > > to use this new map type instead. >>> > > > > > >>> > > > > > It feels like this map type is more generic and can be a >>> superset of >>> > > > > > the existing cgroup storage, but I feel like I am missing >>> something. >>> > > > > >>> > > > > I feel like the biggest issue is that the existing >>> > > > > bpf_get_local_storage helper is guaranteed to always return >>> non-null >>> > > > > and the verifier doesn't require the programs to do null >>> checks on it; >>> > > > > the new helper might return NULL making all existing programs >>> fail the >>> > > > > verifier. >>> > > > >>> > > > What I meant is, keep the old bpf_get_local_storage helper only >>> for >>> > > > cgroup-attached programs like we have today, and add a new generic >>> > > > bpf_cgroup_local_storage_get() helper. >>> > > > >>> > > > For cgroup-attached programs, make sure a cgroup storage entry is >>> > > > allocated and hooked to the helper on program attach time, to keep >>> > > > today's behavior constant. 
>>> > > > >>> > > > For other programs, the bpf_cgroup_local_storage_get() will do the >>> > > > normal lookup and allocate if necessary. >>> > > > >>> > > > Does this make any sense to you? >>> > > >>> > > But then you also need to somehow mark these to make sure it's not >>> > > possible to delete them as long as the program is >>> loaded/attached? Not >>> > > saying it's impossible, but it's a bit of a departure from the >>> > > existing common local storage framework used by inode/task; not sure >>> > > whether we want to pull all this complexity in there? But we can >>> > > definitely try if there is a wider agreement.. >>> > >>> > I agree that it's not ideal, but it feels like we are comparing two >>> > non-ideal options anyway, I am just throwing ideas around :) >> >>> I don't think it is a good idea to marry the new >>> BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE and the existing >>> BPF_MAP_TYPE_CGROUP_STORAGE in any way. The API is very different. >>> A few >>> have already been mentioned here. Delete is one. Storage creation >>> time is >>> another one. The map key is also different. Yes, maybe we can reuse >>> the >>> different key size concept in bpf_cgroup_storage_key in some way but >>> still >>> feel too much unnecessary quirks for the existing sk/inode/task storage >>> users to remember. >> >>> imo, it is better to keep them separate and have a different map-type. >>> Adding a map flag or using map extra will make it sounds like an >>> extension >>> which it is not. >> >> This part is the most confusing to me: >> >> BPF_MAP_TYPE_CGROUP_STORAGE bpf_get_local_storage >> BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE bpf_cgroup_local_storage_get >> >> The new helpers should probably drop 'local' name to match the >> task/inode ([0])? 
>> And we're left with: >> >> BPF_MAP_TYPE_CGROUP_STORAGE bpf_get_local_storage >> BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE bpf_cgroup_storage_get >> >> You read CGROUP_STORAGE via get_local_storage and >> you read CGROUP_LOCAL_STORAGE via cgroup_storage_get :-/ > > Yep, agree that it is not ideal :( I guess I need to add more documentation to explain the difference of old and new map regardless of the final names. > >> >> That's why I'm slightly tilting towards reusing the name. At least we can >> add a big DEPRECATED message for bpf_get_local_storage and that seems >> to be >> it? All those extra key sizes can also be deprecated, but I'm honestly >> not sure if anybody is using them. > > Reusing 'key_size == sizeof(int)' to mean new map type...hmm... I have > been thinking about it after your suggestion in another reply since it > can use the BPF_MAP_TYPE_CGROUP_STORAGE name. I wish the > BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE was given to the > bpf_get_local_storage() instead because it is a better name to describe > what it is doing. > > hmm.... However, this feels working like a map_flags or map_extra but in > a more hidden way. I am worry it will actually be more confusing and > also having usage surprises when there are quite many behavior > differences that this thread has already mentioned. That will be hard > for the user to reason those API differences just because of using a > different key_size. > > May be going back to revisit the naming a little bit. How about giving > a new and likely more correct 'BPF_MAP_TYPE_CGRP_LOCAL_STORAGE' name for > the existing bpf_get_local_storage() use. Then > > '#define BPF_MAP_TYPE_CGROUP_STORAGE BPF_MAP_TYPE_CGRP_LOCAL_STORAGE /* > depreciated by BPF_MAP_TYPE_CGRP_STORAGE */' in the uapi. > > The new cgroup storage uses a shorter name "cgrp", like > BPF_MAP_TYPE_CGRP_STORAGE and bpf_cgrp_storage_get()? This might work and the naming convention will be similar to existing sk/inode/task storage. 
Another alternative is to name the map as BPF_MAP_TYPE_CGROUP_STORAGE2 to indicate it is a different version of cgroup_storage map and the documentation should explain the difference clearly. This should avoid the possible confusion between BPF_MAP_TYPE_CGROUP_STORAGE and BPF_MAP_TYPE_CGRP_STORAGE. > >> >> But having a separate map also seems fine, as long as we have a patch to >> update the existing header documentation. (and mention in >> Documentation/bpf/map_cgroup_storage.rst that there is a replacement?) >> Current bpf_get_local_storage description is too vague; let's at least >> mention that it works only with BPF_MAP_TYPE_CGROUP_STORAGE. >> >> 0: >> https://lore.kernel.org/bpf/6ce7d490-f015-531f-3dbb-b6f7717f0590@meta.com/T/#mb2107250caa19a8d9ec3549a52f4a9698be99e33 >> >>> > > >>> > > > > There might be something else I don't remember at this point >>> (besides >>> > > > > that weird per-prog_type that we'd have to emulate as well).. >>> > > > >>> > > > Yeah there are things that will need to be emulated, but I feel >>> like >>> > > > we may end up with less confusing code (and less code in general). >> >> > ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs 2022-10-18 5:59 ` Yonghong Song @ 2022-10-18 17:08 ` sdf 2022-10-18 17:17 ` Alexei Starovoitov 0 siblings, 1 reply; 38+ messages in thread From: sdf @ 2022-10-18 17:08 UTC (permalink / raw) To: Yonghong Song Cc: Martin KaFai Lau, Yosry Ahmed, Yonghong Song, bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo On 10/17, Yonghong Song wrote: > On 10/17/22 5:52 PM, Martin KaFai Lau wrote: > > On 10/17/22 3:16 PM, sdf@google.com wrote: > > > On 10/17, Martin KaFai Lau wrote: > > > > On 10/17/22 12:11 PM, Yosry Ahmed wrote: > > > > > On Mon, Oct 17, 2022 at 12:07 PM Stanislav Fomichev > > > > <sdf@google.com> wrote: > > > > > > > > > > > > On Mon, Oct 17, 2022 at 11:47 AM Yosry Ahmed > > > > <yosryahmed@google.com> wrote: > > > > > > > > > > > > > > On Mon, Oct 17, 2022 at 11:43 AM Stanislav Fomichev > > > > <sdf@google.com> wrote: > > > > > > > > > > > > > > > > On Mon, Oct 17, 2022 at 11:26 AM Yosry Ahmed > > > > <yosryahmed@google.com> wrote: > > > > > > > > > > > > > > > > > > On Mon, Oct 17, 2022 at 11:02 AM <sdf@google.com> wrote: > > > > > > > > > > > > > > > > > > > > On 10/13, Yonghong Song wrote: > > > > > > > > > > > Similar to sk/inode/task storage, implement > > > > similar cgroup local storage. > > > > > > > > > > > > > > > > > > > > > There already exists a local storage > > > > implementation for cgroup-attached > > > > > > > > > > > bpf programs. See map type > > > > BPF_MAP_TYPE_CGROUP_STORAGE and helper > > > > > > > > > > > bpf_get_local_storage(). But there are use cases > > > > such that non-cgroup > > > > > > > > > > > attached bpf progs wants to access cgroup local > > > > storage data. For example, > > > > > > > > > > > tc egress prog has access to sk and cgroup. 
It is > > > > possible to use > > > > > > > > > > > sk local storage to emulate cgroup local storage > > > > by storing data in > > > > > > > > > > > socket. > > > > > > > > > > > But this is a waste as it could be lots of sockets > > > > belonging to a > > > > > > > > > > > particular > > > > > > > > > > > cgroup. Alternatively, a separate map can be > > > > created with cgroup id as > > > > > > > > > > > the key. > > > > > > > > > > > But this will introduce additional overhead to > > > > manipulate the new map. > > > > > > > > > > > A cgroup local storage, similar to existing > > > > sk/inode/task storage, > > > > > > > > > > > should help for this use case. > > > > > > > > > > > > > > > > > > > > > The life-cycle of storage is managed with the > > > > life-cycle of the > > > > > > > > > > > cgroup struct. i.e. the storage is destroyed > > > > along with the owning cgroup > > > > > > > > > > > with a callback to the bpf_cgroup_storage_free > > > > when cgroup itself > > > > > > > > > > > is deleted. > > > > > > > > > > > > > > > > > > > > > The userspace map operations can be done by using > > > > a cgroup fd as a key > > > > > > > > > > > passed to the lookup, update and delete operations. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > [..] > > > > > > > > > > > > > > > > > > > > > Since map name BPF_MAP_TYPE_CGROUP_STORAGE has > > > > been used for old cgroup > > > > > > > > > > > local > > > > > > > > > > > storage support, the new map name > > > > BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE is > > > > > > > > > > > used > > > > > > > > > > > for cgroup storage available to > > > > non-cgroup-attached bpf programs. The two > > > > > > > > > > > helpers are named as bpf_cgroup_local_storage_get() > and > > > > > > > > > > > bpf_cgroup_local_storage_delete(). 
> > > > > > > > > > > > > > > > > > > > Have you considered doing something similar to > > > > 7d9c3427894f ("bpf: Make > > > > > > > > > > cgroup storages shared between programs on the same > > > > cgroup") where > > > > > > > > > > the map changes its behavior depending on the key > > > > size (see key_size checks > > > > > > > > > > in cgroup_storage_map_alloc)? Looks like sizeof(int) > > > > for fd still > > > > > > > > > > can be used so we can, in theory, reuse the name.. > > > > > > > > > > > > > > > > > > > > Pros: > > > > > > > > > > - no need for a new map name > > > > > > > > > > > > > > > > > > > > Cons: > > > > > > > > > > - existing BPF_MAP_TYPE_CGROUP_STORAGE is already > > > > messy; might be not a > > > > > > > > > >    good idea to add more stuff to it? > > > > > > > > > > > > > > > > > > > > But, for the very least, should we also extend > > > > > > > > > > Documentation/bpf/map_cgroup_storage.rst to cover > > > > the new map? We've > > > > > > > > > > tried to keep some of the important details in there.. > > > > > > > > > > > > > > > > > > This might be a long shot, but is it possible to > > > > switch completely to > > > > > > > > > this new generic cgroup storage, and for programs that > > > > attach to > > > > > > > > > cgroups we can still do lookups/allocations during > > > > attachment like we > > > > > > > > > do today? IOW, maintain the current API for cgroup > > > > progs but switch it > > > > > > > > > to use this new map type instead. > > > > > > > > > > > > > > > > > > It feels like this map type is more generic and can be > > > > a superset of > > > > > > > > > the existing cgroup storage, but I feel like I am > > > > missing something. 
> > > > > > > > I feel like the biggest issue is that the existing > > > > > > > > bpf_get_local_storage helper is guaranteed to always > > > > return non-null > > > > > > > > and the verifier doesn't require the programs to do null > > > > checks on it; > > > > > > > > the new helper might return NULL making all existing > > > > programs fail the > > > > > > > > verifier. > > > > > > > > > > > > > > What I meant is, keep the old bpf_get_local_storage helper > > > > only for > > > > > > > cgroup-attached programs like we have today, and add a new > generic > > > > > > > bpf_cgroup_local_storage_get() helper. > > > > > > > > > > > > > > For cgroup-attached programs, make sure a cgroup storage > entry is > > > > > > > allocated and hooked to the helper on program attach time, to > keep > > > > > > > today's behavior constant. > > > > > > > > > > > > > > For other programs, the bpf_cgroup_local_storage_get() will > do the > > > > > > > normal lookup and allocate if necessary. > > > > > > > > > > > > > > Does this make any sense to you? > > > > > > > > > > > > But then you also need to somehow mark these to make sure it's > not > > > > > > possible to delete them as long as the program is > > > > loaded/attached? Not > > > > > > saying it's impossible, but it's a bit of a departure from the > > > > > > existing common local storage framework used by inode/task; not > sure > > > > > > whether we want to pull all this complexity in there? But we can > > > > > > definitely try if there is a wider agreement.. > > > > > > > > > > I agree that it's not ideal, but it feels like we are comparing > two > > > > > non-ideal options anyway, I am just throwing ideas around :) > > > > > > > I don't think it is a good idea to marry the new > > > > BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE and the existing > > > > BPF_MAP_TYPE_CGROUP_STORAGE in any way. The API is very > > > > different. 
A few > > > > have already been mentioned here. Delete is one. Storage > > > > creation time is > > > > another one. The map key is also different. Yes, maybe we can > > > > reuse the > > > > different key size concept in bpf_cgroup_storage_key in some way > > > > but still > > > > feel too much unnecessary quirks for the existing sk/inode/task > storage > > > > users to remember. > > > > > > > imo, it is better to keep them separate and have a different > map-type. > > > > Adding a map flag or using map extra will make it sounds like an > > > > extension > > > > which it is not. > > > > > > This part is the most confusing to me: > > > > > > BPF_MAP_TYPE_CGROUP_STORAGE       bpf_get_local_storage > > > BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE bpf_cgroup_local_storage_get > > > > > > The new helpers should probably drop 'local' name to match the > > > task/inode ([0])? > > > And we're left with: > > > > > > BPF_MAP_TYPE_CGROUP_STORAGE       bpf_get_local_storage > > > BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE bpf_cgroup_storage_get > > > > > > You read CGROUP_STORAGE via get_local_storage and > > > you read CGROUP_LOCAL_STORAGE via cgroup_storage_get :-/ > > > > Yep, agree that it is not ideal :( > I guess I need to add more documentation to explain the difference > of old and new map regardless of the final names. > > > > > > > > That's why I'm slightly tilting towards reusing the name. At least we > can > > > add a big DEPRECATED message for bpf_get_local_storage and that > > > seems to be > > > it? All those extra key sizes can also be deprecated, but I'm honestly > > > not sure if anybody is using them. > > > > Reusing 'key_size == sizeof(int)' to mean new map type...hmm... I have > > been thinking about it after your suggestion in another reply since it > > can use the BPF_MAP_TYPE_CGROUP_STORAGE name. I wish the > > BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE was given to the > > bpf_get_local_storage() instead because it is a better name to describe > > what it is doing. 
> > > > hmm.... However, this feels working like a map_flags or map_extra but in > > a more hidden way. I am worry it will actually be more confusing and > > also having usage surprises when there are quite many behavior > > differences that this thread has already mentioned. That will be hard > > for the user to reason those API differences just because of using a > > different key_size. > > > > May be going back to revisit the naming a little bit. How about giving > > a new and likely more correct 'BPF_MAP_TYPE_CGRP_LOCAL_STORAGE' name for > > the existing bpf_get_local_storage() use. Then > > > > '#define BPF_MAP_TYPE_CGROUP_STORAGE BPF_MAP_TYPE_CGRP_LOCAL_STORAGE /* > > depreciated by BPF_MAP_TYPE_CGRP_STORAGE */' in the uapi. > > > > The new cgroup storage uses a shorter name "cgrp", like > > BPF_MAP_TYPE_CGRP_STORAGE and bpf_cgrp_storage_get()? > This might work and the naming convention will be similar to > existing sk/inode/task storage. +1, CGRP_STORAGE sounds good! > Another alternative is to name the map as > BPF_MAP_TYPE_CGROUP_STORAGE2 > to indicate it is a different version of cgroup_storage map > and the documentation should explain the difference clearly. > This should avoid the possible confusion between > BPF_MAP_TYPE_CGROUP_STORAGE and BPF_MAP_TYPE_CGRP_STORAGE. > > > > > > > > But having a separate map also seems fine, as long as we have a patch > to > > > update the existing header documentation. (and mention in > > > Documentation/bpf/map_cgroup_storage.rst that there is a replacement?) > > > Current bpf_get_local_storage description is too vague; let's at least > > > mention that it works only with BPF_MAP_TYPE_CGROUP_STORAGE. 
> > > > > > 0: > https://lore.kernel.org/bpf/6ce7d490-f015-531f-3dbb-b6f7717f0590@meta.com/T/#mb2107250caa19a8d9ec3549a52f4a9698be99e33 > > > > > > > > > > > > > > > > > There might be something else I don't remember at this > > > > point (besides > > > > > > > > that weird per-prog_type that we'd have to emulate as > well).. > > > > > > > > > > > > > > Yeah there are things that will need to be emulated, but I > > > > feel like > > > > > > > we may end up with less confusing code (and less code in > general). > > > > > > > > ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs 2022-10-18 17:08 ` sdf @ 2022-10-18 17:17 ` Alexei Starovoitov 2022-10-18 18:08 ` Martin KaFai Lau 2022-10-18 23:12 ` Andrii Nakryiko 0 siblings, 2 replies; 38+ messages in thread From: Alexei Starovoitov @ 2022-10-18 17:17 UTC (permalink / raw) To: Stanislav Fomichev Cc: Yonghong Song, Martin KaFai Lau, Yosry Ahmed, Yonghong Song, bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, Kernel Team, KP Singh, Martin KaFai Lau, Tejun Heo On Tue, Oct 18, 2022 at 10:08 AM <sdf@google.com> wrote: > > > > > > '#define BPF_MAP_TYPE_CGROUP_STORAGE BPF_MAP_TYPE_CGRP_LOCAL_STORAGE /* > > > depreciated by BPF_MAP_TYPE_CGRP_STORAGE */' in the uapi. > > > > > > The new cgroup storage uses a shorter name "cgrp", like > > > BPF_MAP_TYPE_CGRP_STORAGE and bpf_cgrp_storage_get()? > > > This might work and the naming convention will be similar to > > existing sk/inode/task storage. > > +1, CGRP_STORAGE sounds good! +1 from me as well. Something like this ? diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 17f61338f8f8..13dcb2418847 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -922,7 +922,8 @@ enum bpf_map_type { BPF_MAP_TYPE_CPUMAP, BPF_MAP_TYPE_XSKMAP, BPF_MAP_TYPE_SOCKHASH, - BPF_MAP_TYPE_CGROUP_STORAGE, + BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED, + BPF_MAP_TYPE_CGROUP_STORAGE = BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED, BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE, BPF_MAP_TYPE_QUEUE, @@ -935,6 +936,7 @@ enum bpf_map_type { BPF_MAP_TYPE_TASK_STORAGE, BPF_MAP_TYPE_BLOOM_FILTER, BPF_MAP_TYPE_USER_RINGBUF, + BPF_MAP_TYPE_CGRP_STORAGE, }; What are we going to do with BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE ? Probably should come up with a replacement as well? ^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs 2022-10-18 17:17 ` Alexei Starovoitov @ 2022-10-18 18:08 ` Martin KaFai Lau 2022-10-18 18:11 ` Yosry Ahmed 2022-10-18 23:12 ` Andrii Nakryiko 1 sibling, 1 reply; 38+ messages in thread From: Martin KaFai Lau @ 2022-10-18 18:08 UTC (permalink / raw) To: Alexei Starovoitov Cc: Yonghong Song, Yosry Ahmed, Yonghong Song, bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, Kernel Team, KP Singh, Martin KaFai Lau, Tejun Heo, Stanislav Fomichev On 10/18/22 10:17 AM, Alexei Starovoitov wrote: > On Tue, Oct 18, 2022 at 10:08 AM <sdf@google.com> wrote: >>>> >>>> '#define BPF_MAP_TYPE_CGROUP_STORAGE BPF_MAP_TYPE_CGRP_LOCAL_STORAGE /* >>>> depreciated by BPF_MAP_TYPE_CGRP_STORAGE */' in the uapi. >>>> >>>> The new cgroup storage uses a shorter name "cgrp", like >>>> BPF_MAP_TYPE_CGRP_STORAGE and bpf_cgrp_storage_get()? >> >>> This might work and the naming convention will be similar to >>> existing sk/inode/task storage. >> >> +1, CGRP_STORAGE sounds good! > > +1 from me as well. > > Something like this ? > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > index 17f61338f8f8..13dcb2418847 100644 > --- a/include/uapi/linux/bpf.h > +++ b/include/uapi/linux/bpf.h > @@ -922,7 +922,8 @@ enum bpf_map_type { > BPF_MAP_TYPE_CPUMAP, > BPF_MAP_TYPE_XSKMAP, > BPF_MAP_TYPE_SOCKHASH, > - BPF_MAP_TYPE_CGROUP_STORAGE, > + BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED, > + BPF_MAP_TYPE_CGROUP_STORAGE = BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED, +1 > BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, > BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE, > BPF_MAP_TYPE_QUEUE, > @@ -935,6 +936,7 @@ enum bpf_map_type { > BPF_MAP_TYPE_TASK_STORAGE, > BPF_MAP_TYPE_BLOOM_FILTER, > BPF_MAP_TYPE_USER_RINGBUF, > + BPF_MAP_TYPE_CGRP_STORAGE, > }; > > What are we going to do with BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE ? > Probably should come up with a replacement as well? Yeah, need to come up with a percpu answer for it. 
The percpu usage has never come up on the sk storage and also the later task/inode storage. Or the user is just getting by with an array like map's value. Maybe the bpf prog can call bpf_mem_alloc() to alloc the percpu memory in the future and then store it as the kptr in the BPF_MAP_TYPE_CGRP_STORAGE? ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs 2022-10-18 18:08 ` Martin KaFai Lau @ 2022-10-18 18:11 ` Yosry Ahmed 2022-10-18 18:26 ` Yonghong Song 0 siblings, 1 reply; 38+ messages in thread From: Yosry Ahmed @ 2022-10-18 18:11 UTC (permalink / raw) To: Martin KaFai Lau Cc: Alexei Starovoitov, Yonghong Song, Yonghong Song, bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, Kernel Team, KP Singh, Martin KaFai Lau, Tejun Heo, Stanislav Fomichev On Tue, Oct 18, 2022 at 11:08 AM Martin KaFai Lau <martin.lau@linux.dev> wrote: > > On 10/18/22 10:17 AM, Alexei Starovoitov wrote: > > On Tue, Oct 18, 2022 at 10:08 AM <sdf@google.com> wrote: > >>>> > >>>> '#define BPF_MAP_TYPE_CGROUP_STORAGE BPF_MAP_TYPE_CGRP_LOCAL_STORAGE /* > >>>> depreciated by BPF_MAP_TYPE_CGRP_STORAGE */' in the uapi. > >>>> > >>>> The new cgroup storage uses a shorter name "cgrp", like > >>>> BPF_MAP_TYPE_CGRP_STORAGE and bpf_cgrp_storage_get()? > >> > >>> This might work and the naming convention will be similar to > >>> existing sk/inode/task storage. > >> > >> +1, CGRP_STORAGE sounds good! > > > > +1 from me as well. > > > > Something like this ? 
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > > index 17f61338f8f8..13dcb2418847 100644 > > --- a/include/uapi/linux/bpf.h > > +++ b/include/uapi/linux/bpf.h > > @@ -922,7 +922,8 @@ enum bpf_map_type { > > BPF_MAP_TYPE_CPUMAP, > > BPF_MAP_TYPE_XSKMAP, > > BPF_MAP_TYPE_SOCKHASH, > > - BPF_MAP_TYPE_CGROUP_STORAGE, > > + BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED, > > + BPF_MAP_TYPE_CGROUP_STORAGE = BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED, > > +1 > > > BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, > > BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE, > > BPF_MAP_TYPE_QUEUE, > > @@ -935,6 +936,7 @@ enum bpf_map_type { > > BPF_MAP_TYPE_TASK_STORAGE, > > BPF_MAP_TYPE_BLOOM_FILTER, > > BPF_MAP_TYPE_USER_RINGBUF, > > + BPF_MAP_TYPE_CGRP_STORAGE, > > }; > > > > What are we going to do with BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE ? > > Probably should come up with a replacement as well? > > Yeah, need to come up with a percpu answer for it. The percpu usage has never > come up on the sk storage and also the later task/inode storage. or the user is > just getting by with an array like map's value. > > May be the bpf prog can call bpf_mem_alloc() to alloc the percpu memory in the > future and then store it as the kptr in the BPF_MAP_TYPE_CGRP_STORAGE? A percpu cgroup storage would be very beneficial for cgroup statistics collection, things like the selftest in tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c currently uses a percpu hashmap indexed by cgroup id, so using a percpu cgroup storage instead would be a nice upgrade. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs 2022-10-18 18:11 ` Yosry Ahmed @ 2022-10-18 18:26 ` Yonghong Song 0 siblings, 0 replies; 38+ messages in thread From: Yonghong Song @ 2022-10-18 18:26 UTC (permalink / raw) To: Yosry Ahmed, Martin KaFai Lau Cc: Alexei Starovoitov, Yonghong Song, bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, Kernel Team, KP Singh, Martin KaFai Lau, Tejun Heo, Stanislav Fomichev On 10/18/22 11:11 AM, Yosry Ahmed wrote: > On Tue, Oct 18, 2022 at 11:08 AM Martin KaFai Lau <martin.lau@linux.dev> wrote: >> >> On 10/18/22 10:17 AM, Alexei Starovoitov wrote: >>> On Tue, Oct 18, 2022 at 10:08 AM <sdf@google.com> wrote: >>>>>> >>>>>> '#define BPF_MAP_TYPE_CGROUP_STORAGE BPF_MAP_TYPE_CGRP_LOCAL_STORAGE /* >>>>>> depreciated by BPF_MAP_TYPE_CGRP_STORAGE */' in the uapi. >>>>>> >>>>>> The new cgroup storage uses a shorter name "cgrp", like >>>>>> BPF_MAP_TYPE_CGRP_STORAGE and bpf_cgrp_storage_get()? >>>> >>>>> This might work and the naming convention will be similar to >>>>> existing sk/inode/task storage. >>>> >>>> +1, CGRP_STORAGE sounds good! >>> >>> +1 from me as well. >>> >>> Something like this ? >>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h >>> index 17f61338f8f8..13dcb2418847 100644 >>> --- a/include/uapi/linux/bpf.h >>> +++ b/include/uapi/linux/bpf.h >>> @@ -922,7 +922,8 @@ enum bpf_map_type { >>> BPF_MAP_TYPE_CPUMAP, >>> BPF_MAP_TYPE_XSKMAP, >>> BPF_MAP_TYPE_SOCKHASH, >>> - BPF_MAP_TYPE_CGROUP_STORAGE, >>> + BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED, >>> + BPF_MAP_TYPE_CGROUP_STORAGE = BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED, >> >> +1 >> >>> BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, >>> BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE, >>> BPF_MAP_TYPE_QUEUE, >>> @@ -935,6 +936,7 @@ enum bpf_map_type { >>> BPF_MAP_TYPE_TASK_STORAGE, >>> BPF_MAP_TYPE_BLOOM_FILTER, >>> BPF_MAP_TYPE_USER_RINGBUF, >>> + BPF_MAP_TYPE_CGRP_STORAGE, >>> }; Sounds good to me. 
Will do this in the next revision. >>> >>> What are we going to do with BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE ? >>> Probably should come up with a replacement as well? >> >> Yeah, need to come up with a percpu answer for it. The percpu usage has never >> come up on the sk storage and also the later task/inode storage. or the user is >> just getting by with an array like map's value. >> >> May be the bpf prog can call bpf_mem_alloc() to alloc the percpu memory in the >> future and then store it as the kptr in the BPF_MAP_TYPE_CGRP_STORAGE? > > A percpu cgroup storage would be very beneficial for cgroup statistics > collection, things like the selftest in > tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c > currently uses a percpu hashmap indexed by cgroup id, so using a > percpu cgroup storage instead would be a nice upgrade. Indeed, agree. For cgroup storage, we could have a per-cpu version for the new mechanism so it can replace the old one as well. Will look into this after non per-cpu version is done. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs 2022-10-18 17:17 ` Alexei Starovoitov 2022-10-18 18:08 ` Martin KaFai Lau @ 2022-10-18 23:12 ` Andrii Nakryiko 1 sibling, 0 replies; 38+ messages in thread From: Andrii Nakryiko @ 2022-10-18 23:12 UTC (permalink / raw) To: Alexei Starovoitov Cc: Stanislav Fomichev, Yonghong Song, Martin KaFai Lau, Yosry Ahmed, Yonghong Song, bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, Kernel Team, KP Singh, Martin KaFai Lau, Tejun Heo On Tue, Oct 18, 2022 at 10:18 AM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Tue, Oct 18, 2022 at 10:08 AM <sdf@google.com> wrote: > > > > > > > > '#define BPF_MAP_TYPE_CGROUP_STORAGE BPF_MAP_TYPE_CGRP_LOCAL_STORAGE /* > > > > depreciated by BPF_MAP_TYPE_CGRP_STORAGE */' in the uapi. > > > > > > > > The new cgroup storage uses a shorter name "cgrp", like > > > > BPF_MAP_TYPE_CGRP_STORAGE and bpf_cgrp_storage_get()? > > > > > This might work and the naming convention will be similar to > > > existing sk/inode/task storage. > > > > +1, CGRP_STORAGE sounds good! > > +1 from me as well. it's totally bikeshedding zone :) but isn't CG_STORAGE just as recognizable but easier to mentally read as well? Like SK for socket, instead of SCKT > > Something like this ? 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > index 17f61338f8f8..13dcb2418847 100644 > --- a/include/uapi/linux/bpf.h > +++ b/include/uapi/linux/bpf.h > @@ -922,7 +922,8 @@ enum bpf_map_type { > BPF_MAP_TYPE_CPUMAP, > BPF_MAP_TYPE_XSKMAP, > BPF_MAP_TYPE_SOCKHASH, > - BPF_MAP_TYPE_CGROUP_STORAGE, > + BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED, > + BPF_MAP_TYPE_CGROUP_STORAGE = BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED, > BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, > BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE, > BPF_MAP_TYPE_QUEUE, > @@ -935,6 +936,7 @@ enum bpf_map_type { > BPF_MAP_TYPE_TASK_STORAGE, > BPF_MAP_TYPE_BLOOM_FILTER, > BPF_MAP_TYPE_USER_RINGBUF, > + BPF_MAP_TYPE_CGRP_STORAGE, > }; > > What are we going to do with BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE ? > Probably should come up with a replacement as well? ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs 2022-10-17 18:47 ` Yosry Ahmed 2022-10-17 19:07 ` Stanislav Fomichev @ 2022-10-17 20:15 ` Yonghong Song 2022-10-17 20:18 ` Yosry Ahmed 1 sibling, 1 reply; 38+ messages in thread From: Yonghong Song @ 2022-10-17 20:15 UTC (permalink / raw) To: Yosry Ahmed, Stanislav Fomichev Cc: Yonghong Song, bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo On 10/17/22 11:47 AM, Yosry Ahmed wrote: > On Mon, Oct 17, 2022 at 11:43 AM Stanislav Fomichev <sdf@google.com> wrote: >> >> On Mon, Oct 17, 2022 at 11:26 AM Yosry Ahmed <yosryahmed@google.com> wrote: >>> >>> On Mon, Oct 17, 2022 at 11:02 AM <sdf@google.com> wrote: >>>> >>>> On 10/13, Yonghong Song wrote: >>>>> Similar to sk/inode/task storage, implement similar cgroup local storage. >>>> >>>>> There already exists a local storage implementation for cgroup-attached >>>>> bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper >>>>> bpf_get_local_storage(). But there are use cases such that non-cgroup >>>>> attached bpf progs wants to access cgroup local storage data. For example, >>>>> tc egress prog has access to sk and cgroup. It is possible to use >>>>> sk local storage to emulate cgroup local storage by storing data in >>>>> socket. >>>>> But this is a waste as it could be lots of sockets belonging to a >>>>> particular >>>>> cgroup. Alternatively, a separate map can be created with cgroup id as >>>>> the key. >>>>> But this will introduce additional overhead to manipulate the new map. >>>>> A cgroup local storage, similar to existing sk/inode/task storage, >>>>> should help for this use case. >>>> >>>>> The life-cycle of storage is managed with the life-cycle of the >>>>> cgroup struct. i.e. the storage is destroyed along with the owning cgroup >>>>> with a callback to the bpf_cgroup_storage_free when cgroup itself >>>>> is deleted. 
>>>> >>>>> The userspace map operations can be done by using a cgroup fd as a key >>>>> passed to the lookup, update and delete operations. >>>> >>>> >>>> [..] >>>> >>>>> Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old cgroup >>>>> local >>>>> storage support, the new map name BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE is >>>>> used >>>>> for cgroup storage available to non-cgroup-attached bpf programs. The two >>>>> helpers are named as bpf_cgroup_local_storage_get() and >>>>> bpf_cgroup_local_storage_delete(). >>>> >>>> Have you considered doing something similar to 7d9c3427894f ("bpf: Make >>>> cgroup storages shared between programs on the same cgroup") where >>>> the map changes its behavior depending on the key size (see key_size checks >>>> in cgroup_storage_map_alloc)? Looks like sizeof(int) for fd still >>>> can be used so we can, in theory, reuse the name.. >>>> >>>> Pros: >>>> - no need for a new map name >>>> >>>> Cons: >>>> - existing BPF_MAP_TYPE_CGROUP_STORAGE is already messy; might be not a >>>> good idea to add more stuff to it? >>>> >>>> But, for the very least, should we also extend >>>> Documentation/bpf/map_cgroup_storage.rst to cover the new map? We've >>>> tried to keep some of the important details in there.. >>> >>> This might be a long shot, but is it possible to switch completely to >>> this new generic cgroup storage, and for programs that attach to >>> cgroups we can still do lookups/allocations during attachment like we >>> do today? IOW, maintain the current API for cgroup progs but switch it >>> to use this new map type instead. >>> >>> It feels like this map type is more generic and can be a superset of >>> the existing cgroup storage, but I feel like I am missing something. 
>> >> I feel like the biggest issue is that the existing >> bpf_get_local_storage helper is guaranteed to always return non-null >> and the verifier doesn't require the programs to do null checks on it; >> the new helper might return NULL making all existing programs fail the >> verifier. > > What I meant is, keep the old bpf_get_local_storage helper only for > cgroup-attached programs like we have today, and add a new generic > bpf_cgroup_local_storage_get() helper. > > For cgroup-attached programs, make sure a cgroup storage entry is > allocated and hooked to the helper on program attach time, to keep > today's behavior constant. > > For other programs, the bpf_cgroup_local_storage_get() will do the > normal lookup and allocate if necessary. > > Does this make any sense to you? Right. This is what I plan to do. The map will add a flag to distinguish the old and new behavior. > >> >> There might be something else I don't remember at this point (besides >> that weird per-prog_type that we'd have to emulate as well).. > > Yeah there are things that will need to be emulated, but I feel like > we may end up with less confusing code (and less code in general). > >> >>>> >>>>> Signed-off-by: Yonghong Song <yhs@fb.com> >>>>> --- >>>>> include/linux/bpf.h | 3 + >>>>> include/linux/bpf_types.h | 1 + >>>>> include/linux/cgroup-defs.h | 4 + >>>>> include/uapi/linux/bpf.h | 39 +++++ >>>>> kernel/bpf/Makefile | 2 +- >>>>> kernel/bpf/bpf_cgroup_storage.c | 280 ++++++++++++++++++++++++++++++++ >>>>> kernel/bpf/helpers.c | 6 + >>>>> kernel/bpf/syscall.c | 3 +- >>>>> kernel/bpf/verifier.c | 14 +- >>>>> kernel/cgroup/cgroup.c | 4 + >>>>> kernel/trace/bpf_trace.c | 4 + >>>>> scripts/bpf_doc.py | 2 + >>>>> tools/include/uapi/linux/bpf.h | 39 +++++ >>>>> 13 files changed, 398 insertions(+), 3 deletions(-) >>>>> create mode 100644 kernel/bpf/bpf_cgroup_storage.c [...] ^ permalink raw reply [flat|nested] 38+ messages in thread
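[Editorial note: the verifier concern Stanislav raises above — the old helper's non-NULL guarantee versus the new helper's possible NULL return — can be modeled in plain C. This is a hedged sketch of the two calling contracts only; the function names and shapes are hypothetical and do not reflect real kernel or BPF helper signatures:]

```c
#include <assert.h>
#include <stddef.h>

struct storage_sketch { long counter; };

/* Old-style contract: storage is preallocated at attach time, so the
 * caller may dereference the result without a NULL check (and the
 * verifier never demanded one). */
static struct storage_sketch preallocated;
static struct storage_sketch *get_local_storage_old(void)
{
	return &preallocated;	/* never NULL by construction */
}

/* New-style contract: storage is created lazily, allocation may fail,
 * so every caller must check for NULL before dereferencing. */
static struct storage_sketch *cgroup_local_storage_get(int alloc_ok)
{
	static struct storage_sketch lazy;
	return alloc_ok ? &lazy : NULL;
}

static long bump(int alloc_ok)
{
	struct storage_sketch *s = cgroup_local_storage_get(alloc_ok);

	if (!s)			/* mandatory NULL check with the new helper */
		return -1;
	return ++s->counter;
}
```

Silently switching existing programs to the second contract would make them fail verification, which is why the thread converges on keeping both helpers.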
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs 2022-10-17 20:15 ` Yonghong Song @ 2022-10-17 20:18 ` Yosry Ahmed 0 siblings, 0 replies; 38+ messages in thread From: Yosry Ahmed @ 2022-10-17 20:18 UTC (permalink / raw) To: Yonghong Song Cc: Stanislav Fomichev, Yonghong Song, bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo On Mon, Oct 17, 2022 at 1:15 PM Yonghong Song <yhs@meta.com> wrote: > > > > On 10/17/22 11:47 AM, Yosry Ahmed wrote: > > On Mon, Oct 17, 2022 at 11:43 AM Stanislav Fomichev <sdf@google.com> wrote: > >> > >> On Mon, Oct 17, 2022 at 11:26 AM Yosry Ahmed <yosryahmed@google.com> wrote: > >>> > >>> On Mon, Oct 17, 2022 at 11:02 AM <sdf@google.com> wrote: > >>>> > >>>> On 10/13, Yonghong Song wrote: > >>>>> Similar to sk/inode/task storage, implement similar cgroup local storage. > >>>> > >>>>> There already exists a local storage implementation for cgroup-attached > >>>>> bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper > >>>>> bpf_get_local_storage(). But there are use cases such that non-cgroup > >>>>> attached bpf progs wants to access cgroup local storage data. For example, > >>>>> tc egress prog has access to sk and cgroup. It is possible to use > >>>>> sk local storage to emulate cgroup local storage by storing data in > >>>>> socket. > >>>>> But this is a waste as it could be lots of sockets belonging to a > >>>>> particular > >>>>> cgroup. Alternatively, a separate map can be created with cgroup id as > >>>>> the key. > >>>>> But this will introduce additional overhead to manipulate the new map. > >>>>> A cgroup local storage, similar to existing sk/inode/task storage, > >>>>> should help for this use case. > >>>> > >>>>> The life-cycle of storage is managed with the life-cycle of the > >>>>> cgroup struct. i.e. 
the storage is destroyed along with the owning cgroup > >>>>> with a callback to the bpf_cgroup_storage_free when cgroup itself > >>>>> is deleted. > >>>> > >>>>> The userspace map operations can be done by using a cgroup fd as a key > >>>>> passed to the lookup, update and delete operations. > >>>> > >>>> > >>>> [..] > >>>> > >>>>> Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old cgroup > >>>>> local > >>>>> storage support, the new map name BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE is > >>>>> used > >>>>> for cgroup storage available to non-cgroup-attached bpf programs. The two > >>>>> helpers are named as bpf_cgroup_local_storage_get() and > >>>>> bpf_cgroup_local_storage_delete(). > >>>> > >>>> Have you considered doing something similar to 7d9c3427894f ("bpf: Make > >>>> cgroup storages shared between programs on the same cgroup") where > >>>> the map changes its behavior depending on the key size (see key_size checks > >>>> in cgroup_storage_map_alloc)? Looks like sizeof(int) for fd still > >>>> can be used so we can, in theory, reuse the name.. > >>>> > >>>> Pros: > >>>> - no need for a new map name > >>>> > >>>> Cons: > >>>> - existing BPF_MAP_TYPE_CGROUP_STORAGE is already messy; might be not a > >>>> good idea to add more stuff to it? > >>>> > >>>> But, for the very least, should we also extend > >>>> Documentation/bpf/map_cgroup_storage.rst to cover the new map? We've > >>>> tried to keep some of the important details in there.. > >>> > >>> This might be a long shot, but is it possible to switch completely to > >>> this new generic cgroup storage, and for programs that attach to > >>> cgroups we can still do lookups/allocations during attachment like we > >>> do today? IOW, maintain the current API for cgroup progs but switch it > >>> to use this new map type instead. > >>> > >>> It feels like this map type is more generic and can be a superset of > >>> the existing cgroup storage, but I feel like I am missing something. 
> >> > >> I feel like the biggest issue is that the existing > >> bpf_get_local_storage helper is guaranteed to always return non-null > >> and the verifier doesn't require the programs to do null checks on it; > >> the new helper might return NULL making all existing programs fail the > >> verifier. > > > > What I meant is, keep the old bpf_get_local_storage helper only for > > cgroup-attached programs like we have today, and add a new generic > > bpf_cgroup_local_storage_get() helper. > > > > For cgroup-attached programs, make sure a cgroup storage entry is > > allocated and hooked to the helper on program attach time, to keep > > today's behavior constant. > > > > For other programs, the bpf_cgroup_local_storage_get() will do the > > normal lookup and allocate if necessary. > > > > Does this make any sense to you? > > Right. This is what I plan to do. The map will add a flag to > distinguish the old and new behavior. > This might not make any sense, but is this doable without a flag? Basically extend the new map type so that it has some special behaviors for cgroup attached programs (allocate memory on program attach, bpf_get_local_storage() automatically gets entry for the attached cgroup, etc). > > > >> > >> There might be something else I don't remember at this point (besides > >> that weird per-prog_type that we'd have to emulate as well).. > > > > Yeah there are things that will need to be emulated, but I feel like > > we may end up with less confusing code (and less code in general). 
> > > >> > >>>> > >>>>> Signed-off-by: Yonghong Song <yhs@fb.com> > >>>>> --- > >>>>> include/linux/bpf.h | 3 + > >>>>> include/linux/bpf_types.h | 1 + > >>>>> include/linux/cgroup-defs.h | 4 + > >>>>> include/uapi/linux/bpf.h | 39 +++++ > >>>>> kernel/bpf/Makefile | 2 +- > >>>>> kernel/bpf/bpf_cgroup_storage.c | 280 ++++++++++++++++++++++++++++++++ > >>>>> kernel/bpf/helpers.c | 6 + > >>>>> kernel/bpf/syscall.c | 3 +- > >>>>> kernel/bpf/verifier.c | 14 +- > >>>>> kernel/cgroup/cgroup.c | 4 + > >>>>> kernel/trace/bpf_trace.c | 4 + > >>>>> scripts/bpf_doc.py | 2 + > >>>>> tools/include/uapi/linux/bpf.h | 39 +++++ > >>>>> 13 files changed, 398 insertions(+), 3 deletions(-) > >>>>> create mode 100644 kernel/bpf/bpf_cgroup_storage.c > [...] ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs 2022-10-17 18:43 ` Stanislav Fomichev 2022-10-17 18:47 ` Yosry Ahmed @ 2022-10-17 20:13 ` Yonghong Song 1 sibling, 0 replies; 38+ messages in thread From: Yonghong Song @ 2022-10-17 20:13 UTC (permalink / raw) To: Stanislav Fomichev, Yosry Ahmed Cc: Yonghong Song, bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo On 10/17/22 11:43 AM, Stanislav Fomichev wrote: > On Mon, Oct 17, 2022 at 11:26 AM Yosry Ahmed <yosryahmed@google.com> wrote: >> >> On Mon, Oct 17, 2022 at 11:02 AM <sdf@google.com> wrote: >>> >>> On 10/13, Yonghong Song wrote: >>>> Similar to sk/inode/task storage, implement similar cgroup local storage. >>> >>>> There already exists a local storage implementation for cgroup-attached >>>> bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper >>>> bpf_get_local_storage(). But there are use cases such that non-cgroup >>>> attached bpf progs wants to access cgroup local storage data. For example, >>>> tc egress prog has access to sk and cgroup. It is possible to use >>>> sk local storage to emulate cgroup local storage by storing data in >>>> socket. >>>> But this is a waste as it could be lots of sockets belonging to a >>>> particular >>>> cgroup. Alternatively, a separate map can be created with cgroup id as >>>> the key. >>>> But this will introduce additional overhead to manipulate the new map. >>>> A cgroup local storage, similar to existing sk/inode/task storage, >>>> should help for this use case. >>> >>>> The life-cycle of storage is managed with the life-cycle of the >>>> cgroup struct. i.e. the storage is destroyed along with the owning cgroup >>>> with a callback to the bpf_cgroup_storage_free when cgroup itself >>>> is deleted. >>> >>>> The userspace map operations can be done by using a cgroup fd as a key >>>> passed to the lookup, update and delete operations. 
>>> >>> >>> [..] >>> >>>> Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old cgroup >>>> local >>>> storage support, the new map name BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE is >>>> used >>>> for cgroup storage available to non-cgroup-attached bpf programs. The two >>>> helpers are named as bpf_cgroup_local_storage_get() and >>>> bpf_cgroup_local_storage_delete(). >>> >>> Have you considered doing something similar to 7d9c3427894f ("bpf: Make >>> cgroup storages shared between programs on the same cgroup") where >>> the map changes its behavior depending on the key size (see key_size checks >>> in cgroup_storage_map_alloc)? Looks like sizeof(int) for fd still >>> can be used so we can, in theory, reuse the name.. >>> >>> Pros: >>> - no need for a new map name >>> >>> Cons: >>> - existing BPF_MAP_TYPE_CGROUP_STORAGE is already messy; might be not a >>> good idea to add more stuff to it? >>> >>> But, for the very least, should we also extend >>> Documentation/bpf/map_cgroup_storage.rst to cover the new map? We've >>> tried to keep some of the important details in there.. >> >> This might be a long shot, but is it possible to switch completely to >> this new generic cgroup storage, and for programs that attach to >> cgroups we can still do lookups/allocations during attachment like we >> do today? IOW, maintain the current API for cgroup progs but switch it >> to use this new map type instead. >> >> It feels like this map type is more generic and can be a superset of >> the existing cgroup storage, but I feel like I am missing something. > > I feel like the biggest issue is that the existing > bpf_get_local_storage helper is guaranteed to always return non-null > and the verifier doesn't require the programs to do null checks on it; > the new helper might return NULL making all existing programs fail the > verifier. Ya, this is indeed the case. Another difference is the new helper is able to access data from different cgroups. 
The old helper, by contrast, can only access data from the *current* cgroup. > > There might be something else I don't remember at this point (besides > that weird per-prog_type that we'd have to emulate as well).. > >>> >>>> Signed-off-by: Yonghong Song <yhs@fb.com> >>>> --- >>>> include/linux/bpf.h | 3 + >>>> include/linux/bpf_types.h | 1 + >>>> include/linux/cgroup-defs.h | 4 + >>>> include/uapi/linux/bpf.h | 39 +++++ >>>> kernel/bpf/Makefile | 2 +- >>>> kernel/bpf/bpf_cgroup_storage.c | 280 ++++++++++++++++++++++++++++++++ >>>> kernel/bpf/helpers.c | 6 + >>>> kernel/bpf/syscall.c | 3 +- >>>> kernel/bpf/verifier.c | 14 +- >>>> kernel/cgroup/cgroup.c | 4 + >>>> kernel/trace/bpf_trace.c | 4 + >>>> scripts/bpf_doc.py | 2 + >>>> tools/include/uapi/linux/bpf.h | 39 +++++ >>>> 13 files changed, 398 insertions(+), 3 deletions(-) >>>> create mode 100644 kernel/bpf/bpf_cgroup_storage.c [...] ^ permalink raw reply [flat|nested] 38+ messages in thread
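[Editorial note: the access-scope difference Yonghong points out — implicit *current* cgroup versus an explicit cgroup argument — can be modeled in a few lines of plain C. Names and shapes are hypothetical; this illustrates only the calling convention, not real helper signatures:]

```c
#include <assert.h>
#include <stddef.h>

struct cgroup_sketch { long data; };

static struct cgroup_sketch current_cg = { 1 };
static struct cgroup_sketch other_cg = { 2 };

/* Old helper: implicitly bound to the *current* cgroup of the running
 * program -- there is no way to name a different cgroup. */
static long *get_local_storage_old(void)
{
	return &current_cg.data;
}

/* New helper: takes an explicit cgroup pointer, so e.g. a tracing
 * program can reach the storage of any cgroup it holds a trusted
 * pointer to, and must handle a NULL argument. */
static long *cgroup_local_storage_get(struct cgroup_sketch *cg)
{
	return cg ? &cg->data : NULL;
}
```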
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs 2022-10-17 18:25 ` Yosry Ahmed 2022-10-17 18:43 ` Stanislav Fomichev @ 2022-10-17 20:10 ` Yonghong Song 2022-10-17 20:14 ` Yosry Ahmed 1 sibling, 1 reply; 38+ messages in thread From: Yonghong Song @ 2022-10-17 20:10 UTC (permalink / raw) To: Yosry Ahmed, sdf Cc: Yonghong Song, bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo On 10/17/22 11:25 AM, Yosry Ahmed wrote: > On Mon, Oct 17, 2022 at 11:02 AM <sdf@google.com> wrote: >> >> On 10/13, Yonghong Song wrote: >>> Similar to sk/inode/task storage, implement similar cgroup local storage. >> >>> There already exists a local storage implementation for cgroup-attached >>> bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper >>> bpf_get_local_storage(). But there are use cases such that non-cgroup >>> attached bpf progs wants to access cgroup local storage data. For example, >>> tc egress prog has access to sk and cgroup. It is possible to use >>> sk local storage to emulate cgroup local storage by storing data in >>> socket. >>> But this is a waste as it could be lots of sockets belonging to a >>> particular >>> cgroup. Alternatively, a separate map can be created with cgroup id as >>> the key. >>> But this will introduce additional overhead to manipulate the new map. >>> A cgroup local storage, similar to existing sk/inode/task storage, >>> should help for this use case. >> >>> The life-cycle of storage is managed with the life-cycle of the >>> cgroup struct. i.e. the storage is destroyed along with the owning cgroup >>> with a callback to the bpf_cgroup_storage_free when cgroup itself >>> is deleted. >> >>> The userspace map operations can be done by using a cgroup fd as a key >>> passed to the lookup, update and delete operations. >> >> >> [..] 
>> >>> Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old cgroup >>> local >>> storage support, the new map name BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE is >>> used >>> for cgroup storage available to non-cgroup-attached bpf programs. The two >>> helpers are named as bpf_cgroup_local_storage_get() and >>> bpf_cgroup_local_storage_delete(). >> >> Have you considered doing something similar to 7d9c3427894f ("bpf: Make >> cgroup storages shared between programs on the same cgroup") where >> the map changes its behavior depending on the key size (see key_size checks >> in cgroup_storage_map_alloc)? Looks like sizeof(int) for fd still >> can be used so we can, in theory, reuse the name.. >> >> Pros: >> - no need for a new map name >> >> Cons: >> - existing BPF_MAP_TYPE_CGROUP_STORAGE is already messy; might be not a >> good idea to add more stuff to it? >> >> But, for the very least, should we also extend >> Documentation/bpf/map_cgroup_storage.rst to cover the new map? We've >> tried to keep some of the important details in there.. > > This might be a long shot, but is it possible to switch completely to > this new generic cgroup storage, and for programs that attach to > cgroups we can still do lookups/allocations during attachment like we > do today? IOW, maintain the current API for cgroup progs but switch it > to use this new map type instead. Right, cgroup attach/detach should not be impacted by this patch. > > It feels like this map type is more generic and can be a superset of > the existing cgroup storage, but I feel like I am missing something. One difference is old way cgroup local storage allocates the memory at map creation time, and the new way allocates the memory at runtime when get/update helper is called. 
> >> >>> Signed-off-by: Yonghong Song <yhs@fb.com> >>> --- >>> include/linux/bpf.h | 3 + >>> include/linux/bpf_types.h | 1 + >>> include/linux/cgroup-defs.h | 4 + >>> include/uapi/linux/bpf.h | 39 +++++ >>> kernel/bpf/Makefile | 2 +- >>> kernel/bpf/bpf_cgroup_storage.c | 280 ++++++++++++++++++++++++++++++++ >>> kernel/bpf/helpers.c | 6 + >>> kernel/bpf/syscall.c | 3 +- >>> kernel/bpf/verifier.c | 14 +- >>> kernel/cgroup/cgroup.c | 4 + >>> kernel/trace/bpf_trace.c | 4 + >>> scripts/bpf_doc.py | 2 + >>> tools/include/uapi/linux/bpf.h | 39 +++++ >>> 13 files changed, 398 insertions(+), 3 deletions(-) >>> create mode 100644 kernel/bpf/bpf_cgroup_storage.c >> [...] >>> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile >>> index 341c94f208f4..b02693f51978 100644 >>> --- a/kernel/bpf/Makefile >>> +++ b/kernel/bpf/Makefile >>> @@ -25,7 +25,7 @@ ifeq ($(CONFIG_PERF_EVENTS),y) >>> obj-$(CONFIG_BPF_SYSCALL) += stackmap.o >>> endif >>> ifeq ($(CONFIG_CGROUPS),y) >>> -obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o >>> +obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o bpf_cgroup_storage.o >>> endif >>> obj-$(CONFIG_CGROUP_BPF) += cgroup.o >>> ifeq ($(CONFIG_INET),y) >>> diff --git a/kernel/bpf/bpf_cgroup_storage.c >>> b/kernel/bpf/bpf_cgroup_storage.c >>> new file mode 100644 >>> index 000000000000..9974784822da >>> --- /dev/null >>> +++ b/kernel/bpf/bpf_cgroup_storage.c >>> @@ -0,0 +1,280 @@ >>> +// SPDX-License-Identifier: GPL-2.0 >>> +/* >>> + * Copyright (c) 2022 Meta Platforms, Inc. and affiliates. 
>>> + */ >>> + >>> +#include <linux/types.h> >>> +#include <linux/bpf.h> >>> +#include <linux/bpf_local_storage.h> >>> +#include <uapi/linux/btf.h> >>> +#include <linux/btf_ids.h> >>> + >>> +DEFINE_BPF_STORAGE_CACHE(cgroup_cache); >>> + >>> +static DEFINE_PER_CPU(int, bpf_cgroup_storage_busy); >>> + >>> +static void bpf_cgroup_storage_lock(void) >>> +{ >>> + migrate_disable(); >>> + this_cpu_inc(bpf_cgroup_storage_busy); >>> +} >>> + >>> +static void bpf_cgroup_storage_unlock(void) >>> +{ >>> + this_cpu_dec(bpf_cgroup_storage_busy); >>> + migrate_enable(); >>> +} >>> + >>> +static bool bpf_cgroup_storage_trylock(void) >>> +{ >>> + migrate_disable(); >>> + if (unlikely(this_cpu_inc_return(bpf_cgroup_storage_busy) != 1)) { >>> + this_cpu_dec(bpf_cgroup_storage_busy); >>> + migrate_enable(); >>> + return false; >>> + } >>> + return true; >>> +} >> >> Task storage has lock/unlock/trylock; inode storage doesn't; why does >> cgroup need it as well? I think so. the new cgroup local storage might be used in fentry/fexit programs which could cause recursion. >> >>> +static struct bpf_local_storage __rcu **cgroup_storage_ptr(void *owner) >>> +{ >>> + struct cgroup *cg = owner; >>> + >>> + return &cg->bpf_cgroup_storage; >>> +} >>> + >>> +void bpf_local_cgroup_storage_free(struct cgroup *cgroup) >>> +{ >>> + struct bpf_local_storage *local_storage; >>> + struct bpf_local_storage_elem *selem; >>> + bool free_cgroup_storage = false; >>> + struct hlist_node *n; >>> + unsigned long flags; >>> + >>> + rcu_read_lock(); >>> + local_storage = rcu_dereference(cgroup->bpf_cgroup_storage); >>> + if (!local_storage) { >>> + rcu_read_unlock(); >>> + return; >>> + } >>> + >>> + /* Neither the bpf_prog nor the bpf-map's syscall >>> + * could be modifying the local_storage->list now. >>> + * Thus, no elem can be added-to or deleted-from the >>> + * local_storage->list by the bpf_prog or by the bpf-map's syscall. 
>>> + * >>> + * It is racing with bpf_local_storage_map_free() alone >>> + * when unlinking elem from the local_storage->list and >>> + * the map's bucket->list. >>> + */ >>> + bpf_cgroup_storage_lock(); >>> + raw_spin_lock_irqsave(&local_storage->lock, flags); >>> + hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) { >>> + bpf_selem_unlink_map(selem); >>> + free_cgroup_storage = >>> + bpf_selem_unlink_storage_nolock(local_storage, selem, false, false); >>> + } >>> + raw_spin_unlock_irqrestore(&local_storage->lock, flags); >>> + bpf_cgroup_storage_unlock(); >>> + rcu_read_unlock(); >>> + >>> + /* free_cgroup_storage should always be true as long as >>> + * local_storage->list was non-empty. >>> + */ >>> + if (free_cgroup_storage) >>> + kfree_rcu(local_storage, rcu); >>> +} >> >>> +static struct bpf_local_storage_data * >>> +cgroup_storage_lookup(struct cgroup *cgroup, struct bpf_map *map, bool >>> cacheit_lockit) >>> +{ >>> + struct bpf_local_storage *cgroup_storage; >>> + struct bpf_local_storage_map *smap; >>> + >>> + cgroup_storage = rcu_dereference_check(cgroup->bpf_cgroup_storage, >>> + bpf_rcu_lock_held()); >>> + if (!cgroup_storage) >>> + return NULL; >>> + >>> + smap = (struct bpf_local_storage_map *)map; >>> + return bpf_local_storage_lookup(cgroup_storage, smap, cacheit_lockit); >>> +} >>> + >>> +static void *bpf_cgroup_storage_lookup_elem(struct bpf_map *map, void >>> *key) >>> +{ >>> + struct bpf_local_storage_data *sdata; >>> + struct cgroup *cgroup; >>> + int fd; >>> + >>> + fd = *(int *)key; >>> + cgroup = cgroup_get_from_fd(fd); >>> + if (IS_ERR(cgroup)) >>> + return ERR_CAST(cgroup); >>> + >>> + bpf_cgroup_storage_lock(); >>> + sdata = cgroup_storage_lookup(cgroup, map, true); >>> + bpf_cgroup_storage_unlock(); >>> + cgroup_put(cgroup); >>> + return sdata ? sdata->data : NULL; >>> +} >> >> A lot of the above (free/lookup) seems to be copy-pasted from the task >> storage; >> any point in trying to generalize the common parts? 
That is true. Let me think about this. >> >>> +static int bpf_cgroup_storage_update_elem(struct bpf_map *map, void *key, >>> + void *value, u64 map_flags) >>> +{ >>> + struct bpf_local_storage_data *sdata; >>> + struct cgroup *cgroup; >>> + int err, fd; >>> + >>> + fd = *(int *)key; >>> + cgroup = cgroup_get_from_fd(fd); >>> + if (IS_ERR(cgroup)) >>> + return PTR_ERR(cgroup); >>> + >>> + bpf_cgroup_storage_lock(); >>> + sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map >>> *)map, >>> + value, map_flags, GFP_ATOMIC); >>> + bpf_cgroup_storage_unlock(); >>> + err = PTR_ERR_OR_ZERO(sdata); >>> + cgroup_put(cgroup); >>> + return err; >>> +} >>> + [...] >>> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c >>> index 8ad2c267ff47..2fa2c950c7fb 100644 >>> --- a/kernel/cgroup/cgroup.c >>> +++ b/kernel/cgroup/cgroup.c >>> @@ -985,6 +985,10 @@ void put_css_set_locked(struct css_set *cset) >>> put_css_set_locked(cset->dom_cset); >>> } >> >>> +#ifdef CONFIG_BPF_SYSCALL >>> + bpf_local_cgroup_storage_free(cset->dfl_cgrp); >>> +#endif >>> + > > I am confused about this freeing site. It seems like this path is for > freeing css_set's of task_structs, not for freeing the cgroup itself. > Wouldn't we want to free the local storage when we free the cgroup > itself? Somewhere like css_free_rwork_fn()? or did I completely miss > the point here? Thanks for suggestions here. To be honest, I am not sure whether this location is correct or not. I will look at css_free_rwork_fn() which might be a good place. > >>> kfree_rcu(cset, rcu_head); >>> } >> [...] ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs 2022-10-17 20:10 ` Yonghong Song @ 2022-10-17 20:14 ` Yosry Ahmed 2022-10-17 20:29 ` Yonghong Song 0 siblings, 1 reply; 38+ messages in thread From: Yosry Ahmed @ 2022-10-17 20:14 UTC (permalink / raw) To: Yonghong Song Cc: sdf, Yonghong Song, bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo On Mon, Oct 17, 2022 at 1:10 PM Yonghong Song <yhs@meta.com> wrote: > > > > On 10/17/22 11:25 AM, Yosry Ahmed wrote: > > On Mon, Oct 17, 2022 at 11:02 AM <sdf@google.com> wrote: > >> > >> On 10/13, Yonghong Song wrote: > >>> Similar to sk/inode/task storage, implement similar cgroup local storage. > >> > >>> There already exists a local storage implementation for cgroup-attached > >>> bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper > >>> bpf_get_local_storage(). But there are use cases such that non-cgroup > >>> attached bpf progs wants to access cgroup local storage data. For example, > >>> tc egress prog has access to sk and cgroup. It is possible to use > >>> sk local storage to emulate cgroup local storage by storing data in > >>> socket. > >>> But this is a waste as it could be lots of sockets belonging to a > >>> particular > >>> cgroup. Alternatively, a separate map can be created with cgroup id as > >>> the key. > >>> But this will introduce additional overhead to manipulate the new map. > >>> A cgroup local storage, similar to existing sk/inode/task storage, > >>> should help for this use case. > >> > >>> The life-cycle of storage is managed with the life-cycle of the > >>> cgroup struct. i.e. the storage is destroyed along with the owning cgroup > >>> with a callback to the bpf_cgroup_storage_free when cgroup itself > >>> is deleted. > >> > >>> The userspace map operations can be done by using a cgroup fd as a key > >>> passed to the lookup, update and delete operations. > >> > >> > >> [..] 
> >> > >>> Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old cgroup > >>> local > >>> storage support, the new map name BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE is > >>> used > >>> for cgroup storage available to non-cgroup-attached bpf programs. The two > >>> helpers are named as bpf_cgroup_local_storage_get() and > >>> bpf_cgroup_local_storage_delete(). > >> > >> Have you considered doing something similar to 7d9c3427894f ("bpf: Make > >> cgroup storages shared between programs on the same cgroup") where > >> the map changes its behavior depending on the key size (see key_size checks > >> in cgroup_storage_map_alloc)? Looks like sizeof(int) for fd still > >> can be used so we can, in theory, reuse the name.. > >> > >> Pros: > >> - no need for a new map name > >> > >> Cons: > >> - existing BPF_MAP_TYPE_CGROUP_STORAGE is already messy; might be not a > >> good idea to add more stuff to it? > >> > >> But, for the very least, should we also extend > >> Documentation/bpf/map_cgroup_storage.rst to cover the new map? We've > >> tried to keep some of the important details in there.. > > > > This might be a long shot, but is it possible to switch completely to > > this new generic cgroup storage, and for programs that attach to > > cgroups we can still do lookups/allocations during attachment like we > > do today? IOW, maintain the current API for cgroup progs but switch it > > to use this new map type instead. > > Right, cgroup attach/detach should not be impacted by this patch. > > > > > It feels like this map type is more generic and can be a superset of > > the existing cgroup storage, but I feel like I am missing something. > > One difference is old way cgroup local storage allocates the memory > at map creation time, and the new way allocates the memory at runtime > when get/update helper is called. > IIUC the old cgroup local storage allocates memory when a program is attached. We can have the same behavior with the new map type, right? 
When a program is attached to a cgroup, allocate the memory, otherwise it is allocated at run time. Does this make sense? > > > >> > >>> Signed-off-by: Yonghong Song <yhs@fb.com> > >>> --- > >>> include/linux/bpf.h | 3 + > >>> include/linux/bpf_types.h | 1 + > >>> include/linux/cgroup-defs.h | 4 + > >>> include/uapi/linux/bpf.h | 39 +++++ > >>> kernel/bpf/Makefile | 2 +- > >>> kernel/bpf/bpf_cgroup_storage.c | 280 ++++++++++++++++++++++++++++++++ > >>> kernel/bpf/helpers.c | 6 + > >>> kernel/bpf/syscall.c | 3 +- > >>> kernel/bpf/verifier.c | 14 +- > >>> kernel/cgroup/cgroup.c | 4 + > >>> kernel/trace/bpf_trace.c | 4 + > >>> scripts/bpf_doc.py | 2 + > >>> tools/include/uapi/linux/bpf.h | 39 +++++ > >>> 13 files changed, 398 insertions(+), 3 deletions(-) > >>> create mode 100644 kernel/bpf/bpf_cgroup_storage.c > >> > [...] > >>> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile > >>> index 341c94f208f4..b02693f51978 100644 > >>> --- a/kernel/bpf/Makefile > >>> +++ b/kernel/bpf/Makefile > >>> @@ -25,7 +25,7 @@ ifeq ($(CONFIG_PERF_EVENTS),y) > >>> obj-$(CONFIG_BPF_SYSCALL) += stackmap.o > >>> endif > >>> ifeq ($(CONFIG_CGROUPS),y) > >>> -obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o > >>> +obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o bpf_cgroup_storage.o > >>> endif > >>> obj-$(CONFIG_CGROUP_BPF) += cgroup.o > >>> ifeq ($(CONFIG_INET),y) > >>> diff --git a/kernel/bpf/bpf_cgroup_storage.c > >>> b/kernel/bpf/bpf_cgroup_storage.c > >>> new file mode 100644 > >>> index 000000000000..9974784822da > >>> --- /dev/null > >>> +++ b/kernel/bpf/bpf_cgroup_storage.c > >>> @@ -0,0 +1,280 @@ > >>> +// SPDX-License-Identifier: GPL-2.0 > >>> +/* > >>> + * Copyright (c) 2022 Meta Platforms, Inc. and affiliates. 
> >>> + */ > >>> + > >>> +#include <linux/types.h> > >>> +#include <linux/bpf.h> > >>> +#include <linux/bpf_local_storage.h> > >>> +#include <uapi/linux/btf.h> > >>> +#include <linux/btf_ids.h> > >>> + > >>> +DEFINE_BPF_STORAGE_CACHE(cgroup_cache); > >>> + > >>> +static DEFINE_PER_CPU(int, bpf_cgroup_storage_busy); > >>> + > >>> +static void bpf_cgroup_storage_lock(void) > >>> +{ > >>> + migrate_disable(); > >>> + this_cpu_inc(bpf_cgroup_storage_busy); > >>> +} > >>> + > >>> +static void bpf_cgroup_storage_unlock(void) > >>> +{ > >>> + this_cpu_dec(bpf_cgroup_storage_busy); > >>> + migrate_enable(); > >>> +} > >>> + > >>> +static bool bpf_cgroup_storage_trylock(void) > >>> +{ > >>> + migrate_disable(); > >>> + if (unlikely(this_cpu_inc_return(bpf_cgroup_storage_busy) != 1)) { > >>> + this_cpu_dec(bpf_cgroup_storage_busy); > >>> + migrate_enable(); > >>> + return false; > >>> + } > >>> + return true; > >>> +} > >> > >> Task storage has lock/unlock/trylock; inode storage doesn't; why does > >> cgroup need it as well? > > I think so. the new cgroup local storage might be used in fentry/fexit > programs which could cause recursion. > > >> > >>> +static struct bpf_local_storage __rcu **cgroup_storage_ptr(void *owner) > >>> +{ > >>> + struct cgroup *cg = owner; > >>> + > >>> + return &cg->bpf_cgroup_storage; > >>> +} > >>> + > >>> +void bpf_local_cgroup_storage_free(struct cgroup *cgroup) > >>> +{ > >>> + struct bpf_local_storage *local_storage; > >>> + struct bpf_local_storage_elem *selem; > >>> + bool free_cgroup_storage = false; > >>> + struct hlist_node *n; > >>> + unsigned long flags; > >>> + > >>> + rcu_read_lock(); > >>> + local_storage = rcu_dereference(cgroup->bpf_cgroup_storage); > >>> + if (!local_storage) { > >>> + rcu_read_unlock(); > >>> + return; > >>> + } > >>> + > >>> + /* Neither the bpf_prog nor the bpf-map's syscall > >>> + * could be modifying the local_storage->list now. 
> >>> + * Thus, no elem can be added-to or deleted-from the > >>> + * local_storage->list by the bpf_prog or by the bpf-map's syscall. > >>> + * > >>> + * It is racing with bpf_local_storage_map_free() alone > >>> + * when unlinking elem from the local_storage->list and > >>> + * the map's bucket->list. > >>> + */ > >>> + bpf_cgroup_storage_lock(); > >>> + raw_spin_lock_irqsave(&local_storage->lock, flags); > >>> + hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) { > >>> + bpf_selem_unlink_map(selem); > >>> + free_cgroup_storage = > >>> + bpf_selem_unlink_storage_nolock(local_storage, selem, false, false); > >>> + } > >>> + raw_spin_unlock_irqrestore(&local_storage->lock, flags); > >>> + bpf_cgroup_storage_unlock(); > >>> + rcu_read_unlock(); > >>> + > >>> + /* free_cgroup_storage should always be true as long as > >>> + * local_storage->list was non-empty. > >>> + */ > >>> + if (free_cgroup_storage) > >>> + kfree_rcu(local_storage, rcu); > >>> +} > >> > >>> +static struct bpf_local_storage_data * > >>> +cgroup_storage_lookup(struct cgroup *cgroup, struct bpf_map *map, bool > >>> cacheit_lockit) > >>> +{ > >>> + struct bpf_local_storage *cgroup_storage; > >>> + struct bpf_local_storage_map *smap; > >>> + > >>> + cgroup_storage = rcu_dereference_check(cgroup->bpf_cgroup_storage, > >>> + bpf_rcu_lock_held()); > >>> + if (!cgroup_storage) > >>> + return NULL; > >>> + > >>> + smap = (struct bpf_local_storage_map *)map; > >>> + return bpf_local_storage_lookup(cgroup_storage, smap, cacheit_lockit); > >>> +} > >>> + > >>> +static void *bpf_cgroup_storage_lookup_elem(struct bpf_map *map, void > >>> *key) > >>> +{ > >>> + struct bpf_local_storage_data *sdata; > >>> + struct cgroup *cgroup; > >>> + int fd; > >>> + > >>> + fd = *(int *)key; > >>> + cgroup = cgroup_get_from_fd(fd); > >>> + if (IS_ERR(cgroup)) > >>> + return ERR_CAST(cgroup); > >>> + > >>> + bpf_cgroup_storage_lock(); > >>> + sdata = cgroup_storage_lookup(cgroup, map, true); > >>> + 
bpf_cgroup_storage_unlock(); > >>> + cgroup_put(cgroup); > >>> + return sdata ? sdata->data : NULL; > >>> +} > >> > >> A lot of the above (free/lookup) seems to be copy-pasted from the task > >> storage; > >> any point in trying to generalize the common parts? > > That is true. Let me think about this. > > >> > >>> +static int bpf_cgroup_storage_update_elem(struct bpf_map *map, void *key, > >>> + void *value, u64 map_flags) > >>> +{ > >>> + struct bpf_local_storage_data *sdata; > >>> + struct cgroup *cgroup; > >>> + int err, fd; > >>> + > >>> + fd = *(int *)key; > >>> + cgroup = cgroup_get_from_fd(fd); > >>> + if (IS_ERR(cgroup)) > >>> + return PTR_ERR(cgroup); > >>> + > >>> + bpf_cgroup_storage_lock(); > >>> + sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map > >>> *)map, > >>> + value, map_flags, GFP_ATOMIC); > >>> + bpf_cgroup_storage_unlock(); > >>> + err = PTR_ERR_OR_ZERO(sdata); > >>> + cgroup_put(cgroup); > >>> + return err; > >>> +} > >>> + > [...] > >>> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c > >>> index 8ad2c267ff47..2fa2c950c7fb 100644 > >>> --- a/kernel/cgroup/cgroup.c > >>> +++ b/kernel/cgroup/cgroup.c > >>> @@ -985,6 +985,10 @@ void put_css_set_locked(struct css_set *cset) > >>> put_css_set_locked(cset->dom_cset); > >>> } > >> > >>> +#ifdef CONFIG_BPF_SYSCALL > >>> + bpf_local_cgroup_storage_free(cset->dfl_cgrp); > >>> +#endif > >>> + > > > > I am confused about this freeing site. It seems like this path is for > > freeing css_set's of task_structs, not for freeing the cgroup itself. > > Wouldn't we want to free the local storage when we free the cgroup > > itself? Somewhere like css_free_rwork_fn()? or did I completely miss > > the point here? > > Thanks for suggestions here. To be honest, I am not sure whether this > location is correct or not. I will look at css_free_rwork_fn() which > might be a good place. > > > > >>> kfree_rcu(cset, rcu_head); > >>> } > >> > [...] 
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs 2022-10-17 20:14 ` Yosry Ahmed @ 2022-10-17 20:29 ` Yonghong Song 0 siblings, 0 replies; 38+ messages in thread From: Yonghong Song @ 2022-10-17 20:29 UTC (permalink / raw) To: Yosry Ahmed Cc: sdf, Yonghong Song, bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo On 10/17/22 1:14 PM, Yosry Ahmed wrote: > On Mon, Oct 17, 2022 at 1:10 PM Yonghong Song <yhs@meta.com> wrote: >> >> >> >> On 10/17/22 11:25 AM, Yosry Ahmed wrote: >>> On Mon, Oct 17, 2022 at 11:02 AM <sdf@google.com> wrote: >>>> >>>> On 10/13, Yonghong Song wrote: >>>>> Similar to sk/inode/task storage, implement similar cgroup local storage. >>>> >>>>> There already exists a local storage implementation for cgroup-attached >>>>> bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper >>>>> bpf_get_local_storage(). But there are use cases such that non-cgroup >>>>> attached bpf progs wants to access cgroup local storage data. For example, >>>>> tc egress prog has access to sk and cgroup. It is possible to use >>>>> sk local storage to emulate cgroup local storage by storing data in >>>>> socket. >>>>> But this is a waste as it could be lots of sockets belonging to a >>>>> particular >>>>> cgroup. Alternatively, a separate map can be created with cgroup id as >>>>> the key. >>>>> But this will introduce additional overhead to manipulate the new map. >>>>> A cgroup local storage, similar to existing sk/inode/task storage, >>>>> should help for this use case. >>>> >>>>> The life-cycle of storage is managed with the life-cycle of the >>>>> cgroup struct. i.e. the storage is destroyed along with the owning cgroup >>>>> with a callback to the bpf_cgroup_storage_free when cgroup itself >>>>> is deleted. >>>> >>>>> The userspace map operations can be done by using a cgroup fd as a key >>>>> passed to the lookup, update and delete operations. 
>>>> >>>> >>>> [..] >>>> >>>>> Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old cgroup >>>>> local >>>>> storage support, the new map name BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE is >>>>> used >>>>> for cgroup storage available to non-cgroup-attached bpf programs. The two >>>>> helpers are named as bpf_cgroup_local_storage_get() and >>>>> bpf_cgroup_local_storage_delete(). >>>> >>>> Have you considered doing something similar to 7d9c3427894f ("bpf: Make >>>> cgroup storages shared between programs on the same cgroup") where >>>> the map changes its behavior depending on the key size (see key_size checks >>>> in cgroup_storage_map_alloc)? Looks like sizeof(int) for fd still >>>> can be used so we can, in theory, reuse the name.. >>>> >>>> Pros: >>>> - no need for a new map name >>>> >>>> Cons: >>>> - existing BPF_MAP_TYPE_CGROUP_STORAGE is already messy; might be not a >>>> good idea to add more stuff to it? >>>> >>>> But, for the very least, should we also extend >>>> Documentation/bpf/map_cgroup_storage.rst to cover the new map? We've >>>> tried to keep some of the important details in there.. >>> >>> This might be a long shot, but is it possible to switch completely to >>> this new generic cgroup storage, and for programs that attach to >>> cgroups we can still do lookups/allocations during attachment like we >>> do today? IOW, maintain the current API for cgroup progs but switch it >>> to use this new map type instead. >> >> Right, cgroup attach/detach should not be impacted by this patch. >> >>> >>> It feels like this map type is more generic and can be a superset of >>> the existing cgroup storage, but I feel like I am missing something. >> >> One difference is old way cgroup local storage allocates the memory >> at map creation time, and the new way allocates the memory at runtime >> when get/update helper is called. >> > > IIUC the old cgroup local storage allocates memory when a program is > attached. 
Ya, meta data memory is allocated in map creation time but real storage is allocated at attach time. > We can have the same behavior with the new map type, right? > When a program is attached to a cgroup, allocate the memory, otherwise > it is allocated at run time. Does this make sense? I would like to keep the new functionality flexible so that even if a program attaching to a cgroup it can still access other cgroup local storage. > >>> >>>> >>>>> Signed-off-by: Yonghong Song <yhs@fb.com> >>>>> --- >>>>> include/linux/bpf.h | 3 + >>>>> include/linux/bpf_types.h | 1 + >>>>> include/linux/cgroup-defs.h | 4 + >>>>> include/uapi/linux/bpf.h | 39 +++++ >>>>> kernel/bpf/Makefile | 2 +- >>>>> kernel/bpf/bpf_cgroup_storage.c | 280 ++++++++++++++++++++++++++++++++ >>>>> kernel/bpf/helpers.c | 6 + >>>>> kernel/bpf/syscall.c | 3 +- >>>>> kernel/bpf/verifier.c | 14 +- >>>>> kernel/cgroup/cgroup.c | 4 + >>>>> kernel/trace/bpf_trace.c | 4 + >>>>> scripts/bpf_doc.py | 2 + >>>>> tools/include/uapi/linux/bpf.h | 39 +++++ >>>>> 13 files changed, 398 insertions(+), 3 deletions(-) >>>>> create mode 100644 kernel/bpf/bpf_cgroup_storage.c >>>> [...] ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs 2022-10-17 18:01 ` sdf 2022-10-17 18:25 ` Yosry Ahmed @ 2022-10-17 19:23 ` Yonghong Song 2022-10-17 21:03 ` Stanislav Fomichev 2022-10-17 22:26 ` Martin KaFai Lau 2 siblings, 1 reply; 38+ messages in thread From: Yonghong Song @ 2022-10-17 19:23 UTC (permalink / raw) To: sdf, Yonghong Song Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo On 10/17/22 11:01 AM, sdf@google.com wrote: > On 10/13, Yonghong Song wrote: >> Similar to sk/inode/task storage, implement similar cgroup local storage. > >> There already exists a local storage implementation for cgroup-attached >> bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper >> bpf_get_local_storage(). But there are use cases such that non-cgroup >> attached bpf progs wants to access cgroup local storage data. For >> example, >> tc egress prog has access to sk and cgroup. It is possible to use >> sk local storage to emulate cgroup local storage by storing data in >> socket. >> But this is a waste as it could be lots of sockets belonging to a >> particular >> cgroup. Alternatively, a separate map can be created with cgroup id as >> the key. >> But this will introduce additional overhead to manipulate the new map. >> A cgroup local storage, similar to existing sk/inode/task storage, >> should help for this use case. > >> The life-cycle of storage is managed with the life-cycle of the >> cgroup struct. i.e. the storage is destroyed along with the owning >> cgroup >> with a callback to the bpf_cgroup_storage_free when cgroup itself >> is deleted. > >> The userspace map operations can be done by using a cgroup fd as a key >> passed to the lookup, update and delete operations. > > > [..] 
> >> Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old >> cgroup local >> storage support, the new map name BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE is >> used >> for cgroup storage available to non-cgroup-attached bpf programs. The two >> helpers are named as bpf_cgroup_local_storage_get() and >> bpf_cgroup_local_storage_delete(). > > Have you considered doing something similar to 7d9c3427894f ("bpf: Make > cgroup storages shared between programs on the same cgroup") where > the map changes its behavior depending on the key size (see key_size checks > in cgroup_storage_map_alloc)? Looks like sizeof(int) for fd still > can be used so we can, in theory, reuse the name.. > > Pros: > - no need for a new map name > > Cons: > - existing BPF_MAP_TYPE_CGROUP_STORAGE is already messy; might be not a > good idea to add more stuff to it? Thinking differently. I think I would have reuse the same map name (BPF_MAP_TYPE_CGROUP_STORAGE) but with a flag like BPF_F_LOCAL_STORAGE_GENERIC). We could use map_extra as well, but I think an explicit flag might be better. > > But, for the very least, should we also extend > Documentation/bpf/map_cgroup_storage.rst to cover the new map? We've > tried to keep some of the important details in there.. 
> >> Signed-off-by: Yonghong Song <yhs@fb.com> >> --- >> include/linux/bpf.h | 3 + >> include/linux/bpf_types.h | 1 + >> include/linux/cgroup-defs.h | 4 + >> include/uapi/linux/bpf.h | 39 +++++ >> kernel/bpf/Makefile | 2 +- >> kernel/bpf/bpf_cgroup_storage.c | 280 ++++++++++++++++++++++++++++++++ >> kernel/bpf/helpers.c | 6 + >> kernel/bpf/syscall.c | 3 +- >> kernel/bpf/verifier.c | 14 +- >> kernel/cgroup/cgroup.c | 4 + >> kernel/trace/bpf_trace.c | 4 + >> scripts/bpf_doc.py | 2 + >> tools/include/uapi/linux/bpf.h | 39 +++++ >> 13 files changed, 398 insertions(+), 3 deletions(-) >> create mode 100644 kernel/bpf/bpf_cgroup_storage.c > >> diff --git a/include/linux/bpf.h b/include/linux/bpf.h >> index 9e7d46d16032..1395a01c7f18 100644 >> --- a/include/linux/bpf.h >> +++ b/include/linux/bpf.h >> @@ -2045,6 +2045,7 @@ struct bpf_link *bpf_link_by_id(u32 id); > >> const struct bpf_func_proto *bpf_base_func_proto(enum bpf_func_id >> func_id); >> void bpf_task_storage_free(struct task_struct *task); >> +void bpf_local_cgroup_storage_free(struct cgroup *cgroup); >> bool bpf_prog_has_kfunc_call(const struct bpf_prog *prog); >> const struct btf_func_model * >> bpf_jit_find_kfunc_model(const struct bpf_prog *prog, >> @@ -2537,6 +2538,8 @@ extern const struct bpf_func_proto >> bpf_copy_from_user_task_proto; >> extern const struct bpf_func_proto bpf_set_retval_proto; >> extern const struct bpf_func_proto bpf_get_retval_proto; >> extern const struct bpf_func_proto bpf_user_ringbuf_drain_proto; >> +extern const struct bpf_func_proto bpf_cgroup_storage_get_proto; >> +extern const struct bpf_func_proto bpf_cgroup_storage_delete_proto; > >> const struct bpf_func_proto *tracing_prog_func_proto( >> enum bpf_func_id func_id, const struct bpf_prog *prog); >> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h >> index 2c6a4f2562a7..7a0362d7a0aa 100644 >> --- a/include/linux/bpf_types.h >> +++ b/include/linux/bpf_types.h >> @@ -90,6 +90,7 @@ 
BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_ARRAY, >> cgroup_array_map_ops) >> #ifdef CONFIG_CGROUP_BPF >> BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_STORAGE, cgroup_storage_map_ops) >> BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE, >> cgroup_storage_map_ops) >> +BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, >> cgroup_local_storage_map_ops) >> #endif >> BPF_MAP_TYPE(BPF_MAP_TYPE_HASH, htab_map_ops) >> BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_HASH, htab_percpu_map_ops) >> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h >> index 4bcf56b3491c..c6f4590dda68 100644 >> --- a/include/linux/cgroup-defs.h >> +++ b/include/linux/cgroup-defs.h >> @@ -504,6 +504,10 @@ struct cgroup { >> /* Used to store internal freezer state */ >> struct cgroup_freezer_state freezer; > >> +#ifdef CONFIG_BPF_SYSCALL >> + struct bpf_local_storage __rcu *bpf_cgroup_storage; >> +#endif >> + >> /* ids of the ancestors at each level including self */ >> u64 ancestor_ids[]; >> }; >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h >> index 17f61338f8f8..d918b4054297 100644 >> --- a/include/uapi/linux/bpf.h >> +++ b/include/uapi/linux/bpf.h >> @@ -935,6 +935,7 @@ enum bpf_map_type { >> BPF_MAP_TYPE_TASK_STORAGE, >> BPF_MAP_TYPE_BLOOM_FILTER, >> BPF_MAP_TYPE_USER_RINGBUF, >> + BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, >> }; > >> /* Note that tracing related programs such as >> @@ -5435,6 +5436,42 @@ union bpf_attr { >> * **-E2BIG** if user-space has tried to publish a sample >> which is >> * larger than the size of the ring buffer, or which cannot fit >> * within a struct bpf_dynptr. >> + * >> + * void *bpf_cgroup_local_storage_get(struct bpf_map *map, struct >> cgroup *cgroup, void *value, u64 flags) >> + * Description >> + * Get a bpf_local_storage from the *cgroup*. >> + * >> + * Logically, it could be thought of as getting the value from >> + * a *map* with *cgroup* as the **key**. 
From this >> + * perspective, the usage is not much different from >> + * **bpf_map_lookup_elem**\ (*map*, **&**\ *cgroup*) except this >> + * helper enforces the key must be a cgroup struct and the map >> must also >> + * be a **BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE**. >> + * >> + * Underneath, the value is stored locally at *cgroup* instead of >> + * the *map*. The *map* is used as the bpf-local-storage >> + * "type". The bpf-local-storage "type" (i.e. the *map*) is >> + * searched against all bpf_local_storage residing at *cgroup*. >> + * >> + * An optional *flags* (**BPF_LOCAL_STORAGE_GET_F_CREATE**) >> can be >> + * used such that a new bpf_local_storage will be >> + * created if one does not exist. *value* can be used >> + * together with **BPF_LOCAL_STORAGE_GET_F_CREATE** to specify >> + * the initial value of a bpf_local_storage. If *value* is >> + * **NULL**, the new bpf_local_storage will be zero initialized. >> + * Return >> + * A bpf_local_storage pointer is returned on success. >> + * >> + * **NULL** if not found or there was an error in adding >> + * a new bpf_local_storage. >> + * >> + * long bpf_cgroup_local_storage_delete(struct bpf_map *map, struct >> cgroup *cgroup) >> + * Description >> + * Delete a bpf_local_storage from a *cgroup*. >> + * Return >> + * 0 on success. >> + * >> + * **-ENOENT** if the bpf_local_storage cannot be found. >> */ >> #define ___BPF_FUNC_MAPPER(FN, ctx...) 
\ >> FN(unspec, 0, ##ctx) \ >> @@ -5647,6 +5684,8 @@ union bpf_attr { >> FN(tcp_raw_check_syncookie_ipv6, 207, ##ctx) \ >> FN(ktime_get_tai_ns, 208, ##ctx) \ >> FN(user_ringbuf_drain, 209, ##ctx) \ >> + FN(cgroup_local_storage_get, 210, ##ctx) \ >> + FN(cgroup_local_storage_delete, 211, ##ctx) \ >> /* */ > >> /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER >> that don't >> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile >> index 341c94f208f4..b02693f51978 100644 >> --- a/kernel/bpf/Makefile >> +++ b/kernel/bpf/Makefile >> @@ -25,7 +25,7 @@ ifeq ($(CONFIG_PERF_EVENTS),y) >> obj-$(CONFIG_BPF_SYSCALL) += stackmap.o >> endif >> ifeq ($(CONFIG_CGROUPS),y) >> -obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o >> +obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o bpf_cgroup_storage.o >> endif >> obj-$(CONFIG_CGROUP_BPF) += cgroup.o >> ifeq ($(CONFIG_INET),y) >> diff --git a/kernel/bpf/bpf_cgroup_storage.c >> b/kernel/bpf/bpf_cgroup_storage.c >> new file mode 100644 >> index 000000000000..9974784822da >> --- /dev/null >> +++ b/kernel/bpf/bpf_cgroup_storage.c >> @@ -0,0 +1,280 @@ >> +// SPDX-License-Identifier: GPL-2.0 >> +/* >> + * Copyright (c) 2022 Meta Platforms, Inc. and affiliates. 
>> + */ >> + >> +#include <linux/types.h> >> +#include <linux/bpf.h> >> +#include <linux/bpf_local_storage.h> >> +#include <uapi/linux/btf.h> >> +#include <linux/btf_ids.h> >> + >> +DEFINE_BPF_STORAGE_CACHE(cgroup_cache); >> + >> +static DEFINE_PER_CPU(int, bpf_cgroup_storage_busy); >> + >> +static void bpf_cgroup_storage_lock(void) >> +{ >> + migrate_disable(); >> + this_cpu_inc(bpf_cgroup_storage_busy); >> +} >> + >> +static void bpf_cgroup_storage_unlock(void) >> +{ >> + this_cpu_dec(bpf_cgroup_storage_busy); >> + migrate_enable(); >> +} >> + >> +static bool bpf_cgroup_storage_trylock(void) >> +{ >> + migrate_disable(); >> + if (unlikely(this_cpu_inc_return(bpf_cgroup_storage_busy) != 1)) { >> + this_cpu_dec(bpf_cgroup_storage_busy); >> + migrate_enable(); >> + return false; >> + } >> + return true; >> +} > > Task storage has lock/unlock/trylock; inode storage doesn't; why does > cgroup need it as well? > >> +static struct bpf_local_storage __rcu **cgroup_storage_ptr(void *owner) >> +{ >> + struct cgroup *cg = owner; >> + >> + return &cg->bpf_cgroup_storage; >> +} >> + >> +void bpf_local_cgroup_storage_free(struct cgroup *cgroup) >> +{ >> + struct bpf_local_storage *local_storage; >> + struct bpf_local_storage_elem *selem; >> + bool free_cgroup_storage = false; >> + struct hlist_node *n; >> + unsigned long flags; >> + >> + rcu_read_lock(); >> + local_storage = rcu_dereference(cgroup->bpf_cgroup_storage); >> + if (!local_storage) { >> + rcu_read_unlock(); >> + return; >> + } >> + >> + /* Neither the bpf_prog nor the bpf-map's syscall >> + * could be modifying the local_storage->list now. >> + * Thus, no elem can be added-to or deleted-from the >> + * local_storage->list by the bpf_prog or by the bpf-map's syscall. >> + * >> + * It is racing with bpf_local_storage_map_free() alone >> + * when unlinking elem from the local_storage->list and >> + * the map's bucket->list. 
>> + */ >> + bpf_cgroup_storage_lock(); >> + raw_spin_lock_irqsave(&local_storage->lock, flags); >> + hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) { >> + bpf_selem_unlink_map(selem); >> + free_cgroup_storage = >> + bpf_selem_unlink_storage_nolock(local_storage, selem, >> false, false); >> + } >> + raw_spin_unlock_irqrestore(&local_storage->lock, flags); >> + bpf_cgroup_storage_unlock(); >> + rcu_read_unlock(); >> + >> + /* free_cgroup_storage should always be true as long as >> + * local_storage->list was non-empty. >> + */ >> + if (free_cgroup_storage) >> + kfree_rcu(local_storage, rcu); >> +} > >> +static struct bpf_local_storage_data * >> +cgroup_storage_lookup(struct cgroup *cgroup, struct bpf_map *map, >> bool cacheit_lockit) >> +{ >> + struct bpf_local_storage *cgroup_storage; >> + struct bpf_local_storage_map *smap; >> + >> + cgroup_storage = rcu_dereference_check(cgroup->bpf_cgroup_storage, >> + bpf_rcu_lock_held()); >> + if (!cgroup_storage) >> + return NULL; >> + >> + smap = (struct bpf_local_storage_map *)map; >> + return bpf_local_storage_lookup(cgroup_storage, smap, >> cacheit_lockit); >> +} >> + >> +static void *bpf_cgroup_storage_lookup_elem(struct bpf_map *map, void >> *key) >> +{ >> + struct bpf_local_storage_data *sdata; >> + struct cgroup *cgroup; >> + int fd; >> + >> + fd = *(int *)key; >> + cgroup = cgroup_get_from_fd(fd); >> + if (IS_ERR(cgroup)) >> + return ERR_CAST(cgroup); >> + >> + bpf_cgroup_storage_lock(); >> + sdata = cgroup_storage_lookup(cgroup, map, true); >> + bpf_cgroup_storage_unlock(); >> + cgroup_put(cgroup); >> + return sdata ? sdata->data : NULL; >> +} > > A lot of the above (free/lookup) seems to be copy-pasted from the task > storage; > any point in trying to generalize the common parts? 
> >> +static int bpf_cgroup_storage_update_elem(struct bpf_map *map, void >> *key, >> + void *value, u64 map_flags) >> +{ >> + struct bpf_local_storage_data *sdata; >> + struct cgroup *cgroup; >> + int err, fd; >> + >> + fd = *(int *)key; >> + cgroup = cgroup_get_from_fd(fd); >> + if (IS_ERR(cgroup)) >> + return PTR_ERR(cgroup); >> + >> + bpf_cgroup_storage_lock(); >> + sdata = bpf_local_storage_update(cgroup, (struct >> bpf_local_storage_map *)map, >> + value, map_flags, GFP_ATOMIC); >> + bpf_cgroup_storage_unlock(); >> + err = PTR_ERR_OR_ZERO(sdata); >> + cgroup_put(cgroup); >> + return err; >> +} >> + >> +static int cgroup_storage_delete(struct cgroup *cgroup, struct >> bpf_map *map) >> +{ >> + struct bpf_local_storage_data *sdata; >> + >> + sdata = cgroup_storage_lookup(cgroup, map, false); >> + if (!sdata) >> + return -ENOENT; >> + >> + bpf_selem_unlink(SELEM(sdata), true); >> + return 0; >> +} >> + >> +static int bpf_cgroup_storage_delete_elem(struct bpf_map *map, void >> *key) >> +{ >> + struct cgroup *cgroup; >> + int err, fd; >> + >> + fd = *(int *)key; >> + cgroup = cgroup_get_from_fd(fd); >> + if (IS_ERR(cgroup)) >> + return PTR_ERR(cgroup); >> + >> + bpf_cgroup_storage_lock(); >> + err = cgroup_storage_delete(cgroup, map); >> + bpf_cgroup_storage_unlock(); >> + if (err) >> + return err; >> + >> + cgroup_put(cgroup); >> + return 0; >> +} >> + >> +static int notsupp_get_next_key(struct bpf_map *map, void *key, void >> *next_key) >> +{ >> + return -ENOTSUPP; >> +} >> + >> +static struct bpf_map *cgroup_storage_map_alloc(union bpf_attr *attr) >> +{ >> + struct bpf_local_storage_map *smap; >> + >> + smap = bpf_local_storage_map_alloc(attr); >> + if (IS_ERR(smap)) >> + return ERR_CAST(smap); >> + >> + smap->cache_idx = bpf_local_storage_cache_idx_get(&cgroup_cache); >> + return &smap->map; >> +} >> + >> +static void cgroup_storage_map_free(struct bpf_map *map) >> +{ >> + struct bpf_local_storage_map *smap; >> + >> + smap = (struct bpf_local_storage_map *)map; 
>> + bpf_local_storage_cache_idx_free(&cgroup_cache, smap->cache_idx); >> + bpf_local_storage_map_free(smap, NULL); >> +} >> + >> +/* *gfp_flags* is a hidden argument provided by the verifier */ >> +BPF_CALL_5(bpf_cgroup_storage_get, struct bpf_map *, map, struct >> cgroup *, cgroup, >> + void *, value, u64, flags, gfp_t, gfp_flags) >> +{ >> + struct bpf_local_storage_data *sdata; >> + >> + WARN_ON_ONCE(!bpf_rcu_lock_held()); >> + if (flags & ~(BPF_LOCAL_STORAGE_GET_F_CREATE)) >> + return (unsigned long)NULL; >> + >> + if (!cgroup) >> + return (unsigned long)NULL; >> + >> + if (!bpf_cgroup_storage_trylock()) >> + return (unsigned long)NULL; >> + >> + sdata = cgroup_storage_lookup(cgroup, map, true); >> + if (sdata) >> + goto unlock; >> + >> + /* only allocate new storage, when the cgroup is refcounted */ >> + if (!percpu_ref_is_dying(&cgroup->self.refcnt) && >> + (flags & BPF_LOCAL_STORAGE_GET_F_CREATE)) >> + sdata = bpf_local_storage_update(cgroup, (struct >> bpf_local_storage_map *)map, >> + value, BPF_NOEXIST, gfp_flags); >> + >> +unlock: >> + bpf_cgroup_storage_unlock(); >> + return IS_ERR_OR_NULL(sdata) ? 
(unsigned long)NULL : (unsigned >> long)sdata->data; >> +} >> + >> +BPF_CALL_2(bpf_cgroup_storage_delete, struct bpf_map *, map, struct >> cgroup *, cgroup) >> +{ >> + int ret; >> + >> + WARN_ON_ONCE(!bpf_rcu_lock_held()); >> + if (!cgroup) >> + return -EINVAL; >> + >> + if (!bpf_cgroup_storage_trylock()) >> + return -EBUSY; >> + >> + ret = cgroup_storage_delete(cgroup, map); >> + bpf_cgroup_storage_unlock(); >> + return ret; >> +} >> + >> +BTF_ID_LIST_SINGLE(cgroup_storage_map_btf_ids, struct, >> bpf_local_storage_map) >> +const struct bpf_map_ops cgroup_local_storage_map_ops = { >> + .map_meta_equal = bpf_map_meta_equal, >> + .map_alloc_check = bpf_local_storage_map_alloc_check, >> + .map_alloc = cgroup_storage_map_alloc, >> + .map_free = cgroup_storage_map_free, >> + .map_get_next_key = notsupp_get_next_key, >> + .map_lookup_elem = bpf_cgroup_storage_lookup_elem, >> + .map_update_elem = bpf_cgroup_storage_update_elem, >> + .map_delete_elem = bpf_cgroup_storage_delete_elem, >> + .map_check_btf = bpf_local_storage_map_check_btf, >> + .map_btf_id = &cgroup_storage_map_btf_ids[0], >> + .map_owner_storage_ptr = cgroup_storage_ptr, >> +}; >> + >> +const struct bpf_func_proto bpf_cgroup_storage_get_proto = { >> + .func = bpf_cgroup_storage_get, >> + .gpl_only = false, >> + .ret_type = RET_PTR_TO_MAP_VALUE_OR_NULL, >> + .arg1_type = ARG_CONST_MAP_PTR, >> + .arg2_type = ARG_PTR_TO_BTF_ID, >> + .arg2_btf_id = &bpf_cgroup_btf_id[0], >> + .arg3_type = ARG_PTR_TO_MAP_VALUE_OR_NULL, >> + .arg4_type = ARG_ANYTHING, >> +}; >> + >> +const struct bpf_func_proto bpf_cgroup_storage_delete_proto = { >> + .func = bpf_cgroup_storage_delete, >> + .gpl_only = false, >> + .ret_type = RET_INTEGER, >> + .arg1_type = ARG_CONST_MAP_PTR, >> + .arg2_type = ARG_PTR_TO_BTF_ID, >> + .arg2_btf_id = &bpf_cgroup_btf_id[0], >> +}; >> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c >> index a6b04faed282..5c5bb08832ec 100644 >> --- a/kernel/bpf/helpers.c >> +++ b/kernel/bpf/helpers.c >> @@ 
-1663,6 +1663,12 @@ bpf_base_func_proto(enum bpf_func_id func_id) >> return &bpf_dynptr_write_proto; >> case BPF_FUNC_dynptr_data: >> return &bpf_dynptr_data_proto; >> +#ifdef CONFIG_CGROUPS >> + case BPF_FUNC_cgroup_local_storage_get: >> + return &bpf_cgroup_storage_get_proto; >> + case BPF_FUNC_cgroup_local_storage_delete: >> + return &bpf_cgroup_storage_delete_proto; >> +#endif >> default: >> break; >> } >> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c >> index 7b373a5e861f..e53c7fae6e22 100644 >> --- a/kernel/bpf/syscall.c >> +++ b/kernel/bpf/syscall.c >> @@ -1016,7 +1016,8 @@ static int map_check_btf(struct bpf_map *map, >> const struct btf *btf, >> map->map_type != BPF_MAP_TYPE_CGROUP_STORAGE && >> map->map_type != BPF_MAP_TYPE_SK_STORAGE && >> map->map_type != BPF_MAP_TYPE_INODE_STORAGE && >> - map->map_type != BPF_MAP_TYPE_TASK_STORAGE) >> + map->map_type != BPF_MAP_TYPE_TASK_STORAGE && >> + map->map_type != BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE) >> return -ENOTSUPP; >> if (map->spin_lock_off + sizeof(struct bpf_spin_lock) > >> map->value_size) { >> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c >> index 6f6d2d511c06..f36f6a3c0d50 100644 >> --- a/kernel/bpf/verifier.c >> +++ b/kernel/bpf/verifier.c >> @@ -6360,6 +6360,11 @@ static int check_map_func_compatibility(struct >> bpf_verifier_env *env, >> func_id != BPF_FUNC_task_storage_delete) >> goto error; >> break; >> + case BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE: >> + if (func_id != BPF_FUNC_cgroup_local_storage_get && >> + func_id != BPF_FUNC_cgroup_local_storage_delete) >> + goto error; >> + break; >> case BPF_MAP_TYPE_BLOOM_FILTER: >> if (func_id != BPF_FUNC_map_peek_elem && >> func_id != BPF_FUNC_map_push_elem) >> @@ -6472,6 +6477,11 @@ static int check_map_func_compatibility(struct >> bpf_verifier_env *env, >> if (map->map_type != BPF_MAP_TYPE_TASK_STORAGE) >> goto error; >> break; >> + case BPF_FUNC_cgroup_local_storage_get: >> + case BPF_FUNC_cgroup_local_storage_delete: >> + if 
(map->map_type != BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE) >> + goto error; >> + break; >> default: >> break; >> } >> @@ -12713,6 +12723,7 @@ static int check_map_prog_compatibility(struct >> bpf_verifier_env *env, >> case BPF_MAP_TYPE_INODE_STORAGE: >> case BPF_MAP_TYPE_SK_STORAGE: >> case BPF_MAP_TYPE_TASK_STORAGE: >> + case BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE: >> break; >> default: >> verbose(env, >> @@ -14149,7 +14160,8 @@ static int do_misc_fixups(struct >> bpf_verifier_env *env) > >> if (insn->imm == BPF_FUNC_task_storage_get || >> insn->imm == BPF_FUNC_sk_storage_get || >> - insn->imm == BPF_FUNC_inode_storage_get) { >> + insn->imm == BPF_FUNC_inode_storage_get || >> + insn->imm == BPF_FUNC_cgroup_local_storage_get) { >> if (env->prog->aux->sleepable) >> insn_buf[0] = BPF_MOV64_IMM(BPF_REG_5, (__force >> __s32)GFP_KERNEL); >> else >> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c >> index 8ad2c267ff47..2fa2c950c7fb 100644 >> --- a/kernel/cgroup/cgroup.c >> +++ b/kernel/cgroup/cgroup.c >> @@ -985,6 +985,10 @@ void put_css_set_locked(struct css_set *cset) >> put_css_set_locked(cset->dom_cset); >> } > >> +#ifdef CONFIG_BPF_SYSCALL >> + bpf_local_cgroup_storage_free(cset->dfl_cgrp); >> +#endif >> + >> kfree_rcu(cset, rcu_head); >> } > >> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c >> index 688552df95ca..179adaae4a9f 100644 >> --- a/kernel/trace/bpf_trace.c >> +++ b/kernel/trace/bpf_trace.c >> @@ -1454,6 +1454,10 @@ bpf_tracing_func_proto(enum bpf_func_id >> func_id, const struct bpf_prog *prog) >> return &bpf_get_current_cgroup_id_proto; >> case BPF_FUNC_get_current_ancestor_cgroup_id: >> return &bpf_get_current_ancestor_cgroup_id_proto; >> + case BPF_FUNC_cgroup_local_storage_get: >> + return &bpf_cgroup_storage_get_proto; >> + case BPF_FUNC_cgroup_local_storage_delete: >> + return &bpf_cgroup_storage_delete_proto; >> #endif >> case BPF_FUNC_send_signal: >> return &bpf_send_signal_proto; >> diff --git a/scripts/bpf_doc.py 
b/scripts/bpf_doc.py >> index c0e6690be82a..fdb0aff8cb5a 100755 >> --- a/scripts/bpf_doc.py >> +++ b/scripts/bpf_doc.py >> @@ -685,6 +685,7 @@ class PrinterHelpers(Printer): >> 'struct udp6_sock', >> 'struct unix_sock', >> 'struct task_struct', >> + 'struct cgroup', > >> 'struct __sk_buff', >> 'struct sk_msg_md', >> @@ -742,6 +743,7 @@ class PrinterHelpers(Printer): >> 'struct udp6_sock', >> 'struct unix_sock', >> 'struct task_struct', >> + 'struct cgroup', >> 'struct path', >> 'struct btf_ptr', >> 'struct inode', >> diff --git a/tools/include/uapi/linux/bpf.h >> b/tools/include/uapi/linux/bpf.h >> index 17f61338f8f8..d918b4054297 100644 >> --- a/tools/include/uapi/linux/bpf.h >> +++ b/tools/include/uapi/linux/bpf.h >> @@ -935,6 +935,7 @@ enum bpf_map_type { >> BPF_MAP_TYPE_TASK_STORAGE, >> BPF_MAP_TYPE_BLOOM_FILTER, >> BPF_MAP_TYPE_USER_RINGBUF, >> + BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, >> }; > >> /* Note that tracing related programs such as >> @@ -5435,6 +5436,42 @@ union bpf_attr { >> * **-E2BIG** if user-space has tried to publish a sample >> which is >> * larger than the size of the ring buffer, or which cannot fit >> * within a struct bpf_dynptr. >> + * >> + * void *bpf_cgroup_local_storage_get(struct bpf_map *map, struct >> cgroup *cgroup, void *value, u64 flags) >> + * Description >> + * Get a bpf_local_storage from the *cgroup*. >> + * >> + * Logically, it could be thought of as getting the value from >> + * a *map* with *cgroup* as the **key**. From this >> + * perspective, the usage is not much different from >> + * **bpf_map_lookup_elem**\ (*map*, **&**\ *cgroup*) except this >> + * helper enforces the key must be a cgroup struct and the map >> must also >> + * be a **BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE**. >> + * >> + * Underneath, the value is stored locally at *cgroup* instead of >> + * the *map*. The *map* is used as the bpf-local-storage >> + * "type". The bpf-local-storage "type" (i.e. 
the *map*) is >> + * searched against all bpf_local_storage residing at *cgroup*. >> + * >> + * An optional *flags* (**BPF_LOCAL_STORAGE_GET_F_CREATE**) >> can be >> + * used such that a new bpf_local_storage will be >> + * created if one does not exist. *value* can be used >> + * together with **BPF_LOCAL_STORAGE_GET_F_CREATE** to specify >> + * the initial value of a bpf_local_storage. If *value* is >> + * **NULL**, the new bpf_local_storage will be zero initialized. >> + * Return >> + * A bpf_local_storage pointer is returned on success. >> + * >> + * **NULL** if not found or there was an error in adding >> + * a new bpf_local_storage. >> + * >> + * long bpf_cgroup_local_storage_delete(struct bpf_map *map, struct >> cgroup *cgroup) >> + * Description >> + * Delete a bpf_local_storage from a *cgroup*. >> + * Return >> + * 0 on success. >> + * >> + * **-ENOENT** if the bpf_local_storage cannot be found. >> */ >> #define ___BPF_FUNC_MAPPER(FN, ctx...) \ >> FN(unspec, 0, ##ctx) \ >> @@ -5647,6 +5684,8 @@ union bpf_attr { >> FN(tcp_raw_check_syncookie_ipv6, 207, ##ctx) \ >> FN(ktime_get_tai_ns, 208, ##ctx) \ >> FN(user_ringbuf_drain, 209, ##ctx) \ >> + FN(cgroup_local_storage_get, 210, ##ctx) \ >> + FN(cgroup_local_storage_delete, 211, ##ctx) \ >> /* */ > >> /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER >> that don't >> -- >> 2.30.2 > ^ permalink raw reply [flat|nested] 38+ messages in thread
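(For context on how the two new helpers documented above would be used from a program, here is a minimal BPF-side sketch pieced together from the uapi comments in this patch. The attach point, field names and map name are illustrative only, and the map type and helpers exist only with this series applied:)

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char _license[] SEC("license") = "GPL";

struct cgrp_val {
	__u64 cnt;
};

/* Proposed map type from patch 2: one storage per cgroup, found via
 * the map, like the existing sk/inode/task local storage maps.
 */
struct {
	__uint(type, BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__type(key, int);
	__type(value, struct cgrp_val);
} cgrp_map SEC(".maps");

SEC("tp_btf/sys_enter")
int BPF_PROG(count_syscalls)
{
	struct task_struct *task = bpf_get_current_task_btf();
	struct cgrp_val *v;

	/* Get (or create, per F_CREATE) this cgroup's storage for cgrp_map;
	 * NULL value means the new storage is zero-initialized.
	 */
	v = bpf_cgroup_local_storage_get(&cgrp_map, task->cgroups->dfl_cgrp,
					 NULL, BPF_LOCAL_STORAGE_GET_F_CREATE);
	if (v)
		__sync_fetch_and_add(&v->cnt, 1);
	return 0;
}
```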
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs 2022-10-17 19:23 ` Yonghong Song @ 2022-10-17 21:03 ` Stanislav Fomichev 0 siblings, 0 replies; 38+ messages in thread From: Stanislav Fomichev @ 2022-10-17 21:03 UTC (permalink / raw) To: Yonghong Song Cc: Yonghong Song, bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo On Mon, Oct 17, 2022 at 12:25 PM Yonghong Song <yhs@meta.com> wrote: > > > > On 10/17/22 11:01 AM, sdf@google.com wrote: > > On 10/13, Yonghong Song wrote: > >> Similar to sk/inode/task storage, implement similar cgroup local storage. > > > >> There already exists a local storage implementation for cgroup-attached > >> bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper > >> bpf_get_local_storage(). But there are use cases such that non-cgroup > >> attached bpf progs wants to access cgroup local storage data. For > >> example, > >> tc egress prog has access to sk and cgroup. It is possible to use > >> sk local storage to emulate cgroup local storage by storing data in > >> socket. > >> But this is a waste as it could be lots of sockets belonging to a > >> particular > >> cgroup. Alternatively, a separate map can be created with cgroup id as > >> the key. > >> But this will introduce additional overhead to manipulate the new map. > >> A cgroup local storage, similar to existing sk/inode/task storage, > >> should help for this use case. > > > >> The life-cycle of storage is managed with the life-cycle of the > >> cgroup struct. i.e. the storage is destroyed along with the owning > >> cgroup > >> with a callback to the bpf_cgroup_storage_free when cgroup itself > >> is deleted. > > > >> The userspace map operations can be done by using a cgroup fd as a key > >> passed to the lookup, update and delete operations. > > > > > > [..] 
> > > >> Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old > >> cgroup local > >> storage support, the new map name BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE is > >> used > >> for cgroup storage available to non-cgroup-attached bpf programs. The two > >> helpers are named as bpf_cgroup_local_storage_get() and > >> bpf_cgroup_local_storage_delete(). > > > > Have you considered doing something similar to 7d9c3427894f ("bpf: Make > > cgroup storages shared between programs on the same cgroup") where > > the map changes its behavior depending on the key size (see key_size checks > > in cgroup_storage_map_alloc)? Looks like sizeof(int) for fd still > > can be used so we can, in theory, reuse the name.. > > > > Pros: > > - no need for a new map name > > > > Cons: > > - existing BPF_MAP_TYPE_CGROUP_STORAGE is already messy; might be not a > > good idea to add more stuff to it? > > Thinking differently. I think I would have reuse the same map name > (BPF_MAP_TYPE_CGROUP_STORAGE) but with a flag like > BPF_F_LOCAL_STORAGE_GENERIC). > > We could use map_extra as well, but I think an explicit flag might be > better. Ack, flag and map_extra might work as well. They are more explicit, which is good/bad depending on who you talk to. I was assuming that we can just support the following: struct { __uint(type, BPF_MAP_TYPE_CGROUP_STORAGE); __type(key, int); __type(value, xxx); } ...; and depend on key_size == sizeof(int), but up to you; just trying to understand whether it makes sense to share the name or not. Sharing the helper probably not worth it given the special treatment? Or maybe it can be a shortcut to "lookup this map with my cgroup"? > > > > But, for the very least, should we also extend > > Documentation/bpf/map_cgroup_storage.rst to cover the new map? We've > > tried to keep some of the important details in there.. 
> > > >> Signed-off-by: Yonghong Song <yhs@fb.com> > >> --- > >> include/linux/bpf.h | 3 + > >> include/linux/bpf_types.h | 1 + > >> include/linux/cgroup-defs.h | 4 + > >> include/uapi/linux/bpf.h | 39 +++++ > >> kernel/bpf/Makefile | 2 +- > >> kernel/bpf/bpf_cgroup_storage.c | 280 ++++++++++++++++++++++++++++++++ > >> kernel/bpf/helpers.c | 6 + > >> kernel/bpf/syscall.c | 3 +- > >> kernel/bpf/verifier.c | 14 +- > >> kernel/cgroup/cgroup.c | 4 + > >> kernel/trace/bpf_trace.c | 4 + > >> scripts/bpf_doc.py | 2 + > >> tools/include/uapi/linux/bpf.h | 39 +++++ > >> 13 files changed, 398 insertions(+), 3 deletions(-) > >> create mode 100644 kernel/bpf/bpf_cgroup_storage.c > > > >> diff --git a/include/linux/bpf.h b/include/linux/bpf.h > >> index 9e7d46d16032..1395a01c7f18 100644 > >> --- a/include/linux/bpf.h > >> +++ b/include/linux/bpf.h > >> @@ -2045,6 +2045,7 @@ struct bpf_link *bpf_link_by_id(u32 id); > > > >> const struct bpf_func_proto *bpf_base_func_proto(enum bpf_func_id > >> func_id); > >> void bpf_task_storage_free(struct task_struct *task); > >> +void bpf_local_cgroup_storage_free(struct cgroup *cgroup); > >> bool bpf_prog_has_kfunc_call(const struct bpf_prog *prog); > >> const struct btf_func_model * > >> bpf_jit_find_kfunc_model(const struct bpf_prog *prog, > >> @@ -2537,6 +2538,8 @@ extern const struct bpf_func_proto > >> bpf_copy_from_user_task_proto; > >> extern const struct bpf_func_proto bpf_set_retval_proto; > >> extern const struct bpf_func_proto bpf_get_retval_proto; > >> extern const struct bpf_func_proto bpf_user_ringbuf_drain_proto; > >> +extern const struct bpf_func_proto bpf_cgroup_storage_get_proto; > >> +extern const struct bpf_func_proto bpf_cgroup_storage_delete_proto; > > > >> const struct bpf_func_proto *tracing_prog_func_proto( > >> enum bpf_func_id func_id, const struct bpf_prog *prog); > >> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h > >> index 2c6a4f2562a7..7a0362d7a0aa 100644 > >> --- 
a/include/linux/bpf_types.h > >> +++ b/include/linux/bpf_types.h > >> @@ -90,6 +90,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_ARRAY, > >> cgroup_array_map_ops) > >> #ifdef CONFIG_CGROUP_BPF > >> BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_STORAGE, cgroup_storage_map_ops) > >> BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE, > >> cgroup_storage_map_ops) > >> +BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, > >> cgroup_local_storage_map_ops) > >> #endif > >> BPF_MAP_TYPE(BPF_MAP_TYPE_HASH, htab_map_ops) > >> BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_HASH, htab_percpu_map_ops) > >> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h > >> index 4bcf56b3491c..c6f4590dda68 100644 > >> --- a/include/linux/cgroup-defs.h > >> +++ b/include/linux/cgroup-defs.h > >> @@ -504,6 +504,10 @@ struct cgroup { > >> /* Used to store internal freezer state */ > >> struct cgroup_freezer_state freezer; > > > >> +#ifdef CONFIG_BPF_SYSCALL > >> + struct bpf_local_storage __rcu *bpf_cgroup_storage; > >> +#endif > >> + > >> /* ids of the ancestors at each level including self */ > >> u64 ancestor_ids[]; > >> }; > >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > >> index 17f61338f8f8..d918b4054297 100644 > >> --- a/include/uapi/linux/bpf.h > >> +++ b/include/uapi/linux/bpf.h > >> @@ -935,6 +935,7 @@ enum bpf_map_type { > >> BPF_MAP_TYPE_TASK_STORAGE, > >> BPF_MAP_TYPE_BLOOM_FILTER, > >> BPF_MAP_TYPE_USER_RINGBUF, > >> + BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, > >> }; > > > >> /* Note that tracing related programs such as > >> @@ -5435,6 +5436,42 @@ union bpf_attr { > >> * **-E2BIG** if user-space has tried to publish a sample > >> which is > >> * larger than the size of the ring buffer, or which cannot fit > >> * within a struct bpf_dynptr. > >> + * > >> + * void *bpf_cgroup_local_storage_get(struct bpf_map *map, struct > >> cgroup *cgroup, void *value, u64 flags) > >> + * Description > >> + * Get a bpf_local_storage from the *cgroup*. 
> >> + * > >> + * Logically, it could be thought of as getting the value from > >> + * a *map* with *cgroup* as the **key**. From this > >> + * perspective, the usage is not much different from > >> + * **bpf_map_lookup_elem**\ (*map*, **&**\ *cgroup*) except this > >> + * helper enforces the key must be a cgroup struct and the map > >> must also > >> + * be a **BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE**. > >> + * > >> + * Underneath, the value is stored locally at *cgroup* instead of > >> + * the *map*. The *map* is used as the bpf-local-storage > >> + * "type". The bpf-local-storage "type" (i.e. the *map*) is > >> + * searched against all bpf_local_storage residing at *cgroup*. > >> + * > >> + * An optional *flags* (**BPF_LOCAL_STORAGE_GET_F_CREATE**) > >> can be > >> + * used such that a new bpf_local_storage will be > >> + * created if one does not exist. *value* can be used > >> + * together with **BPF_LOCAL_STORAGE_GET_F_CREATE** to specify > >> + * the initial value of a bpf_local_storage. If *value* is > >> + * **NULL**, the new bpf_local_storage will be zero initialized. > >> + * Return > >> + * A bpf_local_storage pointer is returned on success. > >> + * > >> + * **NULL** if not found or there was an error in adding > >> + * a new bpf_local_storage. > >> + * > >> + * long bpf_cgroup_local_storage_delete(struct bpf_map *map, struct > >> cgroup *cgroup) > >> + * Description > >> + * Delete a bpf_local_storage from a *cgroup*. > >> + * Return > >> + * 0 on success. > >> + * > >> + * **-ENOENT** if the bpf_local_storage cannot be found. > >> */ > >> #define ___BPF_FUNC_MAPPER(FN, ctx...) 
\ > >> FN(unspec, 0, ##ctx) \ > >> @@ -5647,6 +5684,8 @@ union bpf_attr { > >> FN(tcp_raw_check_syncookie_ipv6, 207, ##ctx) \ > >> FN(ktime_get_tai_ns, 208, ##ctx) \ > >> FN(user_ringbuf_drain, 209, ##ctx) \ > >> + FN(cgroup_local_storage_get, 210, ##ctx) \ > >> + FN(cgroup_local_storage_delete, 211, ##ctx) \ > >> /* */ > > > >> /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER > >> that don't > >> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile > >> index 341c94f208f4..b02693f51978 100644 > >> --- a/kernel/bpf/Makefile > >> +++ b/kernel/bpf/Makefile > >> @@ -25,7 +25,7 @@ ifeq ($(CONFIG_PERF_EVENTS),y) > >> obj-$(CONFIG_BPF_SYSCALL) += stackmap.o > >> endif > >> ifeq ($(CONFIG_CGROUPS),y) > >> -obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o > >> +obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o bpf_cgroup_storage.o > >> endif > >> obj-$(CONFIG_CGROUP_BPF) += cgroup.o > >> ifeq ($(CONFIG_INET),y) > >> diff --git a/kernel/bpf/bpf_cgroup_storage.c > >> b/kernel/bpf/bpf_cgroup_storage.c > >> new file mode 100644 > >> index 000000000000..9974784822da > >> --- /dev/null > >> +++ b/kernel/bpf/bpf_cgroup_storage.c > >> @@ -0,0 +1,280 @@ > >> +// SPDX-License-Identifier: GPL-2.0 > >> +/* > >> + * Copyright (c) 2022 Meta Platforms, Inc. and affiliates. 
> >> + */ > >> + > >> +#include <linux/types.h> > >> +#include <linux/bpf.h> > >> +#include <linux/bpf_local_storage.h> > >> +#include <uapi/linux/btf.h> > >> +#include <linux/btf_ids.h> > >> + > >> +DEFINE_BPF_STORAGE_CACHE(cgroup_cache); > >> + > >> +static DEFINE_PER_CPU(int, bpf_cgroup_storage_busy); > >> + > >> +static void bpf_cgroup_storage_lock(void) > >> +{ > >> + migrate_disable(); > >> + this_cpu_inc(bpf_cgroup_storage_busy); > >> +} > >> + > >> +static void bpf_cgroup_storage_unlock(void) > >> +{ > >> + this_cpu_dec(bpf_cgroup_storage_busy); > >> + migrate_enable(); > >> +} > >> + > >> +static bool bpf_cgroup_storage_trylock(void) > >> +{ > >> + migrate_disable(); > >> + if (unlikely(this_cpu_inc_return(bpf_cgroup_storage_busy) != 1)) { > >> + this_cpu_dec(bpf_cgroup_storage_busy); > >> + migrate_enable(); > >> + return false; > >> + } > >> + return true; > >> +} > > > > Task storage has lock/unlock/trylock; inode storage doesn't; why does > > cgroup need it as well? > > > >> +static struct bpf_local_storage __rcu **cgroup_storage_ptr(void *owner) > >> +{ > >> + struct cgroup *cg = owner; > >> + > >> + return &cg->bpf_cgroup_storage; > >> +} > >> + > >> +void bpf_local_cgroup_storage_free(struct cgroup *cgroup) > >> +{ > >> + struct bpf_local_storage *local_storage; > >> + struct bpf_local_storage_elem *selem; > >> + bool free_cgroup_storage = false; > >> + struct hlist_node *n; > >> + unsigned long flags; > >> + > >> + rcu_read_lock(); > >> + local_storage = rcu_dereference(cgroup->bpf_cgroup_storage); > >> + if (!local_storage) { > >> + rcu_read_unlock(); > >> + return; > >> + } > >> + > >> + /* Neither the bpf_prog nor the bpf-map's syscall > >> + * could be modifying the local_storage->list now. > >> + * Thus, no elem can be added-to or deleted-from the > >> + * local_storage->list by the bpf_prog or by the bpf-map's syscall. 
> >> + * > >> + * It is racing with bpf_local_storage_map_free() alone > >> + * when unlinking elem from the local_storage->list and > >> + * the map's bucket->list. > >> + */ > >> + bpf_cgroup_storage_lock(); > >> + raw_spin_lock_irqsave(&local_storage->lock, flags); > >> + hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) { > >> + bpf_selem_unlink_map(selem); > >> + free_cgroup_storage = > >> + bpf_selem_unlink_storage_nolock(local_storage, selem, > >> false, false); > >> + } > >> + raw_spin_unlock_irqrestore(&local_storage->lock, flags); > >> + bpf_cgroup_storage_unlock(); > >> + rcu_read_unlock(); > >> + > >> + /* free_cgroup_storage should always be true as long as > >> + * local_storage->list was non-empty. > >> + */ > >> + if (free_cgroup_storage) > >> + kfree_rcu(local_storage, rcu); > >> +} > > > >> +static struct bpf_local_storage_data * > >> +cgroup_storage_lookup(struct cgroup *cgroup, struct bpf_map *map, > >> bool cacheit_lockit) > >> +{ > >> + struct bpf_local_storage *cgroup_storage; > >> + struct bpf_local_storage_map *smap; > >> + > >> + cgroup_storage = rcu_dereference_check(cgroup->bpf_cgroup_storage, > >> + bpf_rcu_lock_held()); > >> + if (!cgroup_storage) > >> + return NULL; > >> + > >> + smap = (struct bpf_local_storage_map *)map; > >> + return bpf_local_storage_lookup(cgroup_storage, smap, > >> cacheit_lockit); > >> +} > >> + > >> +static void *bpf_cgroup_storage_lookup_elem(struct bpf_map *map, void > >> *key) > >> +{ > >> + struct bpf_local_storage_data *sdata; > >> + struct cgroup *cgroup; > >> + int fd; > >> + > >> + fd = *(int *)key; > >> + cgroup = cgroup_get_from_fd(fd); > >> + if (IS_ERR(cgroup)) > >> + return ERR_CAST(cgroup); > >> + > >> + bpf_cgroup_storage_lock(); > >> + sdata = cgroup_storage_lookup(cgroup, map, true); > >> + bpf_cgroup_storage_unlock(); > >> + cgroup_put(cgroup); > >> + return sdata ? 
sdata->data : NULL; > >> +} > > > > A lot of the above (free/lookup) seems to be copy-pasted from the task > > storage; > > any point in trying to generalize the common parts? > > > >> +static int bpf_cgroup_storage_update_elem(struct bpf_map *map, void > >> *key, > >> + void *value, u64 map_flags) > >> +{ > >> + struct bpf_local_storage_data *sdata; > >> + struct cgroup *cgroup; > >> + int err, fd; > >> + > >> + fd = *(int *)key; > >> + cgroup = cgroup_get_from_fd(fd); > >> + if (IS_ERR(cgroup)) > >> + return PTR_ERR(cgroup); > >> + > >> + bpf_cgroup_storage_lock(); > >> + sdata = bpf_local_storage_update(cgroup, (struct > >> bpf_local_storage_map *)map, > >> + value, map_flags, GFP_ATOMIC); > >> + bpf_cgroup_storage_unlock(); > >> + err = PTR_ERR_OR_ZERO(sdata); > >> + cgroup_put(cgroup); > >> + return err; > >> +} > >> + > >> +static int cgroup_storage_delete(struct cgroup *cgroup, struct > >> bpf_map *map) > >> +{ > >> + struct bpf_local_storage_data *sdata; > >> + > >> + sdata = cgroup_storage_lookup(cgroup, map, false); > >> + if (!sdata) > >> + return -ENOENT; > >> + > >> + bpf_selem_unlink(SELEM(sdata), true); > >> + return 0; > >> +} > >> + > >> +static int bpf_cgroup_storage_delete_elem(struct bpf_map *map, void > >> *key) > >> +{ > >> + struct cgroup *cgroup; > >> + int err, fd; > >> + > >> + fd = *(int *)key; > >> + cgroup = cgroup_get_from_fd(fd); > >> + if (IS_ERR(cgroup)) > >> + return PTR_ERR(cgroup); > >> + > >> + bpf_cgroup_storage_lock(); > >> + err = cgroup_storage_delete(cgroup, map); > >> + bpf_cgroup_storage_unlock(); > >> + if (err) > >> + return err; > >> + > >> + cgroup_put(cgroup); > >> + return 0; > >> +} > >> + > >> +static int notsupp_get_next_key(struct bpf_map *map, void *key, void > >> *next_key) > >> +{ > >> + return -ENOTSUPP; > >> +} > >> + > >> +static struct bpf_map *cgroup_storage_map_alloc(union bpf_attr *attr) > >> +{ > >> + struct bpf_local_storage_map *smap; > >> + > >> + smap = bpf_local_storage_map_alloc(attr); > >> + if 
(IS_ERR(smap)) > >> + return ERR_CAST(smap); > >> + > >> + smap->cache_idx = bpf_local_storage_cache_idx_get(&cgroup_cache); > >> + return &smap->map; > >> +} > >> + > >> +static void cgroup_storage_map_free(struct bpf_map *map) > >> +{ > >> + struct bpf_local_storage_map *smap; > >> + > >> + smap = (struct bpf_local_storage_map *)map; > >> + bpf_local_storage_cache_idx_free(&cgroup_cache, smap->cache_idx); > >> + bpf_local_storage_map_free(smap, NULL); > >> +} > >> + > >> +/* *gfp_flags* is a hidden argument provided by the verifier */ > >> +BPF_CALL_5(bpf_cgroup_storage_get, struct bpf_map *, map, struct > >> cgroup *, cgroup, > >> + void *, value, u64, flags, gfp_t, gfp_flags) > >> +{ > >> + struct bpf_local_storage_data *sdata; > >> + > >> + WARN_ON_ONCE(!bpf_rcu_lock_held()); > >> + if (flags & ~(BPF_LOCAL_STORAGE_GET_F_CREATE)) > >> + return (unsigned long)NULL; > >> + > >> + if (!cgroup) > >> + return (unsigned long)NULL; > >> + > >> + if (!bpf_cgroup_storage_trylock()) > >> + return (unsigned long)NULL; > >> + > >> + sdata = cgroup_storage_lookup(cgroup, map, true); > >> + if (sdata) > >> + goto unlock; > >> + > >> + /* only allocate new storage, when the cgroup is refcounted */ > >> + if (!percpu_ref_is_dying(&cgroup->self.refcnt) && > >> + (flags & BPF_LOCAL_STORAGE_GET_F_CREATE)) > >> + sdata = bpf_local_storage_update(cgroup, (struct > >> bpf_local_storage_map *)map, > >> + value, BPF_NOEXIST, gfp_flags); > >> + > >> +unlock: > >> + bpf_cgroup_storage_unlock(); > >> + return IS_ERR_OR_NULL(sdata) ? 
(unsigned long)NULL : (unsigned > >> long)sdata->data; > >> +} > >> + > >> +BPF_CALL_2(bpf_cgroup_storage_delete, struct bpf_map *, map, struct > >> cgroup *, cgroup) > >> +{ > >> + int ret; > >> + > >> + WARN_ON_ONCE(!bpf_rcu_lock_held()); > >> + if (!cgroup) > >> + return -EINVAL; > >> + > >> + if (!bpf_cgroup_storage_trylock()) > >> + return -EBUSY; > >> + > >> + ret = cgroup_storage_delete(cgroup, map); > >> + bpf_cgroup_storage_unlock(); > >> + return ret; > >> +} > >> + > >> +BTF_ID_LIST_SINGLE(cgroup_storage_map_btf_ids, struct, > >> bpf_local_storage_map) > >> +const struct bpf_map_ops cgroup_local_storage_map_ops = { > >> + .map_meta_equal = bpf_map_meta_equal, > >> + .map_alloc_check = bpf_local_storage_map_alloc_check, > >> + .map_alloc = cgroup_storage_map_alloc, > >> + .map_free = cgroup_storage_map_free, > >> + .map_get_next_key = notsupp_get_next_key, > >> + .map_lookup_elem = bpf_cgroup_storage_lookup_elem, > >> + .map_update_elem = bpf_cgroup_storage_update_elem, > >> + .map_delete_elem = bpf_cgroup_storage_delete_elem, > >> + .map_check_btf = bpf_local_storage_map_check_btf, > >> + .map_btf_id = &cgroup_storage_map_btf_ids[0], > >> + .map_owner_storage_ptr = cgroup_storage_ptr, > >> +}; > >> + > >> +const struct bpf_func_proto bpf_cgroup_storage_get_proto = { > >> + .func = bpf_cgroup_storage_get, > >> + .gpl_only = false, > >> + .ret_type = RET_PTR_TO_MAP_VALUE_OR_NULL, > >> + .arg1_type = ARG_CONST_MAP_PTR, > >> + .arg2_type = ARG_PTR_TO_BTF_ID, > >> + .arg2_btf_id = &bpf_cgroup_btf_id[0], > >> + .arg3_type = ARG_PTR_TO_MAP_VALUE_OR_NULL, > >> + .arg4_type = ARG_ANYTHING, > >> +}; > >> + > >> +const struct bpf_func_proto bpf_cgroup_storage_delete_proto = { > >> + .func = bpf_cgroup_storage_delete, > >> + .gpl_only = false, > >> + .ret_type = RET_INTEGER, > >> + .arg1_type = ARG_CONST_MAP_PTR, > >> + .arg2_type = ARG_PTR_TO_BTF_ID, > >> + .arg2_btf_id = &bpf_cgroup_btf_id[0], > >> +}; > >> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c 
> >> index a6b04faed282..5c5bb08832ec 100644 > >> --- a/kernel/bpf/helpers.c > >> +++ b/kernel/bpf/helpers.c > >> @@ -1663,6 +1663,12 @@ bpf_base_func_proto(enum bpf_func_id func_id) > >> return &bpf_dynptr_write_proto; > >> case BPF_FUNC_dynptr_data: > >> return &bpf_dynptr_data_proto; > >> +#ifdef CONFIG_CGROUPS > >> + case BPF_FUNC_cgroup_local_storage_get: > >> + return &bpf_cgroup_storage_get_proto; > >> + case BPF_FUNC_cgroup_local_storage_delete: > >> + return &bpf_cgroup_storage_delete_proto; > >> +#endif > >> default: > >> break; > >> } > >> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c > >> index 7b373a5e861f..e53c7fae6e22 100644 > >> --- a/kernel/bpf/syscall.c > >> +++ b/kernel/bpf/syscall.c > >> @@ -1016,7 +1016,8 @@ static int map_check_btf(struct bpf_map *map, > >> const struct btf *btf, > >> map->map_type != BPF_MAP_TYPE_CGROUP_STORAGE && > >> map->map_type != BPF_MAP_TYPE_SK_STORAGE && > >> map->map_type != BPF_MAP_TYPE_INODE_STORAGE && > >> - map->map_type != BPF_MAP_TYPE_TASK_STORAGE) > >> + map->map_type != BPF_MAP_TYPE_TASK_STORAGE && > >> + map->map_type != BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE) > >> return -ENOTSUPP; > >> if (map->spin_lock_off + sizeof(struct bpf_spin_lock) > > >> map->value_size) { > >> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c > >> index 6f6d2d511c06..f36f6a3c0d50 100644 > >> --- a/kernel/bpf/verifier.c > >> +++ b/kernel/bpf/verifier.c > >> @@ -6360,6 +6360,11 @@ static int check_map_func_compatibility(struct > >> bpf_verifier_env *env, > >> func_id != BPF_FUNC_task_storage_delete) > >> goto error; > >> break; > >> + case BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE: > >> + if (func_id != BPF_FUNC_cgroup_local_storage_get && > >> + func_id != BPF_FUNC_cgroup_local_storage_delete) > >> + goto error; > >> + break; > >> case BPF_MAP_TYPE_BLOOM_FILTER: > >> if (func_id != BPF_FUNC_map_peek_elem && > >> func_id != BPF_FUNC_map_push_elem) > >> @@ -6472,6 +6477,11 @@ static int check_map_func_compatibility(struct > >> 
bpf_verifier_env *env, > >> if (map->map_type != BPF_MAP_TYPE_TASK_STORAGE) > >> goto error; > >> break; > >> + case BPF_FUNC_cgroup_local_storage_get: > >> + case BPF_FUNC_cgroup_local_storage_delete: > >> + if (map->map_type != BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE) > >> + goto error; > >> + break; > >> default: > >> break; > >> } > >> @@ -12713,6 +12723,7 @@ static int check_map_prog_compatibility(struct > >> bpf_verifier_env *env, > >> case BPF_MAP_TYPE_INODE_STORAGE: > >> case BPF_MAP_TYPE_SK_STORAGE: > >> case BPF_MAP_TYPE_TASK_STORAGE: > >> + case BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE: > >> break; > >> default: > >> verbose(env, > >> @@ -14149,7 +14160,8 @@ static int do_misc_fixups(struct > >> bpf_verifier_env *env) > > > >> if (insn->imm == BPF_FUNC_task_storage_get || > >> insn->imm == BPF_FUNC_sk_storage_get || > >> - insn->imm == BPF_FUNC_inode_storage_get) { > >> + insn->imm == BPF_FUNC_inode_storage_get || > >> + insn->imm == BPF_FUNC_cgroup_local_storage_get) { > >> if (env->prog->aux->sleepable) > >> insn_buf[0] = BPF_MOV64_IMM(BPF_REG_5, (__force > >> __s32)GFP_KERNEL); > >> else > >> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c > >> index 8ad2c267ff47..2fa2c950c7fb 100644 > >> --- a/kernel/cgroup/cgroup.c > >> +++ b/kernel/cgroup/cgroup.c > >> @@ -985,6 +985,10 @@ void put_css_set_locked(struct css_set *cset) > >> put_css_set_locked(cset->dom_cset); > >> } > > > >> +#ifdef CONFIG_BPF_SYSCALL > >> + bpf_local_cgroup_storage_free(cset->dfl_cgrp); > >> +#endif > >> + > >> kfree_rcu(cset, rcu_head); > >> } > > > >> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c > >> index 688552df95ca..179adaae4a9f 100644 > >> --- a/kernel/trace/bpf_trace.c > >> +++ b/kernel/trace/bpf_trace.c > >> @@ -1454,6 +1454,10 @@ bpf_tracing_func_proto(enum bpf_func_id > >> func_id, const struct bpf_prog *prog) > >> return &bpf_get_current_cgroup_id_proto; > >> case BPF_FUNC_get_current_ancestor_cgroup_id: > >> return 
&bpf_get_current_ancestor_cgroup_id_proto; > >> + case BPF_FUNC_cgroup_local_storage_get: > >> + return &bpf_cgroup_storage_get_proto; > >> + case BPF_FUNC_cgroup_local_storage_delete: > >> + return &bpf_cgroup_storage_delete_proto; > >> #endif > >> case BPF_FUNC_send_signal: > >> return &bpf_send_signal_proto; > >> diff --git a/scripts/bpf_doc.py b/scripts/bpf_doc.py > >> index c0e6690be82a..fdb0aff8cb5a 100755 > >> --- a/scripts/bpf_doc.py > >> +++ b/scripts/bpf_doc.py > >> @@ -685,6 +685,7 @@ class PrinterHelpers(Printer): > >> 'struct udp6_sock', > >> 'struct unix_sock', > >> 'struct task_struct', > >> + 'struct cgroup', > > > >> 'struct __sk_buff', > >> 'struct sk_msg_md', > >> @@ -742,6 +743,7 @@ class PrinterHelpers(Printer): > >> 'struct udp6_sock', > >> 'struct unix_sock', > >> 'struct task_struct', > >> + 'struct cgroup', > >> 'struct path', > >> 'struct btf_ptr', > >> 'struct inode', > >> diff --git a/tools/include/uapi/linux/bpf.h > >> b/tools/include/uapi/linux/bpf.h > >> index 17f61338f8f8..d918b4054297 100644 > >> --- a/tools/include/uapi/linux/bpf.h > >> +++ b/tools/include/uapi/linux/bpf.h > >> @@ -935,6 +935,7 @@ enum bpf_map_type { > >> BPF_MAP_TYPE_TASK_STORAGE, > >> BPF_MAP_TYPE_BLOOM_FILTER, > >> BPF_MAP_TYPE_USER_RINGBUF, > >> + BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, > >> }; > > > >> /* Note that tracing related programs such as > >> @@ -5435,6 +5436,42 @@ union bpf_attr { > >> * **-E2BIG** if user-space has tried to publish a sample > >> which is > >> * larger than the size of the ring buffer, or which cannot fit > >> * within a struct bpf_dynptr. > >> + * > >> + * void *bpf_cgroup_local_storage_get(struct bpf_map *map, struct > >> cgroup *cgroup, void *value, u64 flags) > >> + * Description > >> + * Get a bpf_local_storage from the *cgroup*. > >> + * > >> + * Logically, it could be thought of as getting the value from > >> + * a *map* with *cgroup* as the **key**. 
From this > >> + * perspective, the usage is not much different from > >> + * **bpf_map_lookup_elem**\ (*map*, **&**\ *cgroup*) except this > >> + * helper enforces the key must be a cgroup struct and the map > >> must also > >> + * be a **BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE**. > >> + * > >> + * Underneath, the value is stored locally at *cgroup* instead of > >> + * the *map*. The *map* is used as the bpf-local-storage > >> + * "type". The bpf-local-storage "type" (i.e. the *map*) is > >> + * searched against all bpf_local_storage residing at *cgroup*. > >> + * > >> + * An optional *flags* (**BPF_LOCAL_STORAGE_GET_F_CREATE**) > >> can be > >> + * used such that a new bpf_local_storage will be > >> + * created if one does not exist. *value* can be used > >> + * together with **BPF_LOCAL_STORAGE_GET_F_CREATE** to specify > >> + * the initial value of a bpf_local_storage. If *value* is > >> + * **NULL**, the new bpf_local_storage will be zero initialized. > >> + * Return > >> + * A bpf_local_storage pointer is returned on success. > >> + * > >> + * **NULL** if not found or there was an error in adding > >> + * a new bpf_local_storage. > >> + * > >> + * long bpf_cgroup_local_storage_delete(struct bpf_map *map, struct > >> cgroup *cgroup) > >> + * Description > >> + * Delete a bpf_local_storage from a *cgroup*. > >> + * Return > >> + * 0 on success. > >> + * > >> + * **-ENOENT** if the bpf_local_storage cannot be found. > >> */ > >> #define ___BPF_FUNC_MAPPER(FN, ctx...) \ > >> FN(unspec, 0, ##ctx) \ > >> @@ -5647,6 +5684,8 @@ union bpf_attr { > >> FN(tcp_raw_check_syncookie_ipv6, 207, ##ctx) \ > >> FN(ktime_get_tai_ns, 208, ##ctx) \ > >> FN(user_ringbuf_drain, 209, ##ctx) \ > >> + FN(cgroup_local_storage_get, 210, ##ctx) \ > >> + FN(cgroup_local_storage_delete, 211, ##ctx) \ > >> /* */ > > > >> /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER > >> that don't > >> -- > >> 2.30.2 > > ^ permalink raw reply [flat|nested] 38+ messages in thread
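(Aside, for the "cgroup fd as a key" user-space operations mentioned in the commit message, a rough libbpf sketch — the path and value layout are hypothetical, and this only works against a kernel with this series applied:)

```c
#include <fcntl.h>
#include <unistd.h>
#include <bpf/bpf.h>

/* The syscall-side key for this map type is a cgroup *fd*, not a
 * cgroup id: open the cgroup directory and pass the fd as the key.
 */
static int read_cgroup_counter(int map_fd, const char *cgrp_path, __u64 *cnt)
{
	struct { __u64 cnt; } val;	/* must match the map's value type */
	int cg_fd, err;

	cg_fd = open(cgrp_path, O_RDONLY | O_DIRECTORY);
	if (cg_fd < 0)
		return -1;

	err = bpf_map_lookup_elem(map_fd, &cg_fd, &val);
	if (!err)
		*cnt = val.cnt;
	close(cg_fd);	/* map does not keep the fd; safe to close */
	return err;
}
```

(bpf_map_update_elem() and bpf_map_delete_elem() take the same fd-valued key; get_next_key is -ENOTSUPP per the patch.)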
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs 2022-10-17 18:01 ` sdf 2022-10-17 18:25 ` Yosry Ahmed 2022-10-17 19:23 ` Yonghong Song @ 2022-10-17 22:26 ` Martin KaFai Lau 2 siblings, 0 replies; 38+ messages in thread From: Martin KaFai Lau @ 2022-10-17 22:26 UTC (permalink / raw) To: sdf Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo, Yonghong Song On 10/17/22 11:01 AM, sdf@google.com wrote: >> +static bool bpf_cgroup_storage_trylock(void) >> +{ >> + migrate_disable(); >> + if (unlikely(this_cpu_inc_return(bpf_cgroup_storage_busy) != 1)) { >> + this_cpu_dec(bpf_cgroup_storage_busy); >> + migrate_enable(); >> + return false; >> + } >> + return true; >> +} > > Task storage has lock/unlock/trylock; inode storage doesn't; why does > cgroup need it as well? This was added in bc235cdb423a2 to avoid deadlock for tracing program which can get a hold to the same task ptr easily with bpf_get_current_task_btf(). I believe there was no known way to hit this problem in inode storage, so inode storage does not use it. The common tracing use case to get a hold of the cgroup ptr is through task (including bpf_get_current_task_btf()), so it seems to make sense to mimic the trylock here. I have plan to relax it for all non-tracing programs like cgroup-bpf and bpf-lsm. ^ permalink raw reply [flat|nested] 38+ messages in thread
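(To make the recursion scenario above concrete, a hypothetical chain that the per-cpu busy counter is guarding against — not real code, just the shape of the deadlock from bc235cdb423a:)

```c
/* CPU 0, without bpf_cgroup_storage_trylock():
 *
 *   bpf_map_delete_elem(cgrp_map, &cg_fd)      syscall path takes
 *     -> raw_spin_lock(&local_storage->lock)   the storage lock
 *       -> a tracing prog attached to a        fentry/kprobe fires on a
 *          function called under that lock     function in the locked region
 *         -> bpf_cgroup_local_storage_get()    re-enters storage code and
 *           -> raw_spin_lock(&...->lock)       spins on the same lock: deadlock
 *
 * With the trylock, the nested entry sees bpf_cgroup_storage_busy != 1
 * and returns NULL/-EBUSY instead of taking the lock again.
 */
```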
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs 2022-10-14 4:56 ` [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs Yonghong Song 2022-10-17 18:01 ` sdf @ 2022-10-17 18:16 ` David Vernet 2022-10-17 19:45 ` Yonghong Song 1 sibling, 1 reply; 38+ messages in thread From: David Vernet @ 2022-10-17 18:16 UTC (permalink / raw) To: Yonghong Song Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo On Thu, Oct 13, 2022 at 09:56:30PM -0700, Yonghong Song wrote: > Similar to sk/inode/task storage, implement similar cgroup local storage. > > There already exists a local storage implementation for cgroup-attached > bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper > bpf_get_local_storage(). But there are use cases such that non-cgroup > attached bpf progs wants to access cgroup local storage data. For example, > tc egress prog has access to sk and cgroup. It is possible to use > sk local storage to emulate cgroup local storage by storing data in socket. > But this is a waste as it could be lots of sockets belonging to a particular > cgroup. Alternatively, a separate map can be created with cgroup id as the key. > But this will introduce additional overhead to manipulate the new map. > A cgroup local storage, similar to existing sk/inode/task storage, > should help for this use case. > > The life-cycle of storage is managed with the life-cycle of the > cgroup struct. i.e. the storage is destroyed along with the owning cgroup > with a callback to the bpf_cgroup_storage_free when cgroup itself > is deleted. > > The userspace map operations can be done by using a cgroup fd as a key > passed to the lookup, update and delete operations. 
> > Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old cgroup local > storage support, the new map name BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE is used > for cgroup storage available to non-cgroup-attached bpf programs. The two > helpers are named as bpf_cgroup_local_storage_get() and > bpf_cgroup_local_storage_delete(). > > Signed-off-by: Yonghong Song <yhs@fb.com> > --- > include/linux/bpf.h | 3 + > include/linux/bpf_types.h | 1 + > include/linux/cgroup-defs.h | 4 + > include/uapi/linux/bpf.h | 39 +++++ > kernel/bpf/Makefile | 2 +- > kernel/bpf/bpf_cgroup_storage.c | 280 ++++++++++++++++++++++++++++++++ > kernel/bpf/helpers.c | 6 + > kernel/bpf/syscall.c | 3 +- > kernel/bpf/verifier.c | 14 +- > kernel/cgroup/cgroup.c | 4 + > kernel/trace/bpf_trace.c | 4 + > scripts/bpf_doc.py | 2 + > tools/include/uapi/linux/bpf.h | 39 +++++ > 13 files changed, 398 insertions(+), 3 deletions(-) > create mode 100644 kernel/bpf/bpf_cgroup_storage.c > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h > index 9e7d46d16032..1395a01c7f18 100644 > --- a/include/linux/bpf.h > +++ b/include/linux/bpf.h > @@ -2045,6 +2045,7 @@ struct bpf_link *bpf_link_by_id(u32 id); > > const struct bpf_func_proto *bpf_base_func_proto(enum bpf_func_id func_id); > void bpf_task_storage_free(struct task_struct *task); > +void bpf_local_cgroup_storage_free(struct cgroup *cgroup); > bool bpf_prog_has_kfunc_call(const struct bpf_prog *prog); > const struct btf_func_model * > bpf_jit_find_kfunc_model(const struct bpf_prog *prog, > @@ -2537,6 +2538,8 @@ extern const struct bpf_func_proto bpf_copy_from_user_task_proto; > extern const struct bpf_func_proto bpf_set_retval_proto; > extern const struct bpf_func_proto bpf_get_retval_proto; > extern const struct bpf_func_proto bpf_user_ringbuf_drain_proto; > +extern const struct bpf_func_proto bpf_cgroup_storage_get_proto; > +extern const struct bpf_func_proto bpf_cgroup_storage_delete_proto; > > const struct bpf_func_proto *tracing_prog_func_proto( > 
enum bpf_func_id func_id, const struct bpf_prog *prog); > diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h > index 2c6a4f2562a7..7a0362d7a0aa 100644 > --- a/include/linux/bpf_types.h > +++ b/include/linux/bpf_types.h > @@ -90,6 +90,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_ARRAY, cgroup_array_map_ops) > #ifdef CONFIG_CGROUP_BPF > BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_STORAGE, cgroup_storage_map_ops) > BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE, cgroup_storage_map_ops) > +BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, cgroup_local_storage_map_ops) Did you mean to compile this out if !CONFIG_CGROUP_BPF? It looks like we're using CONFIG_BPF_SYSCALL elsewhere, which makes sense if we're keeping CONFIG_CGROUP_BPF for programs attaching to cgroups. Or maybe we should put it in CONFIG_CGROUPS, which is what we use when compiling bpf_cgroup_storage.o and the other relevant helpers? Also, would you mind please adding comments here explaining what the difference is between these map types? In terms of readability, BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE and BPF_MAP_TYPE_CGROUP_STORAGE are nearly identical, and adding to the confusion, BPF_MAP_TYPE_CGROUP_STORAGE is itself accessed with the bpf_get_local_storage() helper. I feel like we need to be quite verbose about the difference here or users are going to be confused when trying to figure out the differences between these map types. 
> #endif > BPF_MAP_TYPE(BPF_MAP_TYPE_HASH, htab_map_ops) > BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_HASH, htab_percpu_map_ops) > diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h > index 4bcf56b3491c..c6f4590dda68 100644 > --- a/include/linux/cgroup-defs.h > +++ b/include/linux/cgroup-defs.h > @@ -504,6 +504,10 @@ struct cgroup { > /* Used to store internal freezer state */ > struct cgroup_freezer_state freezer; > > +#ifdef CONFIG_BPF_SYSCALL As alluded to above, I assume this should _not_ be: #ifdef CONFIG_CGROUP_BPF Just wanted to highlight it to make sure we're being consistent. > + struct bpf_local_storage __rcu *bpf_cgroup_storage; > +#endif > + > /* ids of the ancestors at each level including self */ > u64 ancestor_ids[]; > }; > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > index 17f61338f8f8..d918b4054297 100644 > --- a/include/uapi/linux/bpf.h > +++ b/include/uapi/linux/bpf.h > @@ -935,6 +935,7 @@ enum bpf_map_type { > BPF_MAP_TYPE_TASK_STORAGE, > BPF_MAP_TYPE_BLOOM_FILTER, > BPF_MAP_TYPE_USER_RINGBUF, > + BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, > }; > > /* Note that tracing related programs such as > @@ -5435,6 +5436,42 @@ union bpf_attr { > * **-E2BIG** if user-space has tried to publish a sample which is > * larger than the size of the ring buffer, or which cannot fit > * within a struct bpf_dynptr. > + * > + * void *bpf_cgroup_local_storage_get(struct bpf_map *map, struct cgroup *cgroup, void *value, u64 flags) I think it will be easy for users to get confused here with bpf_get_local_storage(), which even mentions "cgroup local storage" in the description: 3338 * void *bpf_get_local_storage(void *map, u64 flags) 3339 * Description 3340 * Get the pointer to the local storage area. 3341 * The type and the size of the local storage is defined 3342 * by the *map* argument. 3343 * The *flags* meaning is specific for each map type, 3344 * and has to be 0 for cgroup local storage. 
It would have been nice if, instead of defining an entirely new helper, we could update enum bpf_cgroup_storage_type to include a third type of cgroup storage, something like: BPF_CGROUP_STORAGE_LOCAL That of course doesn't work for bpf_get_local_storage() though, which doesn't take a struct cgroup * argument. So I think what you're proposing is fine, though I would again suggest that we explicitly spell out the difference between bpf_cgroup_local_storage_get() and bpf_get_local_storage(). Alternatively, do we have any intention of deprecating the older cgroup storage map types? What you're proposing here feels like a more canonical and ergonomic API, so it'd be nice to guide folks towards this as the proper cgroup local storage map at some point. Also, one more nit / thought, but should we change the name to: void *bpf_cgroup_storage_get() This more closely matches the equivalent for task local storage: bpf_task_storage_get(). > + * Description > + * Get a bpf_local_storage from the *cgroup*. > + * > + * Logically, it could be thought of as getting the value from > + * a *map* with *cgroup* as the **key**. From this > + * perspective, the usage is not much different from > + * **bpf_map_lookup_elem**\ (*map*, **&**\ *cgroup*) except this > + * helper enforces the key must be a cgroup struct and the map must also > + * be a **BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE**. > + * > + * Underneath, the value is stored locally at *cgroup* instead of > + * the *map*. The *map* is used as the bpf-local-storage > + * "type". The bpf-local-storage "type" (i.e. the *map*) is > + * searched against all bpf_local_storage residing at *cgroup*. > + * > + * An optional *flags* (**BPF_LOCAL_STORAGE_GET_F_CREATE**) can be > + * used such that a new bpf_local_storage will be > + * created if one does not exist. *value* can be used > + * together with **BPF_LOCAL_STORAGE_GET_F_CREATE** to specify > + * the initial value of a bpf_local_storage. 
If *value* is > + * **NULL**, the new bpf_local_storage will be zero initialized. > + * Return > + * A bpf_local_storage pointer is returned on success. > + * > + * **NULL** if not found or there was an error in adding > + * a new bpf_local_storage. > + * > + * long bpf_cgroup_local_storage_delete(struct bpf_map *map, struct cgroup *cgroup) Same question here r.e. name. Is bpf_cgroup_storage_delete() more consistent with local storage existing helpers such as bpf_task_storage_delete()? > + * Description > + * Delete a bpf_local_storage from a *cgroup*. > + * Return > + * 0 on success. > + * > + * **-ENOENT** if the bpf_local_storage cannot be found. > */ > #define ___BPF_FUNC_MAPPER(FN, ctx...) \ > FN(unspec, 0, ##ctx) \ > @@ -5647,6 +5684,8 @@ union bpf_attr { > FN(tcp_raw_check_syncookie_ipv6, 207, ##ctx) \ > FN(ktime_get_tai_ns, 208, ##ctx) \ > FN(user_ringbuf_drain, 209, ##ctx) \ > + FN(cgroup_local_storage_get, 210, ##ctx) \ > + FN(cgroup_local_storage_delete, 211, ##ctx) \ > /* */ > > /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that don't > diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile > index 341c94f208f4..b02693f51978 100644 > --- a/kernel/bpf/Makefile > +++ b/kernel/bpf/Makefile > @@ -25,7 +25,7 @@ ifeq ($(CONFIG_PERF_EVENTS),y) > obj-$(CONFIG_BPF_SYSCALL) += stackmap.o > endif > ifeq ($(CONFIG_CGROUPS),y) > -obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o > +obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o bpf_cgroup_storage.o > endif > obj-$(CONFIG_CGROUP_BPF) += cgroup.o > ifeq ($(CONFIG_INET),y) > diff --git a/kernel/bpf/bpf_cgroup_storage.c b/kernel/bpf/bpf_cgroup_storage.c > new file mode 100644 > index 000000000000..9974784822da > --- /dev/null > +++ b/kernel/bpf/bpf_cgroup_storage.c > @@ -0,0 +1,280 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/* > + * Copyright (c) 2022 Meta Platforms, Inc. and affiliates. 
> + */ > + > +#include <linux/types.h> > +#include <linux/bpf.h> > +#include <linux/bpf_local_storage.h> > +#include <uapi/linux/btf.h> > +#include <linux/btf_ids.h> > + > +DEFINE_BPF_STORAGE_CACHE(cgroup_cache); > + > +static DEFINE_PER_CPU(int, bpf_cgroup_storage_busy); > + > +static void bpf_cgroup_storage_lock(void) > +{ > + migrate_disable(); > + this_cpu_inc(bpf_cgroup_storage_busy); > +} > + > +static void bpf_cgroup_storage_unlock(void) > +{ > + this_cpu_dec(bpf_cgroup_storage_busy); > + migrate_enable(); > +} > + > +static bool bpf_cgroup_storage_trylock(void) > +{ > + migrate_disable(); > + if (unlikely(this_cpu_inc_return(bpf_cgroup_storage_busy) != 1)) { > + this_cpu_dec(bpf_cgroup_storage_busy); > + migrate_enable(); > + return false; > + } > + return true; > +} > + > +static struct bpf_local_storage __rcu **cgroup_storage_ptr(void *owner) > +{ > + struct cgroup *cg = owner; > + > + return &cg->bpf_cgroup_storage; > +} > + > +void bpf_local_cgroup_storage_free(struct cgroup *cgroup) > +{ > + struct bpf_local_storage *local_storage; > + struct bpf_local_storage_elem *selem; > + bool free_cgroup_storage = false; > + struct hlist_node *n; > + unsigned long flags; > + > + rcu_read_lock(); > + local_storage = rcu_dereference(cgroup->bpf_cgroup_storage); > + if (!local_storage) { > + rcu_read_unlock(); > + return; > + } > + > + /* Neither the bpf_prog nor the bpf-map's syscall > + * could be modifying the local_storage->list now. > + * Thus, no elem can be added-to or deleted-from the > + * local_storage->list by the bpf_prog or by the bpf-map's syscall. > + * > + * It is racing with bpf_local_storage_map_free() alone > + * when unlinking elem from the local_storage->list and > + * the map's bucket->list. 
> + */ > + bpf_cgroup_storage_lock(); > + raw_spin_lock_irqsave(&local_storage->lock, flags); > + hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) { > + bpf_selem_unlink_map(selem); > + free_cgroup_storage = > + bpf_selem_unlink_storage_nolock(local_storage, selem, false, false); Could this overwrite a previously-true free_cgroup_storage if one of these entries is false? Did you mean to do something like this? if (bpf_selem_unlink_storage_nolock(local_storage, selem, false, false)) free_cgroup_storage = true; > + } > + raw_spin_unlock_irqrestore(&local_storage->lock, flags); > + bpf_cgroup_storage_unlock(); > + rcu_read_unlock(); > + > + /* free_cgroup_storage should always be true as long as > + * local_storage->list was non-empty. > + */ > + if (free_cgroup_storage) > + kfree_rcu(local_storage, rcu); > +} > + > +static struct bpf_local_storage_data * > +cgroup_storage_lookup(struct cgroup *cgroup, struct bpf_map *map, bool cacheit_lockit) > +{ > + struct bpf_local_storage *cgroup_storage; > + struct bpf_local_storage_map *smap; > + > + cgroup_storage = rcu_dereference_check(cgroup->bpf_cgroup_storage, > + bpf_rcu_lock_held()); > + if (!cgroup_storage) > + return NULL; > + > + smap = (struct bpf_local_storage_map *)map; > + return bpf_local_storage_lookup(cgroup_storage, smap, cacheit_lockit); > +} > + > +static void *bpf_cgroup_storage_lookup_elem(struct bpf_map *map, void *key) > +{ > + struct bpf_local_storage_data *sdata; > + struct cgroup *cgroup; > + int fd; > + > + fd = *(int *)key; > + cgroup = cgroup_get_from_fd(fd); > + if (IS_ERR(cgroup)) > + return ERR_CAST(cgroup); > + > + bpf_cgroup_storage_lock(); > + sdata = cgroup_storage_lookup(cgroup, map, true); > + bpf_cgroup_storage_unlock(); > + cgroup_put(cgroup); > + return sdata ? 
sdata->data : NULL; > +} > + > +static int bpf_cgroup_storage_update_elem(struct bpf_map *map, void *key, > + void *value, u64 map_flags) > +{ > + struct bpf_local_storage_data *sdata; > + struct cgroup *cgroup; > + int err, fd; > + > + fd = *(int *)key; > + cgroup = cgroup_get_from_fd(fd); > + if (IS_ERR(cgroup)) > + return PTR_ERR(cgroup); > + > + bpf_cgroup_storage_lock(); > + sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map *)map, > + value, map_flags, GFP_ATOMIC); > + bpf_cgroup_storage_unlock(); > + err = PTR_ERR_OR_ZERO(sdata); > + cgroup_put(cgroup); > + return err; Optional suggestion, but perhaps this is slightly more concise: bpf_cgroup_storage_unlock(); cgroup_put(cgroup); return PTR_ERR_OR_ZERO(sdata); > +} > + > +static int cgroup_storage_delete(struct cgroup *cgroup, struct bpf_map *map) > +{ > + struct bpf_local_storage_data *sdata; > + > + sdata = cgroup_storage_lookup(cgroup, map, false); > + if (!sdata) > + return -ENOENT; > + > + bpf_selem_unlink(SELEM(sdata), true); > + return 0; > +} > + > +static int bpf_cgroup_storage_delete_elem(struct bpf_map *map, void *key) > +{ > + struct cgroup *cgroup; > + int err, fd; > + > + fd = *(int *)key; > + cgroup = cgroup_get_from_fd(fd); > + if (IS_ERR(cgroup)) > + return PTR_ERR(cgroup); > + > + bpf_cgroup_storage_lock(); > + err = cgroup_storage_delete(cgroup, map); > + bpf_cgroup_storage_unlock(); > + if (err) > + return err; Doesn't this error path leak the cgroup? Maybe this would be cleaner: bpf_cgroup_storage_lock(); err = cgroup_storage_delete(cgroup, map); bpf_cgroup_storage_unlock(); cgroup_put(cgroup); return err; > + > + cgroup_put(cgroup); > + return 0; > +} > + [...] Thanks, David ^ permalink raw reply [flat|nested] 38+ messages in thread
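[Editorial note: the free_cgroup_storage concern raised above is about a boolean being assigned inside a loop. The sketch below, with hypothetical names, contrasts the two shapes: with arbitrary per-element results, plain assignment keeps only the last result, while the if-form suggested in the review keeps the flag sticky once set. (Yonghong's follow-up notes that in the kernel the "storage now empty" result can only come from the final element, which is why the plain assignment happens to be safe there.)]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Each entry of unlink_freed[] stands in for one call's return value
 * from a per-element unlink helper ("did this unlink empty the
 * storage?"). Hypothetical shape, not the kernel API. */
static bool drain_plain(const bool *unlink_freed, size_t n)
{
	bool freed = false;

	for (size_t i = 0; i < n; i++)
		freed = unlink_freed[i];   /* a later false overwrites an earlier true */
	return freed;
}

static bool drain_sticky(const bool *unlink_freed, size_t n)
{
	bool freed = false;

	for (size_t i = 0; i < n; i++)
		if (unlink_freed[i])
			freed = true;      /* sticky once set, as suggested in the review */
	return freed;
}
```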
* Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs 2022-10-17 18:16 ` David Vernet @ 2022-10-17 19:45 ` Yonghong Song 0 siblings, 0 replies; 38+ messages in thread From: Yonghong Song @ 2022-10-17 19:45 UTC (permalink / raw) To: David Vernet, Yonghong Song Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo On 10/17/22 11:16 AM, David Vernet wrote: > On Thu, Oct 13, 2022 at 09:56:30PM -0700, Yonghong Song wrote: >> Similar to sk/inode/task storage, implement similar cgroup local storage. >> >> There already exists a local storage implementation for cgroup-attached >> bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper >> bpf_get_local_storage(). But there are use cases such that non-cgroup >> attached bpf progs wants to access cgroup local storage data. For example, >> tc egress prog has access to sk and cgroup. It is possible to use >> sk local storage to emulate cgroup local storage by storing data in socket. >> But this is a waste as it could be lots of sockets belonging to a particular >> cgroup. Alternatively, a separate map can be created with cgroup id as the key. >> But this will introduce additional overhead to manipulate the new map. >> A cgroup local storage, similar to existing sk/inode/task storage, >> should help for this use case. >> >> The life-cycle of storage is managed with the life-cycle of the >> cgroup struct. i.e. the storage is destroyed along with the owning cgroup >> with a callback to the bpf_cgroup_storage_free when cgroup itself >> is deleted. >> >> The userspace map operations can be done by using a cgroup fd as a key >> passed to the lookup, update and delete operations. >> >> Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old cgroup local >> storage support, the new map name BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE is used >> for cgroup storage available to non-cgroup-attached bpf programs. 
The two >> helpers are named as bpf_cgroup_local_storage_get() and >> bpf_cgroup_local_storage_delete(). >> >> Signed-off-by: Yonghong Song <yhs@fb.com> >> --- >> include/linux/bpf.h | 3 + >> include/linux/bpf_types.h | 1 + >> include/linux/cgroup-defs.h | 4 + >> include/uapi/linux/bpf.h | 39 +++++ >> kernel/bpf/Makefile | 2 +- >> kernel/bpf/bpf_cgroup_storage.c | 280 ++++++++++++++++++++++++++++++++ >> kernel/bpf/helpers.c | 6 + >> kernel/bpf/syscall.c | 3 +- >> kernel/bpf/verifier.c | 14 +- >> kernel/cgroup/cgroup.c | 4 + >> kernel/trace/bpf_trace.c | 4 + >> scripts/bpf_doc.py | 2 + >> tools/include/uapi/linux/bpf.h | 39 +++++ >> 13 files changed, 398 insertions(+), 3 deletions(-) >> create mode 100644 kernel/bpf/bpf_cgroup_storage.c >> >> diff --git a/include/linux/bpf.h b/include/linux/bpf.h >> index 9e7d46d16032..1395a01c7f18 100644 >> --- a/include/linux/bpf.h >> +++ b/include/linux/bpf.h >> @@ -2045,6 +2045,7 @@ struct bpf_link *bpf_link_by_id(u32 id); >> >> const struct bpf_func_proto *bpf_base_func_proto(enum bpf_func_id func_id); >> void bpf_task_storage_free(struct task_struct *task); >> +void bpf_local_cgroup_storage_free(struct cgroup *cgroup); >> bool bpf_prog_has_kfunc_call(const struct bpf_prog *prog); >> const struct btf_func_model * >> bpf_jit_find_kfunc_model(const struct bpf_prog *prog, >> @@ -2537,6 +2538,8 @@ extern const struct bpf_func_proto bpf_copy_from_user_task_proto; >> extern const struct bpf_func_proto bpf_set_retval_proto; >> extern const struct bpf_func_proto bpf_get_retval_proto; >> extern const struct bpf_func_proto bpf_user_ringbuf_drain_proto; >> +extern const struct bpf_func_proto bpf_cgroup_storage_get_proto; >> +extern const struct bpf_func_proto bpf_cgroup_storage_delete_proto; >> >> const struct bpf_func_proto *tracing_prog_func_proto( >> enum bpf_func_id func_id, const struct bpf_prog *prog); >> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h >> index 2c6a4f2562a7..7a0362d7a0aa 100644 >> --- 
a/include/linux/bpf_types.h >> +++ b/include/linux/bpf_types.h >> @@ -90,6 +90,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_ARRAY, cgroup_array_map_ops) >> #ifdef CONFIG_CGROUP_BPF >> BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_STORAGE, cgroup_storage_map_ops) >> BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE, cgroup_storage_map_ops) >> +BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, cgroup_local_storage_map_ops) > > Did you mean to compile this out if !CONFIG_CGROUP_BPF? It looks like > we're using CONFIG_BPF_SYSCALL elsewhere, which makes sense if we're > keeping CONFIG_CGROUP_BPF for programs attaching to cgroups. Or maybe we > should put it in CONFIG_CGROUPS, which is what we use when compiling > bpf_cgroup_storage.o and the other relevant helpers? BPF_MAP_TYPE is defined as #define BPF_MAP_TYPE(_id, _ops) \ extern const struct bpf_map_ops _ops; so it should be okay whether it is guarded by CONFIG_CGROUP_BPF or CONFIG_CGROUPS. I am aware some helper-related code/switch-cases are guarded with CONFIG_CGROUPS and I just added my helper there as well. But I will double check that CONFIG_CGROUPS && !CONFIG_CGROUP_BPF can compile properly. > > Also, would you mind please adding comments here explaining what the > difference is between these map types? In terms of readability, > BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE and BPF_MAP_TYPE_CGROUP_STORAGE are > nearly identical, and adding to the confusion, > BPF_MAP_TYPE_CGROUP_STORAGE is itself accessed with the > bpf_get_local_storage() helper. I feel like we need to be quite verbose > about the difference here or users are going to be confused when trying > to figure out the differences between these map types. Agree. Two very similar map names are confusing. I plan to reuse the same map name BPF_MAP_TYPE_CGROUP_STORAGE and add a map-flag to distinguish the two use cases. 
> >> #endif >> BPF_MAP_TYPE(BPF_MAP_TYPE_HASH, htab_map_ops) >> BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_HASH, htab_percpu_map_ops) >> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h >> index 4bcf56b3491c..c6f4590dda68 100644 >> --- a/include/linux/cgroup-defs.h >> +++ b/include/linux/cgroup-defs.h >> @@ -504,6 +504,10 @@ struct cgroup { >> /* Used to store internal freezer state */ >> struct cgroup_freezer_state freezer; >> >> +#ifdef CONFIG_BPF_SYSCALL > > As alluded to above, I assume this should _not_ be: > > #ifdef CONFIG_CGROUP_BPF > > Just wanted to highlight it to make sure we're being consistent. We should be okay here as config CGROUP_BPF bool "Support for eBPF programs attached to cgroups" depends on BPF_SYSCALL select SOCK_CGROUP_DATA But I can change to CONFIG_CGROUP_BPF. > >> + struct bpf_local_storage __rcu *bpf_cgroup_storage; >> +#endif >> + >> /* ids of the ancestors at each level including self */ >> u64 ancestor_ids[]; >> }; >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h >> index 17f61338f8f8..d918b4054297 100644 >> --- a/include/uapi/linux/bpf.h >> +++ b/include/uapi/linux/bpf.h >> @@ -935,6 +935,7 @@ enum bpf_map_type { >> BPF_MAP_TYPE_TASK_STORAGE, >> BPF_MAP_TYPE_BLOOM_FILTER, >> BPF_MAP_TYPE_USER_RINGBUF, >> + BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE, >> }; >> >> /* Note that tracing related programs such as >> @@ -5435,6 +5436,42 @@ union bpf_attr { >> * **-E2BIG** if user-space has tried to publish a sample which is >> * larger than the size of the ring buffer, or which cannot fit >> * within a struct bpf_dynptr. 
>> + * >> + * void *bpf_cgroup_local_storage_get(struct bpf_map *map, struct cgroup *cgroup, void *value, u64 flags) > > I think it will be easy for users to get confused here with > bpf_get_local_storage(), which even mentions "cgroup local storage" in > the description: > > 3338 * void *bpf_get_local_storage(void *map, u64 flags) > 3339 * Description > 3340 * Get the pointer to the local storage area. > 3341 * The type and the size of the local storage is defined > 3342 * by the *map* argument. > 3343 * The *flags* meaning is specific for each map type, > 3344 * and has to be 0 for cgroup local storage. > > It would have been nice if, instead of defining an entirely new helper, > we could update enum bpf_cgroup_storage_type to include a third type of > cgroup storage, something like: > > BPF_CGROUP_STORAGE_LOCAL > > That of course doesn't work for bpf_get_local_storage() though, which > doesn't take a struct cgroup * argument. So I think what you're > proposing is fine, though I would again suggest that we explicitly spell > out the difference between bpf_cgroup_local_storage_get() and > bpf_get_local_storage(). Alternatively, do we have any intention of > deprecating the older cgroup storage map types? What you're proposing > here feels like a more canonical and ergonomic API, so it'd be nice to > guide folks towards this as the proper cgroup local storage map at some > point. > > Also, one more nit / thought, but should we change the name to: > > void *bpf_cgroup_storage_get() Ya, I plan to use this in the next revision. Basically bpf_cgroup_storage_get/delete() can be used if flag BPF_F_LOCAL_STORAGE_GENERIC is specified. If the flag BPF_F_LOCAL_STORAGE_GENERIC is not specified, the helper bpf_get_local_storage() can be used. > > This more closely matches the equivalent for task local storage: > bpf_task_storage_get(). > >> + * Description >> + * Get a bpf_local_storage from the *cgroup*. 
>> + * >> + * Logically, it could be thought of as getting the value from >> + * a *map* with *cgroup* as the **key**. From this >> + * perspective, the usage is not much different from >> + * **bpf_map_lookup_elem**\ (*map*, **&**\ *cgroup*) except this >> + * helper enforces the key must be a cgroup struct and the map must also >> + * be a **BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE**. >> + * >> + * Underneath, the value is stored locally at *cgroup* instead of >> + * the *map*. The *map* is used as the bpf-local-storage >> + * "type". The bpf-local-storage "type" (i.e. the *map*) is >> + * searched against all bpf_local_storage residing at *cgroup*. >> + * >> + * An optional *flags* (**BPF_LOCAL_STORAGE_GET_F_CREATE**) can be >> + * used such that a new bpf_local_storage will be >> + * created if one does not exist. *value* can be used >> + * together with **BPF_LOCAL_STORAGE_GET_F_CREATE** to specify >> + * the initial value of a bpf_local_storage. If *value* is >> + * **NULL**, the new bpf_local_storage will be zero initialized. >> + * Return >> + * A bpf_local_storage pointer is returned on success. >> + * >> + * **NULL** if not found or there was an error in adding >> + * a new bpf_local_storage. >> + * >> + * long bpf_cgroup_local_storage_delete(struct bpf_map *map, struct cgroup *cgroup) > > Same question here r.e. name. Is bpf_cgroup_storage_delete() more > consistent with local storage existing helpers such as > bpf_task_storage_delete()? > >> + * Description >> + * Delete a bpf_local_storage from a *cgroup*. >> + * Return >> + * 0 on success. >> + * >> + * **-ENOENT** if the bpf_local_storage cannot be found. >> */ >> #define ___BPF_FUNC_MAPPER(FN, ctx...) 
\ >> FN(unspec, 0, ##ctx) \ >> @@ -5647,6 +5684,8 @@ union bpf_attr { >> FN(tcp_raw_check_syncookie_ipv6, 207, ##ctx) \ >> FN(ktime_get_tai_ns, 208, ##ctx) \ >> FN(user_ringbuf_drain, 209, ##ctx) \ >> + FN(cgroup_local_storage_get, 210, ##ctx) \ >> + FN(cgroup_local_storage_delete, 211, ##ctx) \ >> /* */ >> >> /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that don't >> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile >> index 341c94f208f4..b02693f51978 100644 >> --- a/kernel/bpf/Makefile >> +++ b/kernel/bpf/Makefile >> @@ -25,7 +25,7 @@ ifeq ($(CONFIG_PERF_EVENTS),y) >> obj-$(CONFIG_BPF_SYSCALL) += stackmap.o >> endif >> ifeq ($(CONFIG_CGROUPS),y) >> -obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o >> +obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o bpf_cgroup_storage.o >> endif >> obj-$(CONFIG_CGROUP_BPF) += cgroup.o >> ifeq ($(CONFIG_INET),y) >> diff --git a/kernel/bpf/bpf_cgroup_storage.c b/kernel/bpf/bpf_cgroup_storage.c >> new file mode 100644 >> index 000000000000..9974784822da >> --- /dev/null >> +++ b/kernel/bpf/bpf_cgroup_storage.c >> @@ -0,0 +1,280 @@ >> +// SPDX-License-Identifier: GPL-2.0 >> +/* >> + * Copyright (c) 2022 Meta Platforms, Inc. and affiliates. 
>> + */ >> + >> +#include <linux/types.h> >> +#include <linux/bpf.h> >> +#include <linux/bpf_local_storage.h> >> +#include <uapi/linux/btf.h> >> +#include <linux/btf_ids.h> >> + >> +DEFINE_BPF_STORAGE_CACHE(cgroup_cache); >> + >> +static DEFINE_PER_CPU(int, bpf_cgroup_storage_busy); >> + >> +static void bpf_cgroup_storage_lock(void) >> +{ >> + migrate_disable(); >> + this_cpu_inc(bpf_cgroup_storage_busy); >> +} >> + >> +static void bpf_cgroup_storage_unlock(void) >> +{ >> + this_cpu_dec(bpf_cgroup_storage_busy); >> + migrate_enable(); >> +} >> + >> +static bool bpf_cgroup_storage_trylock(void) >> +{ >> + migrate_disable(); >> + if (unlikely(this_cpu_inc_return(bpf_cgroup_storage_busy) != 1)) { >> + this_cpu_dec(bpf_cgroup_storage_busy); >> + migrate_enable(); >> + return false; >> + } >> + return true; >> +} >> + >> +static struct bpf_local_storage __rcu **cgroup_storage_ptr(void *owner) >> +{ >> + struct cgroup *cg = owner; >> + >> + return &cg->bpf_cgroup_storage; >> +} >> + >> +void bpf_local_cgroup_storage_free(struct cgroup *cgroup) >> +{ >> + struct bpf_local_storage *local_storage; >> + struct bpf_local_storage_elem *selem; >> + bool free_cgroup_storage = false; >> + struct hlist_node *n; >> + unsigned long flags; >> + >> + rcu_read_lock(); >> + local_storage = rcu_dereference(cgroup->bpf_cgroup_storage); >> + if (!local_storage) { >> + rcu_read_unlock(); >> + return; >> + } >> + >> + /* Neither the bpf_prog nor the bpf-map's syscall >> + * could be modifying the local_storage->list now. >> + * Thus, no elem can be added-to or deleted-from the >> + * local_storage->list by the bpf_prog or by the bpf-map's syscall. >> + * >> + * It is racing with bpf_local_storage_map_free() alone >> + * when unlinking elem from the local_storage->list and >> + * the map's bucket->list. 
>> + */ >> + bpf_cgroup_storage_lock(); >> + raw_spin_lock_irqsave(&local_storage->lock, flags); >> + hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) { >> + bpf_selem_unlink_map(selem); >> + free_cgroup_storage = >> + bpf_selem_unlink_storage_nolock(local_storage, selem, false, false); > > Could this overwrite a previously-true free_cgroup_storage if one of > these entries is false? Did you mean to do something like this? I will add a comment here. This should not be the case. > > if (bpf_selem_unlink_storage_nolock(local_storage, selem, false, false)) > free_cgroup_storage = true; > >> + } >> + raw_spin_unlock_irqrestore(&local_storage->lock, flags); >> + bpf_cgroup_storage_unlock(); >> + rcu_read_unlock(); >> + >> + /* free_cgroup_storage should always be true as long as >> + * local_storage->list was non-empty. >> + */ >> + if (free_cgroup_storage) >> + kfree_rcu(local_storage, rcu); >> +} >> + >> +static struct bpf_local_storage_data * >> +cgroup_storage_lookup(struct cgroup *cgroup, struct bpf_map *map, bool cacheit_lockit) >> +{ >> + struct bpf_local_storage *cgroup_storage; >> + struct bpf_local_storage_map *smap; >> + >> + cgroup_storage = rcu_dereference_check(cgroup->bpf_cgroup_storage, >> + bpf_rcu_lock_held()); >> + if (!cgroup_storage) >> + return NULL; >> + >> + smap = (struct bpf_local_storage_map *)map; >> + return bpf_local_storage_lookup(cgroup_storage, smap, cacheit_lockit); >> +} >> + >> +static void *bpf_cgroup_storage_lookup_elem(struct bpf_map *map, void *key) >> +{ >> + struct bpf_local_storage_data *sdata; >> + struct cgroup *cgroup; >> + int fd; >> + >> + fd = *(int *)key; >> + cgroup = cgroup_get_from_fd(fd); >> + if (IS_ERR(cgroup)) >> + return ERR_CAST(cgroup); >> + >> + bpf_cgroup_storage_lock(); >> + sdata = cgroup_storage_lookup(cgroup, map, true); >> + bpf_cgroup_storage_unlock(); >> + cgroup_put(cgroup); >> + return sdata ? 
sdata->data : NULL; >> +} >> + >> +static int bpf_cgroup_storage_update_elem(struct bpf_map *map, void *key, >> + void *value, u64 map_flags) >> +{ >> + struct bpf_local_storage_data *sdata; >> + struct cgroup *cgroup; >> + int err, fd; >> + >> + fd = *(int *)key; >> + cgroup = cgroup_get_from_fd(fd); >> + if (IS_ERR(cgroup)) >> + return PTR_ERR(cgroup); >> + >> + bpf_cgroup_storage_lock(); >> + sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map *)map, >> + value, map_flags, GFP_ATOMIC); >> + bpf_cgroup_storage_unlock(); >> + err = PTR_ERR_OR_ZERO(sdata); >> + cgroup_put(cgroup); >> + return err; > > Optional suggestion, but perhaps this is slightly more concise: > > bpf_cgroup_storage_unlock(); > cgroup_put(cgroup); > return PTR_ERR_OR_ZERO(sdata); Good idea. Will do. > >> +} >> + >> +static int cgroup_storage_delete(struct cgroup *cgroup, struct bpf_map *map) >> +{ >> + struct bpf_local_storage_data *sdata; >> + >> + sdata = cgroup_storage_lookup(cgroup, map, false); >> + if (!sdata) >> + return -ENOENT; >> + >> + bpf_selem_unlink(SELEM(sdata), true); >> + return 0; >> +} >> + >> +static int bpf_cgroup_storage_delete_elem(struct bpf_map *map, void *key) >> +{ >> + struct cgroup *cgroup; >> + int err, fd; >> + >> + fd = *(int *)key; >> + cgroup = cgroup_get_from_fd(fd); >> + if (IS_ERR(cgroup)) >> + return PTR_ERR(cgroup); >> + >> + bpf_cgroup_storage_lock(); >> + err = cgroup_storage_delete(cgroup, map); >> + bpf_cgroup_storage_unlock(); >> + if (err) >> + return err; > > Doesn't this error path leak the cgroup? Maybe this would be cleaner: > > bpf_cgroup_storage_lock(); > err = cgroup_storage_delete(cgroup, map); > bpf_cgroup_storage_unlock(); > cgroup_put(cgroup); > > return err; Thanks for spotting this. Yes, 'return err' here will cause a cgroup reference leak. > >> + >> + cgroup_put(cgroup); >> + return 0; >> +} >> + > > [...] > > Thanks, > David ^ permalink raw reply [flat|nested] 38+ messages in thread
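[Editorial note: the leak confirmed above is the classic early-return-skips-the-put bug. The userspace sketch below, with a hypothetical refcount and stub names (not the kernel's cgroup_get/cgroup_put API), contrasts the buggy shape with the single-exit shape David suggests, where the reference is dropped unconditionally before returning.]

```c
#include <assert.h>

/* Stand-in for an object's reference count. */
static int refcount;

static void obj_get(void) { refcount++; }
static void obj_put(void) { refcount--; }

/* Buggy shape: the early return on error skips obj_put(),
 * leaking one reference per failed call. */
static int delete_leaky(int op_err)
{
	obj_get();
	if (op_err)
		return op_err;   /* reference leaked on this path */
	obj_put();
	return 0;
}

/* Fixed shape: single exit, the reference is dropped regardless
 * of whether the operation itself succeeded. */
static int delete_fixed(int op_err)
{
	int err;

	obj_get();
	err = op_err;            /* stands in for the delete operation */
	obj_put();
	return err;
}
```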
* [PATCH bpf-next 3/5] libbpf: Support new cgroup local storage
  2022-10-14  4:56 [PATCH bpf-next 0/5] bpf: Implement cgroup local storage available to non-cgroup-attached bpf progs Yonghong Song
  2022-10-14  4:56 ` [PATCH bpf-next 1/5] bpf: Make struct cgroup btf id global Yonghong Song
  2022-10-14  4:56 ` [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs Yonghong Song
@ 2022-10-14  4:56 ` Yonghong Song
  2022-10-14  4:56 ` [PATCH bpf-next 4/5] bpftool: " Yonghong Song
  2022-10-14  4:56 ` [PATCH bpf-next 5/5] selftests/bpf: Add selftests for " Yonghong Song

  4 siblings, 0 replies; 38+ messages in thread
From: Yonghong Song @ 2022-10-14  4:56 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	KP Singh, Martin KaFai Lau, Tejun Heo

Add support for new cgroup local storage.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 tools/lib/bpf/libbpf.c        | 1 +
 tools/lib/bpf/libbpf_probes.c | 1 +
 2 files changed, 2 insertions(+)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 8c3f236c86e4..81359eeb5104 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -164,6 +164,7 @@ static const char * const map_type_name[] = {
 	[BPF_MAP_TYPE_TASK_STORAGE]		= "task_storage",
 	[BPF_MAP_TYPE_BLOOM_FILTER]		= "bloom_filter",
 	[BPF_MAP_TYPE_USER_RINGBUF]		= "user_ringbuf",
+	[BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE]	= "cgroup_local_storage",
 };

 static const char * const prog_type_name[] = {
diff --git a/tools/lib/bpf/libbpf_probes.c b/tools/lib/bpf/libbpf_probes.c
index f3a8e8e74eb8..e424de977007 100644
--- a/tools/lib/bpf/libbpf_probes.c
+++ b/tools/lib/bpf/libbpf_probes.c
@@ -221,6 +221,7 @@ static int probe_map_create(enum bpf_map_type map_type)
 	case BPF_MAP_TYPE_SK_STORAGE:
 	case BPF_MAP_TYPE_INODE_STORAGE:
 	case BPF_MAP_TYPE_TASK_STORAGE:
+	case BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE:
 		btf_key_type_id = 1;
 		btf_value_type_id = 3;
 		value_size = 8;
--
2.30.2
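The libbpf change is a one-line addition to a designated-initializer string table. For readers unfamiliar with that pattern, here is a stand-alone sketch of how such a table maps an enum to a display name with a bounds check; the enum values and names here are illustrative, not the real `bpf_map_type` numbering or libbpf's internals:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative subset of a map-type enum. */
enum map_type {
	MAP_TYPE_UNSPEC = 0,
	MAP_TYPE_TASK_STORAGE,
	MAP_TYPE_USER_RINGBUF,
	MAP_TYPE_CGROUP_LOCAL_STORAGE,	/* the newly added entry */
	NR_MAP_TYPES,
};

/* Designated initializers: each slot indexed by its enum value. */
static const char * const map_type_name[] = {
	[MAP_TYPE_TASK_STORAGE]		= "task_storage",
	[MAP_TYPE_USER_RINGBUF]		= "user_ringbuf",
	[MAP_TYPE_CGROUP_LOCAL_STORAGE]	= "cgroup_local_storage",
};

/* Bounds-checked lookup; NULL for out-of-range or unnamed types. */
static const char *map_type_str(enum map_type t)
{
	if ((int)t < 0 ||
	    (size_t)t >= sizeof(map_type_name) / sizeof(map_type_name[0]))
		return NULL;
	return map_type_name[t];	/* may itself be NULL for gaps */
}
```

The appeal of the pattern is that adding a map type is a single line, and any slot without an initializer is implicitly NULL, so the consumer can distinguish "unknown" from "named" cheaply.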
* [PATCH bpf-next 4/5] bpftool: Support new cgroup local storage
  2022-10-14  4:56 [PATCH bpf-next 0/5] bpf: Implement cgroup local storage available to non-cgroup-attached bpf progs Yonghong Song
                   ` (2 preceding siblings ...)
  2022-10-14  4:56 ` [PATCH bpf-next 3/5] libbpf: Support new cgroup local storage Yonghong Song
@ 2022-10-14  4:56 ` Yonghong Song
  2022-10-17 10:26   ` Quentin Monnet
  2022-10-14  4:56 ` [PATCH bpf-next 5/5] selftests/bpf: Add selftests for " Yonghong Song

  4 siblings, 1 reply; 38+ messages in thread
From: Yonghong Song @ 2022-10-14  4:56 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	KP Singh, Martin KaFai Lau, Tejun Heo

Add support for new cgroup local storage

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 tools/bpf/bpftool/Documentation/bpftool-map.rst | 2 +-
 tools/bpf/bpftool/map.c                         | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/bpf/bpftool/Documentation/bpftool-map.rst b/tools/bpf/bpftool/Documentation/bpftool-map.rst
index 7f3b67a8b48f..4c591b10961e 100644
--- a/tools/bpf/bpftool/Documentation/bpftool-map.rst
+++ b/tools/bpf/bpftool/Documentation/bpftool-map.rst
@@ -55,7 +55,7 @@ MAP COMMANDS
 |	| **devmap** | **devmap_hash** | **sockmap** | **cpumap** | **xskmap** | **sockhash**
 |	| **cgroup_storage** | **reuseport_sockarray** | **percpu_cgroup_storage**
 |	| **queue** | **stack** | **sk_storage** | **struct_ops** | **ringbuf** | **inode_storage**
-|	| **task_storage** | **bloom_filter** | **user_ringbuf** }
+|	| **task_storage** | **bloom_filter** | **user_ringbuf** | **cgroup_local_storage** }

 DESCRIPTION
 ===========
diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
index 9a6ca9f31133..ab681dc65316 100644
--- a/tools/bpf/bpftool/map.c
+++ b/tools/bpf/bpftool/map.c
@@ -1459,7 +1459,7 @@ static int do_help(int argc, char **argv)
 		"                 devmap | devmap_hash | sockmap | cpumap | xskmap | sockhash |\n"
 		"                 cgroup_storage | reuseport_sockarray | percpu_cgroup_storage |\n"
 		"                 queue | stack | sk_storage | struct_ops | ringbuf | inode_storage |\n"
-		"                 task_storage | bloom_filter | user_ringbuf }\n"
+		"                 task_storage | bloom_filter | user_ringbuf | cgroup_local_storage }\n"
 		"       " HELP_SPEC_OPTIONS " |\n"
 		"                 {-f|--bpffs} | {-n|--nomount} }\n"
 		"",
--
2.30.2
* Re: [PATCH bpf-next 4/5] bpftool: Support new cgroup local storage
  2022-10-14  4:56 ` [PATCH bpf-next 4/5] bpftool: " Yonghong Song
@ 2022-10-17 10:26   ` Quentin Monnet
  0 siblings, 0 replies; 38+ messages in thread
From: Quentin Monnet @ 2022-10-17 10:26 UTC (permalink / raw)
To: Yonghong Song, bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	KP Singh, Martin KaFai Lau, Tejun Heo

2022-10-13 21:56 UTC-0700 ~ Yonghong Song <yhs@fb.com>
> Add support for new cgroup local storage
>
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  tools/bpf/bpftool/Documentation/bpftool-map.rst | 2 +-
>  tools/bpf/bpftool/map.c                         | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/tools/bpf/bpftool/Documentation/bpftool-map.rst b/tools/bpf/bpftool/Documentation/bpftool-map.rst
> index 7f3b67a8b48f..4c591b10961e 100644
> --- a/tools/bpf/bpftool/Documentation/bpftool-map.rst
> +++ b/tools/bpf/bpftool/Documentation/bpftool-map.rst
> @@ -55,7 +55,7 @@ MAP COMMANDS
>  |	| **devmap** | **devmap_hash** | **sockmap** | **cpumap** | **xskmap** | **sockhash**
>  |	| **cgroup_storage** | **reuseport_sockarray** | **percpu_cgroup_storage**
>  |	| **queue** | **stack** | **sk_storage** | **struct_ops** | **ringbuf** | **inode_storage**
> -|	| **task_storage** | **bloom_filter** | **user_ringbuf** }
> +|	| **task_storage** | **bloom_filter** | **user_ringbuf** | **cgroup_local_storage** }
>
>  DESCRIPTION
>  ===========
> diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
> index 9a6ca9f31133..ab681dc65316 100644
> --- a/tools/bpf/bpftool/map.c
> +++ b/tools/bpf/bpftool/map.c
> @@ -1459,7 +1459,7 @@ static int do_help(int argc, char **argv)
>  		"                 devmap | devmap_hash | sockmap | cpumap | xskmap | sockhash |\n"
>  		"                 cgroup_storage | reuseport_sockarray | percpu_cgroup_storage |\n"
>  		"                 queue | stack | sk_storage | struct_ops | ringbuf | inode_storage |\n"
> -		"                 task_storage | bloom_filter | user_ringbuf }\n"
> +		"                 task_storage | bloom_filter | user_ringbuf | cgroup_local_storage }\n"
>  		"       " HELP_SPEC_OPTIONS " |\n"
>  		"                 {-f|--bpffs} | {-n|--nomount} }\n"
>  		"",

Thanks for the bpftool update!

Acked-by: Quentin Monnet <quentin@isovalent.com>
* [PATCH bpf-next 5/5] selftests/bpf: Add selftests for cgroup local storage
  2022-10-14  4:56 [PATCH bpf-next 0/5] bpf: Implement cgroup local storage available to non-cgroup-attached bpf progs Yonghong Song
                   ` (3 preceding siblings ...)
  2022-10-14  4:56 ` [PATCH bpf-next 4/5] bpftool: " Yonghong Song
@ 2022-10-14  4:56 ` Yonghong Song
  4 siblings, 0 replies; 38+ messages in thread
From: Yonghong Song @ 2022-10-14  4:56 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	KP Singh, Martin KaFai Lau, Tejun Heo

Add two tests for cgroup local storage, one to test bpf program helpers
and user space map APIs, and the other to test recursive fentry
triggering won't deadlock.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 .../bpf/prog_tests/cgroup_local_storage.c     | 92 +++++++++++++++++++
 .../bpf/progs/cgroup_local_storage.c          | 88 ++++++++++++++++++
 .../selftests/bpf/progs/cgroup_ls_recursion.c | 70 ++++++++++++++
 3 files changed, 250 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/cgroup_local_storage.c
 create mode 100644 tools/testing/selftests/bpf/progs/cgroup_local_storage.c
 create mode 100644 tools/testing/selftests/bpf/progs/cgroup_ls_recursion.c

diff --git a/tools/testing/selftests/bpf/prog_tests/cgroup_local_storage.c b/tools/testing/selftests/bpf/prog_tests/cgroup_local_storage.c
new file mode 100644
index 000000000000..4fe8862d275c
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/cgroup_local_storage.c
@@ -0,0 +1,92 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021 Facebook */
+
+#define _GNU_SOURCE         /* See feature_test_macros(7) */
+#include <unistd.h>
+#include <sys/syscall.h>   /* For SYS_xxx definitions */
+#include <sys/types.h>
+#include <test_progs.h>
+#include "cgroup_local_storage.skel.h"
+#include "cgroup_ls_recursion.skel.h"
+
+static void test_sys_enter_exit(int cgroup_fd)
+{
+	struct cgroup_local_storage *skel;
+	long val1 = 1, val2 = 0;
+	int err;
+
+	skel = cgroup_local_storage__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_open_and_load"))
+		return;
+
+	/* populate a value in cg_storage_2 */
+	err = bpf_map_update_elem(bpf_map__fd(skel->maps.cg_storage_2), &cgroup_fd, &val1, BPF_ANY);
+	if (!ASSERT_OK(err, "map_update_elem"))
+		goto out;
+
+	/* check value */
+	err = bpf_map_lookup_elem(bpf_map__fd(skel->maps.cg_storage_2), &cgroup_fd, &val2);
+	if (!ASSERT_OK(err, "map_lookup_elem"))
+		goto out;
+	if (!ASSERT_EQ(val2, 1, "map_lookup_elem, invalid val"))
+		goto out;
+
+	/* delete value */
+	err = bpf_map_delete_elem(bpf_map__fd(skel->maps.cg_storage_2), &cgroup_fd);
+	if (!ASSERT_OK(err, "map_delete_elem"))
+		goto out;
+
+	skel->bss->target_pid = syscall(SYS_gettid);
+
+	err = cgroup_local_storage__attach(skel);
+	if (!ASSERT_OK(err, "skel_attach"))
+		goto out;
+
+	syscall(SYS_gettid);
+	syscall(SYS_gettid);
+
+	skel->bss->target_pid = 0;
+
+	/* 3x syscalls: 1x attach and 2x gettid */
+	ASSERT_EQ(skel->bss->enter_cnt, 3, "enter_cnt");
+	ASSERT_EQ(skel->bss->exit_cnt, 3, "exit_cnt");
+	ASSERT_EQ(skel->bss->mismatch_cnt, 0, "mismatch_cnt");
+out:
+	cgroup_local_storage__destroy(skel);
+}
+
+static void test_recursion(int cgroup_fd)
+{
+	struct cgroup_ls_recursion *skel;
+	int err;
+
+	skel = cgroup_ls_recursion__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_open_and_load"))
+		return;
+
+	err = cgroup_ls_recursion__attach(skel);
+	if (!ASSERT_OK(err, "skel_attach"))
+		goto out;
+
+	/* trigger sys_enter, make sure it does not cause deadlock */
+	syscall(SYS_gettid);
+
+out:
+	cgroup_ls_recursion__destroy(skel);
+}
+
+void test_cgroup_local_storage(void)
+{
+	int cgroup_fd;
+
+	cgroup_fd = test__join_cgroup("/cgroup_local_storage");
+	if (!ASSERT_GE(cgroup_fd, 0, "join_cgroup /cgroup_local_storage"))
+		return;
+
+	if (test__start_subtest("sys_enter_exit"))
+		test_sys_enter_exit(cgroup_fd);
+	if (test__start_subtest("recursion"))
+		test_recursion(cgroup_fd);
+
+	close(cgroup_fd);
+}
diff --git a/tools/testing/selftests/bpf/progs/cgroup_local_storage.c b/tools/testing/selftests/bpf/progs/cgroup_local_storage.c
new file mode 100644
index 000000000000..5098e99705c6
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/cgroup_local_storage.c
@@ -0,0 +1,88 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+char _license[] SEC("license") = "GPL";
+
+struct {
+	__uint(type, BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, int);
+	__type(value, long);
+} cg_storage_1 SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, int);
+	__type(value, long);
+} cg_storage_2 SEC(".maps");
+
+#define MAGIC_VALUE 0xabcd1234
+
+pid_t target_pid = 0;
+int mismatch_cnt = 0;
+int enter_cnt = 0;
+int exit_cnt = 0;
+
+SEC("tp_btf/sys_enter")
+int BPF_PROG(on_enter, struct pt_regs *regs, long id)
+{
+	struct task_struct *task;
+	long *ptr;
+	int err;
+
+	task = bpf_get_current_task_btf();
+	if (task->pid != target_pid)
+		return 0;
+
+	/* populate value 0 */
+	ptr = bpf_cgroup_local_storage_get(&cg_storage_1, task->cgroups->dfl_cgrp, 0,
+					   BPF_LOCAL_STORAGE_GET_F_CREATE);
+	if (!ptr)
+		return 0;
+
+	/* delete value 0 */
+	err = bpf_cgroup_local_storage_delete(&cg_storage_1, task->cgroups->dfl_cgrp);
+	if (err)
+		return 0;
+
+	/* value is not available */
+	ptr = bpf_cgroup_local_storage_get(&cg_storage_1, task->cgroups->dfl_cgrp, 0, 0);
+	if (ptr)
+		return 0;
+
+	/* re-populate the value */
+	ptr = bpf_cgroup_local_storage_get(&cg_storage_1, task->cgroups->dfl_cgrp, 0,
+					   BPF_LOCAL_STORAGE_GET_F_CREATE);
+	if (!ptr)
+		return 0;
+	__sync_fetch_and_add(&enter_cnt, 1);
+	*ptr = MAGIC_VALUE + enter_cnt;
+
+	return 0;
+}
+
+SEC("tp_btf/sys_exit")
+int BPF_PROG(on_exit, struct pt_regs *regs, long id)
+{
+	struct task_struct *task;
+	long *ptr;
+
+	task = bpf_get_current_task_btf();
+	if (task->pid != target_pid)
+		return 0;
+
+	ptr = bpf_cgroup_local_storage_get(&cg_storage_1, task->cgroups->dfl_cgrp, 0,
+					   BPF_LOCAL_STORAGE_GET_F_CREATE);
+	if (!ptr)
+		return 0;
+
+	__sync_fetch_and_add(&exit_cnt, 1);
+	if (*ptr != MAGIC_VALUE + exit_cnt)
+		__sync_fetch_and_add(&mismatch_cnt, 1);
+	return 0;
+}
diff --git a/tools/testing/selftests/bpf/progs/cgroup_ls_recursion.c b/tools/testing/selftests/bpf/progs/cgroup_ls_recursion.c
new file mode 100644
index 000000000000..862683b4cb1e
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/cgroup_ls_recursion.c
@@ -0,0 +1,70 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+char _license[] SEC("license") = "GPL";
+
+struct {
+	__uint(type, BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, int);
+	__type(value, long);
+} map_a SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, int);
+	__type(value, long);
+} map_b SEC(".maps");
+
+SEC("fentry/bpf_local_storage_lookup")
+int BPF_PROG(on_lookup)
+{
+	struct task_struct *task = bpf_get_current_task_btf();
+
+	bpf_cgroup_local_storage_delete(&map_a, task->cgroups->dfl_cgrp);
+	bpf_cgroup_local_storage_delete(&map_b, task->cgroups->dfl_cgrp);
+	return 0;
+}
+
+SEC("fentry/bpf_local_storage_update")
+int BPF_PROG(on_update)
+{
+	struct task_struct *task = bpf_get_current_task_btf();
+	long *ptr;
+
+	ptr = bpf_cgroup_local_storage_get(&map_a, task->cgroups->dfl_cgrp, 0,
+					   BPF_LOCAL_STORAGE_GET_F_CREATE);
+	if (ptr)
+		*ptr += 1;
+
+	ptr = bpf_cgroup_local_storage_get(&map_b, task->cgroups->dfl_cgrp, 0,
+					   BPF_LOCAL_STORAGE_GET_F_CREATE);
+	if (ptr)
+		*ptr += 1;
+
+	return 0;
+}
+
+SEC("tp_btf/sys_enter")
+int BPF_PROG(on_enter, struct pt_regs *regs, long id)
+{
+	struct task_struct *task;
+	long *ptr;
+
+	task = bpf_get_current_task_btf();
+	ptr = bpf_cgroup_local_storage_get(&map_a, task->cgroups->dfl_cgrp, 0,
+					   BPF_LOCAL_STORAGE_GET_F_CREATE);
+	if (ptr)
+		*ptr = 200;
+
+	ptr = bpf_cgroup_local_storage_get(&map_b, task->cgroups->dfl_cgrp, 0,
+					   BPF_LOCAL_STORAGE_GET_F_CREATE);
+	if (ptr)
+		*ptr = 100;
+	return 0;
+}
--
2.30.2
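The map operations in patch 2 (quoted earlier in the thread) rely on the kernel's pointer-encoded error convention: `cgroup_get_from_fd()` returns an `ERR_PTR` value that `IS_ERR()` detects and `PTR_ERR()`/`PTR_ERR_OR_ZERO()` decode. A user-space mock-up of that idea, lower-cased to make clear it is a sketch rather than `<linux/err.h>`:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MAX_ERRNO 4095

static int probe_target;	/* ordinary object: never lands in the error range */

/* Encode a negative errno in the top MAX_ERRNO values of the pointer range. */
static inline void *err_ptr(long err)
{
	return (void *)(uintptr_t)err;
}

/* True if ptr falls in the reserved error range. */
static inline int is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Recover the negative errno from an error pointer. */
static inline long ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* 0 for a valid pointer, the encoded errno otherwise. */
static inline long ptr_err_or_zero(const void *ptr)
{
	return is_err(ptr) ? ptr_err(ptr) : 0;
}
```

The convention lets one return value carry either a valid pointer or an errno, which is why `bpf_cgroup_storage_update_elem()` can collapse its exit path to a single `PTR_ERR_OR_ZERO(sdata)` as suggested in the review.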
end of thread, other threads:[~2022-10-18 23:12 UTC | newest]

Thread overview: 38+ messages
2022-10-14  4:56 [PATCH bpf-next 0/5] bpf: Implement cgroup local storage available to non-cgroup-attached bpf progs Yonghong Song
2022-10-14  4:56 ` [PATCH bpf-next 1/5] bpf: Make struct cgroup btf id global Yonghong Song
2022-10-14  4:56 ` [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs Yonghong Song
2022-10-17 18:01 ` sdf
2022-10-17 18:25 ` Yosry Ahmed
2022-10-17 18:43 ` Stanislav Fomichev
2022-10-17 18:47 ` Yosry Ahmed
2022-10-17 19:07 ` Stanislav Fomichev
2022-10-17 19:11 ` Yosry Ahmed
2022-10-17 19:26 ` Tejun Heo
2022-10-17 21:07 ` Martin KaFai Lau
2022-10-17 21:23 ` Yosry Ahmed
2022-10-17 23:55 ` Martin KaFai Lau
2022-10-18  0:47 ` Yosry Ahmed
2022-10-17 22:16 ` sdf
2022-10-18  0:52 ` Martin KaFai Lau
2022-10-18  5:59 ` Yonghong Song
2022-10-18 17:08 ` sdf
2022-10-18 17:17 ` Alexei Starovoitov
2022-10-18 18:08 ` Martin KaFai Lau
2022-10-18 18:11 ` Yosry Ahmed
2022-10-18 18:26 ` Yonghong Song
2022-10-18 23:12 ` Andrii Nakryiko
2022-10-17 20:15 ` Yonghong Song
2022-10-17 20:18 ` Yosry Ahmed
2022-10-17 20:13 ` Yonghong Song
2022-10-17 20:10 ` Yonghong Song
2022-10-17 20:14 ` Yosry Ahmed
2022-10-17 20:29 ` Yonghong Song
2022-10-17 19:23 ` Yonghong Song
2022-10-17 21:03 ` Stanislav Fomichev
2022-10-17 22:26 ` Martin KaFai Lau
2022-10-17 18:16 ` David Vernet
2022-10-17 19:45 ` Yonghong Song
2022-10-14  4:56 ` [PATCH bpf-next 3/5] libbpf: Support new cgroup local storage Yonghong Song
2022-10-14  4:56 ` [PATCH bpf-next 4/5] bpftool: " Yonghong Song
2022-10-17 10:26 ` Quentin Monnet
2022-10-14  4:56 ` [PATCH bpf-next 5/5] selftests/bpf: Add selftests for " Yonghong Song