From: Yonghong Song <yhs@meta.com>
To: David Vernet <void@manifault.com>
Cc: Yonghong Song <yhs@fb.com>,
bpf@vger.kernel.org, Alexei Starovoitov <ast@kernel.org>,
Andrii Nakryiko <andrii@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
kernel-team@fb.com, KP Singh <kpsingh@kernel.org>,
Martin KaFai Lau <martin.lau@kernel.org>,
Tejun Heo <tj@kernel.org>
Subject: Re: [PATCH bpf-next v2 2/6] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs
Date: Fri, 21 Oct 2022 15:57:15 -0700 [thread overview]
Message-ID: <c815edb6-b008-07f4-2377-17b53ccdc289@meta.com> (raw)
In-Reply-To: <Y1L5oZdzn3kxZL+G@maniforge.dhcp.thefacebook.com>
On 10/21/22 12:57 PM, David Vernet wrote:
> On Fri, Oct 21, 2022 at 10:33:41AM -0700, Yonghong Song wrote:
>
> [...]
>
>>>> /* Note that tracing related programs such as
>>>> @@ -5435,6 +5443,42 @@ union bpf_attr {
>>>> * **-E2BIG** if user-space has tried to publish a sample which is
>>>> * larger than the size of the ring buffer, or which cannot fit
>>>> * within a struct bpf_dynptr.
>>>> + *
>>>> + * void *bpf_cgrp_storage_get(struct bpf_map *map, struct cgroup *cgroup, void *value, u64 flags)
>>>> + * Description
>>>> + * Get a bpf_local_storage from the *cgroup*.
>>>> + *
>>>> + * Logically, it could be thought of as getting the value from
>>>> + * a *map* with *cgroup* as the **key**. From this
>>>> + * perspective, the usage is not much different from
>>>> + * **bpf_map_lookup_elem**\ (*map*, **&**\ *cgroup*) except this
>>>> + * helper enforces the key must be a cgroup struct and the map must also
>>>> + * be a **BPF_MAP_TYPE_CGRP_STORAGE**.
>>>> + *
>>>> + * Underneath, the value is stored locally at *cgroup* instead of
>>>> + * the *map*. The *map* is used as the bpf-local-storage
>>>> + * "type". The bpf-local-storage "type" (i.e. the *map*) is
>>>> + * searched against all bpf_local_storage residing at *cgroup*.
>>>
>>> IMO this paragraph is a bit hard to parse. Please correct me if I'm
>>> wrong, but I think what it's trying to convey is that when an instance
>>> of cgroup bpf-local-storage is accessed by a program in e.g.
>>> bpf_cgrp_storage_get(), all of the cgroup bpf_local_storage entries are
>>> iterated over in the struct cgroup object until this program's local
>>> storage instance is found. Is that right? If so, perhaps something like
>>> this would be more clear:
>>
>> yes. your above interpretation is correct.
>>
>>>
>>> In reality, the local-storage value is embedded directly inside of the
>>> *cgroup* object itself, rather than being located in the
>>> **BPF_MAP_TYPE_CGRP_STORAGE** map. When the local-storage value is
>>> queried for some *map* on a *cgroup* object, the kernel will perform an
>>> O(n) iteration over all of the live local-storage values for that
>>> *cgroup* object until the local-storage value for the *map* is found.
>>
>> Sounds okay. I can change the explanation like the above.
>
> Thanks!
>
>>>> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
>>>> index 341c94f208f4..3a12e6b400a2 100644
>>>> --- a/kernel/bpf/Makefile
>>>> +++ b/kernel/bpf/Makefile
>>>> @@ -25,7 +25,7 @@ ifeq ($(CONFIG_PERF_EVENTS),y)
>>>> obj-$(CONFIG_BPF_SYSCALL) += stackmap.o
>>>> endif
>>>> ifeq ($(CONFIG_CGROUPS),y)
>>>
>>> I assume that you double checked that it's valid to compile the helper
>>> with CONFIG_CGROUPS && !CONFIG_CGROUP_BPF, but I must admit that even if
>>> that's the case, I'm not following why we would want the map to be
>>> compiled with a different kconfig option than the helper that provides
>>> access to it. If there's a precedent for doing this then I suppose it's
>>> fine, but it does seem wrong and/or at least wasteful to compile these
>>> helpers in if CONFIG_CGROUPS is defined but CONFIG_CGROUP_BPF is not.
>>
>> The following is my understanding.
>> CONFIG_CGROUP_BPF guards kernel/bpf/cgroup.c which contains implementation
>> mostly for cgroup-attached program types, helpers, etc.
>
> Then why are we using it to guard
> BPF_MAP_TYPE(BPF_MAP_TYPE_CGRP_STORAGE, cgrp_storage_map_ops)?
>
>> A lot of other cgroup-related implementation like cgroup_iter, some
>> cgroup related helper (not related to cgroup-attached program types), etc.
>> are guarded with CONFIG_CGROUPS and CONFIG_BPF_SYSCALL.
>>
>> Note that it is totally possible CONFIG_CGROUP_BPF is 'n' while
>> CONFIG_CGROUPS and CONFIG_BPF_SYSCALL are 'y'.
>>
>> So for cgroup local storage implemented in this patch set,
>> using CONFIG_CGROUPS and CONFIG_BPF_SYSCALL seems okay.
>
> I agree that it's fine to use CONFIG_CGROUPS here. What I'm not
> understanding is why we're using CONFIG_CGROUP_BPF to guard defining
> BPF_MAP_TYPE(BPF_MAP_TYPE_CGRP_STORAGE, cgrp_storage_map_ops), and then
> in the Makefile we're using CONFIG_CGROUPS to add bpf_cgrp_storage.o.
>
> In other words, I think there's a mismatch between:
>
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -90,6 +90,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_ARRAY, cgroup_array_map_ops)
> #ifdef CONFIG_CGROUP_BPF
>
> ^^ why this instead of CONFIG_CGROUPS for BPF_MAP_TYPE_CGRP_STORAGE?
>
> BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_STORAGE, cgroup_storage_map_ops)
> BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE, cgroup_storage_map_ops)
> +BPF_MAP_TYPE(BPF_MAP_TYPE_CGRP_STORAGE, cgrp_storage_map_ops)
> #endif
> BPF_MAP_TYPE(BPF_MAP_TYPE_HASH, htab_map_ops)
> BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_HASH, htab_percpu_map_ops)
>
> and
>
> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> index 341c94f208f4..3a12e6b400a2 100644
> --- a/kernel/bpf/Makefile
> +++ b/kernel/bpf/Makefile
> @@ -25,7 +25,7 @@ ifeq ($(CONFIG_PERF_EVENTS),y)
> obj-$(CONFIG_BPF_SYSCALL) += stackmap.o
> endif
> ifeq ($(CONFIG_CGROUPS),y)
> -obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o
> +obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o bpf_cgrp_storage.o
> endif
> obj-$(CONFIG_CGROUP_BPF) += cgroup.o
> ifeq ($(CONFIG_INET),y)
This makes sense. I will guard
BPF_MAP_TYPE(BPF_MAP_TYPE_CGRP_STORAGE, cgrp_storage_map_ops)
with CONFIG_CGROUPS.
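Concretely, the guard in include/linux/bpf_types.h would become
something like the following (a sketch of the intent, not the final
patch):

```
#ifdef CONFIG_CGROUP_BPF
BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_STORAGE, cgroup_storage_map_ops)
BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE, cgroup_storage_map_ops)
#endif
#ifdef CONFIG_CGROUPS
BPF_MAP_TYPE(BPF_MAP_TYPE_CGRP_STORAGE, cgrp_storage_map_ops)
#endif
```

so the map-type registration matches the Makefile condition
(CONFIG_CGROUPS && CONFIG_BPF_SYSCALL) that builds bpf_cgrp_storage.o.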
>
>>>> -obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o
>>>> +obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o bpf_cgrp_storage.o
>>>> endif
>>>> obj-$(CONFIG_CGROUP_BPF) += cgroup.o
>>>> ifeq ($(CONFIG_INET),y)
>
> [...]
>
>>>> + * could be modifying the local_storage->list now.
>>>> + * Thus, no elem can be added-to or deleted-from the
>>>> + * local_storage->list by the bpf_prog or by the bpf-map's syscall.
>>>> + *
>>>> + * It is racing with bpf_local_storage_map_free() alone
>>>> + * when unlinking elem from the local_storage->list and
>>>> + * the map's bucket->list.
>>>> + */
>>>> + bpf_cgrp_storage_lock();
>>>> + raw_spin_lock_irqsave(&local_storage->lock, flags);
>>>> + hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) {
>>>> + bpf_selem_unlink_map(selem);
>>>> + free_cgroup_storage =
>>>> + bpf_selem_unlink_storage_nolock(local_storage, selem, false, false);
>>>
>>> This still requires a comment explaining why it's OK to overwrite
>>> free_cgroup_storage with a previous value from calling
>>> bpf_selem_unlink_storage_nolock(). Even if that is safe, this looks like
>>> a pretty weird programming pattern, and IMO doing this feels more
>>> intentional and future-proof:
>>>
>>> if (bpf_selem_unlink_storage_nolock(local_storage, selem, false, false))
>>> free_cgroup_storage = true;
>>
>> We have a comment a few lines below.
>> /* free_cgroup_storage should always be true as long as
>> * local_storage->list was non-empty.
>> */
>> if (free_cgroup_storage)
>> kfree_rcu(local_storage, rcu);
>
> IMO that comment doesn't provide much useful information -- it states an
> assumption, but doesn't give a reason for it.
>
>> I will add more explanation in the above code like
>>
>> bpf_selem_unlink_map(selem);
>> /* If local_storage list only have one element, the
>> * bpf_selem_unlink_storage_nolock() will return true.
>> * Otherwise, it will return false. The current loop iteration
>> * intends to remove all local storage. So the last iteration
>> * of the loop will set the free_cgroup_storage to true.
>> */
>> free_cgroup_storage =
>> bpf_selem_unlink_storage_nolock(local_storage, selem, false, false);
>
> Thanks, this is the type of comment I was looking for.
>
> Also, I realize this was copy-pasted from a number of other possible
> locations in the codebase which are doing the same thing, but I still
> think this pattern is an odd and brittle way to do this. We're relying
> on an abstracted implementation detail of
> bpf_selem_unlink_storage_nolock() for correctness, which IMO is a signal
> that bpf_selem_unlink_storage_nolock() should probably be the one
> invoking kfree_rcu() on behalf of callers in the first place. It looks
> like all of the callers end up calling kfree_rcu() on the struct
> bpf_local_storage * if bpf_selem_unlink_storage_nolock() returns true,
> so can we just move the responsibility of freeing the local storage
> object down into bpf_selem_unlink_storage_nolock() where it's unlinked?
We probably cannot do this: bpf_selem_unlink_storage_nolock() runs
inside the rcu_read_lock() region, while we call kfree_rcu() outside
the rcu_read_lock() region.
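The shape of the constraint is roughly this (a non-compilable sketch of
the free path, with declarations omitted):

```
rcu_read_lock();
raw_spin_lock_irqsave(&local_storage->lock, flags);
hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) {
	bpf_selem_unlink_map(selem);
	/* true only on the iteration that empties the list */
	free_cgroup_storage =
		bpf_selem_unlink_storage_nolock(local_storage, selem, false, false);
}
raw_spin_unlock_irqrestore(&local_storage->lock, flags);
rcu_read_unlock();

/* must happen after the read-side critical section ends */
if (free_cgroup_storage)
	kfree_rcu(local_storage, rcu);
```

so moving the kfree_rcu() into the nolock helper would put it inside
the region it needs to stay out of.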
>
> IMO this can be done in a separate patch set, if we decide it's worth
> doing at all.
>
>>>
>>>> + }
>>>> + raw_spin_unlock_irqrestore(&local_storage->lock, flags);
>>>> + bpf_cgrp_storage_unlock();
>>>> + rcu_read_unlock();
>>>> +
>>>> + /* free_cgroup_storage should always be true as long as
>>>> + * local_storage->list was non-empty.
>>>> + */
>>>> + if (free_cgroup_storage)
>>>> + kfree_rcu(local_storage, rcu);
>>>> +}
>>>> +
>>>> +static struct bpf_local_storage_data *
>>>> +cgroup_storage_lookup(struct cgroup *cgroup, struct bpf_map *map, bool cacheit_lockit)
>>>> +{
>>>> + struct bpf_local_storage *cgroup_storage;
>>>> + struct bpf_local_storage_map *smap;
>>>> +
>>>> + cgroup_storage = rcu_dereference_check(cgroup->bpf_cgrp_storage,
>>>> + bpf_rcu_lock_held());
>>>> + if (!cgroup_storage)
>>>> + return NULL;
>>>> +
>>>> + smap = (struct bpf_local_storage_map *)map;
>>>> + return bpf_local_storage_lookup(cgroup_storage, smap, cacheit_lockit);
>>>> +}
>>>> +
>>>> +static void *bpf_cgrp_storage_lookup_elem(struct bpf_map *map, void *key)
>>>> +{
>>>> + struct bpf_local_storage_data *sdata;
>>>> + struct cgroup *cgroup;
>>>> + int fd;
>>>> +
>>>> + fd = *(int *)key;
>>>> + cgroup = cgroup_get_from_fd(fd);
>>>> + if (IS_ERR(cgroup))
>>>> + return ERR_CAST(cgroup);
>>>> +
>>>> + bpf_cgrp_storage_lock();
>>>> + sdata = cgroup_storage_lookup(cgroup, map, true);
>>>> + bpf_cgrp_storage_unlock();
>>>> + cgroup_put(cgroup);
>>>> + return sdata ? sdata->data : NULL;
>>>> +}
>>>
>>> Stanislav pointed out in the v1 revision that there's a lot of very
>>> similar logic in task storage, and I think you'd mentioned that you were
>>> going to think about generalizing some of that. Have you had a chance to
>>> consider?
>>
>> It is hard to have a common function for
>> lookup_elem/update_elem/delete_elem(). They are quite different as each
>> heavily involves
>> task/cgroup-specific functions.
>
> Yes agreed, each implementation is acquiring their own references, and
> finding the backing element in whatever way it was implemented, etc.
>
>> but map_alloc and map_free could have common helpers.
>
> Agreed, and many of the static functions that are invoked on those paths
> such as bpf_cgrp_storage_free(), bpf_cgrp_storage_lock(), etc possibly
> as well. In general this feels like something we could pretty easily
> simplify using something like a structure with callbacks to implement
> the pieces of logic that are specific to each local storage type, such
> as getting the struct bpf_local_storage __rcu
> * pointer from some context (e.g. cgroup_storage_ptr()). It doesn't
> necessarily need to block this change, but IMO we should clean this up
> soon because a lot of this is nearly a 100% copy-paste of other local
> storage implementations.
Further refactoring is possible. Martin is working on simplifying the
locking mechanism; we can wait until that is done before doing the
refactoring.
>
> Thanks,
> David