From: Yonghong Song <yhs@meta.com>
To: David Vernet <void@manifault.com>, Yonghong Song <yhs@fb.com>
Cc: bpf@vger.kernel.org, Alexei Starovoitov <ast@kernel.org>,
Andrii Nakryiko <andrii@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
kernel-team@fb.com, KP Singh <kpsingh@kernel.org>,
Martin KaFai Lau <martin.lau@kernel.org>,
Tejun Heo <tj@kernel.org>
Subject: Re: [PATCH bpf-next v3 3/7] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs
Date: Sun, 23 Oct 2022 10:19:28 -0700 [thread overview]
Message-ID: <8344762f-5f24-cf05-4062-876fd94a3e89@meta.com> (raw)
In-Reply-To: <Y1TQof8LPg1Btdbq@maniforge.dhcp.thefacebook.com>
On 10/22/22 10:26 PM, David Vernet wrote:
> On Fri, Oct 21, 2022 at 04:44:32PM -0700, Yonghong Song wrote:
>> Similar to sk/inode/task storage, implement cgroup local storage.
>>
>> There already exists a local storage implementation for cgroup-attached
>> bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper
>> bpf_get_local_storage(). But there are use cases where non-cgroup-attached
>> bpf progs want to access cgroup local storage data. For example,
>> a tc egress prog has access to both sk and cgroup. It is possible to emulate
>> cgroup local storage with sk local storage by storing data in the socket,
>> but this is wasteful since there could be many sockets belonging to a
>> particular cgroup. Alternatively, a separate map can be created with the
>> cgroup id as the key, but this introduces additional overhead to manipulate
>> the new map. A cgroup local storage, similar to the existing sk/inode/task
>> storage, serves this use case better.
>>
>> The life-cycle of the storage is managed with the life-cycle of the
>> cgroup struct, i.e., the storage is destroyed along with the owning cgroup,
>> with a callback to bpf_cgrp_storage_free when the cgroup itself
>
> Small nit: This isn't really done as a callback, it's just a normal
> function call, right?
Oh, yes, it is just a function call. Will make the change.
>
>> is deleted.
>>
>> The userspace map operations can be done by using a cgroup fd as a key
>> passed to the lookup, update and delete operations.
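
As an illustrative sketch of the userspace flow described above (this code is
not part of the patch; the map fd, cgroup path, and value type are assumptions
for illustration), a lookup/update keyed by a cgroup fd could look like:

```c
/* Sketch only: assumes a BPF_MAP_TYPE_CGRP_STORAGE map has already been
 * loaded and its fd is map_fd; cgrp_path and the long value are illustrative.
 */
#include <fcntl.h>
#include <unistd.h>
#include <bpf/bpf.h>

int cgrp_storage_update(int map_fd, const char *cgrp_path, long val)
{
	/* The key is a cgroup fd; the kernel resolves it to the cgroup. */
	int cgrp_fd = open(cgrp_path, O_RDONLY);
	if (cgrp_fd < 0)
		return -1;

	/* This also pre-allocates the storage for the cgroup, avoiding a
	 * potential run-time allocation failure in the bpf program.
	 */
	int err = bpf_map_update_elem(map_fd, &cgrp_fd, &val, BPF_ANY);

	close(cgrp_fd);
	return err;
}
```

Lookup and delete follow the same pattern with bpf_map_lookup_elem() and
bpf_map_delete_elem(), each taking the cgroup fd as the key.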
>>
>> Typically, the following code is used to get the current cgroup:
>> struct task_struct *task = bpf_get_current_task_btf();
>> ... task->cgroups->dfl_cgrp ...
>> and in structure task_struct definition:
>> struct task_struct {
>> ....
>> struct css_set __rcu *cgroups;
>> ....
>> }
>> With a sleepable program, accessing task->cgroups is not protected by
>> rcu_read_lock(). So the current implementation only supports non-sleepable
>> programs; supporting sleepable programs will be a next step, together with
>> adding rcu_read_lock() protection for rcu-tagged structures.
>>
>> Since the map name BPF_MAP_TYPE_CGROUP_STORAGE has already been used for the
>> old cgroup local storage support, the new map name BPF_MAP_TYPE_CGRP_STORAGE
>> is used for the cgroup storage available to non-cgroup-attached bpf programs.
>> The old cgroup storage supports the bpf_get_local_storage() helper to get the
>> cgroup data. The new cgroup storage helper bpf_cgrp_storage_get() provides
>> similar functionality. While the old cgroup storage pre-allocates storage
>> memory, the new mechanism can also pre-allocate with a user space
>> bpf_map_update_elem() call to avoid potential run-time memory allocation
>> failure. Therefore, the new cgroup storage provides all the functionality of
>> the old one. So in uapi bpf.h, the old BPF_MAP_TYPE_CGROUP_STORAGE is aliased
>> to BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED to indicate that the old cgroup
>> storage can be deprecated since the new one provides the same functionality.
>>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>> ---
>
> [...]
>
>> diff --git a/kernel/bpf/bpf_cgrp_storage.c b/kernel/bpf/bpf_cgrp_storage.c
>> new file mode 100644
>> index 000000000000..770c9c28215a
>> --- /dev/null
>> +++ b/kernel/bpf/bpf_cgrp_storage.c
>> @@ -0,0 +1,268 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
>> + */
>> +
>> +#include <linux/types.h>
>> +#include <linux/bpf.h>
>> +#include <linux/bpf_local_storage.h>
>> +#include <uapi/linux/btf.h>
>> +#include <linux/btf_ids.h>
>> +
>> +DEFINE_BPF_STORAGE_CACHE(cgroup_cache);
>> +
>> +static DEFINE_PER_CPU(int, bpf_cgrp_storage_busy);
>> +
>> +static void bpf_cgrp_storage_lock(void)
>> +{
>> + migrate_disable();
>> + this_cpu_inc(bpf_cgrp_storage_busy);
>> +}
>> +
>> +static void bpf_cgrp_storage_unlock(void)
>> +{
>> + this_cpu_dec(bpf_cgrp_storage_busy);
>> + migrate_enable();
>> +}
>> +
>> +static bool bpf_cgrp_storage_trylock(void)
>> +{
>> + migrate_disable();
>> + if (unlikely(this_cpu_inc_return(bpf_cgrp_storage_busy) != 1)) {
>> + this_cpu_dec(bpf_cgrp_storage_busy);
>> + migrate_enable();
>> + return false;
>> + }
>> + return true;
>> +}
>> +
>> +static struct bpf_local_storage __rcu **cgroup_storage_ptr(void *owner)
>> +{
>> + struct cgroup *cg = owner;
>> +
>> + return &cg->bpf_cgrp_storage;
>> +}
>> +
>> +void bpf_cgrp_storage_free(struct cgroup *cgroup)
>
> I was originally going to ask what you thought about also merging this
> logic into bpf_local_storage.h, but I think it's fine to just land this
> as is and refactor after.
>
> I do think it would be a good cleanup to later refactor a lot of the
> local storage logic to be callback based (assuming we're ok with an
> extra indirect call), as much of what it's doing is almost the exact
> same thing in a very slightly different way. For example,
> bpf_pid_task_storage_lookup_elem() is looking up a pid by an fd,
> acquiring a reference, and then returning the struct
> bpf_local_storage_data * embedded in the task struct. If doing that in
> general sounds like a reasonable idea, I can take care of it as
> follow-on work after this lands.
Thanks!
>
>> +{
>> + struct bpf_local_storage *local_storage;
>> + struct bpf_local_storage_elem *selem;
>> + bool free_cgroup_storage = false;
>> + struct hlist_node *n;
>> + unsigned long flags;
>> +
>> + rcu_read_lock();
>> + local_storage = rcu_dereference(cgroup->bpf_cgrp_storage);
>> + if (!local_storage) {
>> + rcu_read_unlock();
>> + return;
>> + }
>> +
>> + /* Neither the bpf_prog nor the bpf_map's syscall
>> + * could be modifying the local_storage->list now.
>> + * Thus, no elem can be added to or deleted from the
>> + * local_storage->list by the bpf_prog or by the bpf_map's syscall.
>> + *
>> + * It is racing with __bpf_local_storage_map_free() alone
>> + * when unlinking elem from the local_storage->list and
>> + * the map's bucket->list.
>> + */
>> + bpf_cgrp_storage_lock();
>> + raw_spin_lock_irqsave(&local_storage->lock, flags);
>> + hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) {
>> + bpf_selem_unlink_map(selem);
>> + /* If local_storage list has only one element, the
>> + * bpf_selem_unlink_storage_nolock() will return true.
>> + * Otherwise, it will return false. The current loop iteration
>> + * intends to remove all local storage. So the last iteration
>> + * of the loop will set the free_cgroup_storage to true.
>> + */
>> + free_cgroup_storage =
>> + bpf_selem_unlink_storage_nolock(local_storage, selem, false, false);
>> + }
>> + raw_spin_unlock_irqrestore(&local_storage->lock, flags);
>> + bpf_cgrp_storage_unlock();
>> + rcu_read_unlock();
>> +
>> + if (free_cgroup_storage)
>> + kfree_rcu(local_storage, rcu);
>> +}
>> +
>> +static struct bpf_local_storage_data *
>> +cgroup_storage_lookup(struct cgroup *cgroup, struct bpf_map *map, bool cacheit_lockit)
>> +{
>> + struct bpf_local_storage *cgroup_storage;
>> + struct bpf_local_storage_map *smap;
>> +
>> + cgroup_storage = rcu_dereference_check(cgroup->bpf_cgrp_storage,
>> + bpf_rcu_lock_held());
>> + if (!cgroup_storage)
>> + return NULL;
>> +
>> + smap = (struct bpf_local_storage_map *)map;
>> + return bpf_local_storage_lookup(cgroup_storage, smap, cacheit_lockit);
>> +}
>> +
>> +static void *bpf_cgrp_storage_lookup_elem(struct bpf_map *map, void *key)
>> +{
>> + struct bpf_local_storage_data *sdata;
>> + struct cgroup *cgroup;
>> + int fd;
>> +
>> + fd = *(int *)key;
>> + cgroup = cgroup_get_from_fd(fd);
>> + if (IS_ERR(cgroup))
>> + return ERR_CAST(cgroup);
>> +
>> + bpf_cgrp_storage_lock();
>> + sdata = cgroup_storage_lookup(cgroup, map, true);
>> + bpf_cgrp_storage_unlock();
>> + cgroup_put(cgroup);
>> + return sdata ? sdata->data : NULL;
>
> Do you think it's worth it to add a WARN_ON_ONCE(!rcu_read_lock_held());
> somewhere in this function?
We should be okay here. bpf_map_lookup_elem() is not allowed for a
CGRP_STORAGE map in bpf programs:
        case BPF_MAP_TYPE_CGRP_STORAGE:
                if (func_id != BPF_FUNC_cgrp_storage_get &&
                    func_id != BPF_FUNC_cgrp_storage_delete)
                        goto error;
                break;
At the syscall side, we have an explicit rcu_read_lock()/unlock() in
kernel/bpf/syscall.c to protect
        ptr = map->ops->map_lookup_elem(map, key);
So WARN_ON_ONCE(!rcu_read_lock_held()) will never hit.
We are fine here.
>
> [...]
>
>> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
>> index 764bdd5fd8d1..7e80e15fae4e 100644
>> --- a/kernel/cgroup/cgroup.c
>> +++ b/kernel/cgroup/cgroup.c
>> @@ -5227,6 +5227,10 @@ static void css_free_rwork_fn(struct work_struct *work)
>> struct cgroup_subsys *ss = css->ss;
>> struct cgroup *cgrp = css->cgroup;
>>
>> +#ifdef CONFIG_CGROUP_BPF
>
> I think this should be #ifdef CONFIG_BPF_SYSCALL?
Yes, this is my oversight. I forgot this place while changing the
structure definition site.
>
>> + bpf_cgrp_storage_free(cgrp);
>> +#endif
>> +
>
> This looks pretty close to ready from my end, just a couple more
> small questions / comments.
>
> Thanks,
> David