From: Yonghong Song <yhs@meta.com>
To: David Vernet <void@manifault.com>, Yonghong Song <yhs@fb.com>
Cc: bpf@vger.kernel.org, Alexei Starovoitov <ast@kernel.org>,
Andrii Nakryiko <andrii@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
kernel-team@fb.com, KP Singh <kpsingh@kernel.org>,
Martin KaFai Lau <martin.lau@kernel.org>,
Tejun Heo <tj@kernel.org>
Subject: Re: [PATCH bpf-next v3 3/7] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs
Date: Sun, 23 Oct 2022 10:19:28 -0700 [thread overview]
Message-ID: <8344762f-5f24-cf05-4062-876fd94a3e89@meta.com> (raw)
In-Reply-To: <Y1TQof8LPg1Btdbq@maniforge.dhcp.thefacebook.com>
On 10/22/22 10:26 PM, David Vernet wrote:
> On Fri, Oct 21, 2022 at 04:44:32PM -0700, Yonghong Song wrote:
>> Similar to sk/inode/task storage, implement cgroup local storage.
>>
>> There already exists a local storage implementation for cgroup-attached
>> bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper
>> bpf_get_local_storage(). But there are use cases where non-cgroup-attached
>> bpf progs want to access cgroup local storage data. For example,
>> a tc egress prog has access to both sk and cgroup. It is possible to emulate
>> cgroup local storage with sk local storage by storing data in the socket,
>> but this is wasteful since there could be many sockets belonging to a
>> particular cgroup. Alternatively, a separate map can be created with the
>> cgroup id as the key, but this introduces additional overhead to manipulate
>> the new map. A cgroup local storage, similar to the existing sk/inode/task
>> storage, serves this use case better.
>>
>> The life-cycle of the storage is managed with the life-cycle of the
>> cgroup struct, i.e., the storage is destroyed along with the owning cgroup,
>> with a callback to bpf_cgrp_storage_free when the cgroup itself
>
> Small nit: This isn't really done as a callback, it's just a normal
> function call, right?
Oh, yes, it is just a function call. Will make the change.
>
>> is deleted.
>>
>> The userspace map operations can be done by using a cgroup fd as a key
>> passed to the lookup, update and delete operations.
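
As an illustrative sketch of the userspace flow described above (this code is
not part of the patch; the map fd, cgroup path, and value type are assumptions
for illustration), a lookup/update keyed by a cgroup fd could look like:

```c
/* Sketch only: assumes a BPF_MAP_TYPE_CGRP_STORAGE map has already been
 * loaded and its fd is map_fd; cgrp_path and the long value are illustrative.
 */
#include <fcntl.h>
#include <unistd.h>
#include <bpf/bpf.h>

int cgrp_storage_update(int map_fd, const char *cgrp_path, long val)
{
	/* The key is a cgroup fd; the kernel resolves it to the cgroup. */
	int cgrp_fd = open(cgrp_path, O_RDONLY);
	if (cgrp_fd < 0)
		return -1;

	/* This also pre-allocates the storage for the cgroup, avoiding a
	 * potential run-time allocation failure in the bpf program.
	 */
	int err = bpf_map_update_elem(map_fd, &cgrp_fd, &val, BPF_ANY);

	close(cgrp_fd);
	return err;
}
```

Lookup and delete follow the same pattern with bpf_map_lookup_elem() and
bpf_map_delete_elem(), each taking the cgroup fd as the key.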
>>
>> Typically, the following code is used to get the current cgroup:
>> struct task_struct *task = bpf_get_current_task_btf();
>> ... task->cgroups->dfl_cgrp ...
>> and in structure task_struct definition:
>> struct task_struct {
>> ....
>> struct css_set __rcu *cgroups;
>> ....
>> }
>> With a sleepable program, accessing task->cgroups is not protected by
>> rcu_read_lock(). So the current implementation only supports non-sleepable
>> programs; supporting sleepable programs will be a next step, together with
>> adding rcu_read_lock() protection for rcu-tagged structures.
>>
>> Since the map name BPF_MAP_TYPE_CGROUP_STORAGE has already been used for the
>> old cgroup local storage support, the new map name BPF_MAP_TYPE_CGRP_STORAGE
>> is used for the cgroup storage available to non-cgroup-attached bpf programs.
>> The old cgroup storage supports the bpf_get_local_storage() helper to get the
>> cgroup data. The new cgroup storage helper bpf_cgrp_storage_get() provides
>> similar functionality. While the old cgroup storage pre-allocates storage
>> memory, the new mechanism can also pre-allocate with a user space
>> bpf_map_update_elem() call to avoid potential run-time memory allocation
>> failure. Therefore, the new cgroup storage provides all the functionality of
>> the old one. So in uapi bpf.h, the old BPF_MAP_TYPE_CGROUP_STORAGE is aliased
>> to BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED to indicate that the old cgroup
>> storage can be deprecated since the new one provides the same functionality.
>>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>> ---
>
> [...]
>
>> diff --git a/kernel/bpf/bpf_cgrp_storage.c b/kernel/bpf/bpf_cgrp_storage.c
>> new file mode 100644
>> index 000000000000..770c9c28215a
>> --- /dev/null
>> +++ b/kernel/bpf/bpf_cgrp_storage.c
>> @@ -0,0 +1,268 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
>> + */
>> +
>> +#include <linux/types.h>
>> +#include <linux/bpf.h>
>> +#include <linux/bpf_local_storage.h>
>> +#include <uapi/linux/btf.h>
>> +#include <linux/btf_ids.h>
>> +
>> +DEFINE_BPF_STORAGE_CACHE(cgroup_cache);
>> +
>> +static DEFINE_PER_CPU(int, bpf_cgrp_storage_busy);
>> +
>> +static void bpf_cgrp_storage_lock(void)
>> +{
>> + migrate_disable();
>> + this_cpu_inc(bpf_cgrp_storage_busy);
>> +}
>> +
>> +static void bpf_cgrp_storage_unlock(void)
>> +{
>> + this_cpu_dec(bpf_cgrp_storage_busy);
>> + migrate_enable();
>> +}
>> +
>> +static bool bpf_cgrp_storage_trylock(void)
>> +{
>> + migrate_disable();
>> + if (unlikely(this_cpu_inc_return(bpf_cgrp_storage_busy) != 1)) {
>> + this_cpu_dec(bpf_cgrp_storage_busy);
>> + migrate_enable();
>> + return false;
>> + }
>> + return true;
>> +}
>> +
>> +static struct bpf_local_storage __rcu **cgroup_storage_ptr(void *owner)
>> +{
>> + struct cgroup *cg = owner;
>> +
>> + return &cg->bpf_cgrp_storage;
>> +}
>> +
>> +void bpf_cgrp_storage_free(struct cgroup *cgroup)
>
> I was originally going to ask what you thought about also merging this
> logic into bpf_local_storage.h, but I think it's fine to just land this
> as is and refactor after.
>
> I do think it would be a good cleanup to later refactor a lot of the
> local storage logic to be callback based (assuming we're ok with an
> extra indirect call), as much of what it's doing is almost the exact
> same thing in a very slightly different way. For example,
> bpf_pid_task_storage_lookup_elem() is looking up a pid by an fd,
> acquiring a reference, and then returning the struct
> bpf_local_storage_data * embedded in the task struct. If doing that in
> general sounds like a reasonable idea, I can take care of it as
> follow-on work after this lands.
Thanks!
>
>> +{
>> + struct bpf_local_storage *local_storage;
>> + struct bpf_local_storage_elem *selem;
>> + bool free_cgroup_storage = false;
>> + struct hlist_node *n;
>> + unsigned long flags;
>> +
>> + rcu_read_lock();
>> + local_storage = rcu_dereference(cgroup->bpf_cgrp_storage);
>> + if (!local_storage) {
>> + rcu_read_unlock();
>> + return;
>> + }
>> +
>> + /* Neither the bpf_prog nor the bpf_map's syscall
>> + * could be modifying the local_storage->list now.
>> + * Thus, no elem can be added to or deleted from the
>> + * local_storage->list by the bpf_prog or by the bpf_map's syscall.
>> + *
>> + * It is racing with __bpf_local_storage_map_free() alone
>> + * when unlinking elem from the local_storage->list and
>> + * the map's bucket->list.
>> + */
>> + bpf_cgrp_storage_lock();
>> + raw_spin_lock_irqsave(&local_storage->lock, flags);
>> + hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) {
>> + bpf_selem_unlink_map(selem);
>> + /* If local_storage list has only one element, the
>> + * bpf_selem_unlink_storage_nolock() will return true.
>> + * Otherwise, it will return false. The current loop iteration
>> + * intends to remove all local storage. So the last iteration
>> + * of the loop will set the free_cgroup_storage to true.
>> + */
>> + free_cgroup_storage =
>> + bpf_selem_unlink_storage_nolock(local_storage, selem, false, false);
>> + }
>> + raw_spin_unlock_irqrestore(&local_storage->lock, flags);
>> + bpf_cgrp_storage_unlock();
>> + rcu_read_unlock();
>> +
>> + if (free_cgroup_storage)
>> + kfree_rcu(local_storage, rcu);
>> +}
>> +
>> +static struct bpf_local_storage_data *
>> +cgroup_storage_lookup(struct cgroup *cgroup, struct bpf_map *map, bool cacheit_lockit)
>> +{
>> + struct bpf_local_storage *cgroup_storage;
>> + struct bpf_local_storage_map *smap;
>> +
>> + cgroup_storage = rcu_dereference_check(cgroup->bpf_cgrp_storage,
>> + bpf_rcu_lock_held());
>> + if (!cgroup_storage)
>> + return NULL;
>> +
>> + smap = (struct bpf_local_storage_map *)map;
>> + return bpf_local_storage_lookup(cgroup_storage, smap, cacheit_lockit);
>> +}
>> +
>> +static void *bpf_cgrp_storage_lookup_elem(struct bpf_map *map, void *key)
>> +{
>> + struct bpf_local_storage_data *sdata;
>> + struct cgroup *cgroup;
>> + int fd;
>> +
>> + fd = *(int *)key;
>> + cgroup = cgroup_get_from_fd(fd);
>> + if (IS_ERR(cgroup))
>> + return ERR_CAST(cgroup);
>> +
>> + bpf_cgrp_storage_lock();
>> + sdata = cgroup_storage_lookup(cgroup, map, true);
>> + bpf_cgrp_storage_unlock();
>> + cgroup_put(cgroup);
>> + return sdata ? sdata->data : NULL;
>
> Do you think it's worth it to add a WARN_ON_ONCE(!rcu_read_lock_held());
> somewhere in this function?
We should be okay here. bpf_map_lookup_elem() is not allowed for a
CGRP_STORAGE map in bpf programs:
        case BPF_MAP_TYPE_CGRP_STORAGE:
                if (func_id != BPF_FUNC_cgrp_storage_get &&
                    func_id != BPF_FUNC_cgrp_storage_delete)
                        goto error;
                break;
At the syscall side, we have an explicit rcu_read_lock()/unlock() in
kernel/bpf/syscall.c to protect
        ptr = map->ops->map_lookup_elem(map, key);
So WARN_ON_ONCE(!rcu_read_lock_held()) will never hit.
We are fine here.
>
> [...]
>
>> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
>> index 764bdd5fd8d1..7e80e15fae4e 100644
>> --- a/kernel/cgroup/cgroup.c
>> +++ b/kernel/cgroup/cgroup.c
>> @@ -5227,6 +5227,10 @@ static void css_free_rwork_fn(struct work_struct *work)
>> struct cgroup_subsys *ss = css->ss;
>> struct cgroup *cgrp = css->cgroup;
>>
>> +#ifdef CONFIG_CGROUP_BPF
>
> I think this should be #ifdef CONFIG_BPF_SYSCALL?
Yes, this is my oversight. I forgot this place while changing the
structure definition site.
>
>> + bpf_cgrp_storage_free(cgrp);
>> +#endif
>> +
>
> This looks pretty close to ready from my end, just a couple more
> small questions / comments.
>
> Thanks,
> David