From: Yonghong Song <yhs@meta.com>
To: Yosry Ahmed <yosryahmed@google.com>,
Martin KaFai Lau <martin.lau@linux.dev>
Cc: Yonghong Song <yhs@fb.com>, Alexei Starovoitov <ast@kernel.org>,
Andrii Nakryiko <andrii@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
kernel-team@fb.com, KP Singh <kpsingh@kernel.org>,
Martin KaFai Lau <martin.lau@kernel.org>,
Tejun Heo <tj@kernel.org>,
bpf@vger.kernel.org, Roman Gushchin <roman.gushchin@linux.dev>
Subject: Re: [PATCH bpf-next v4 3/7] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs
Date: Tue, 25 Oct 2022 12:53:55 -0700 [thread overview]
Message-ID: <5fc082a3-1aff-edd6-ca50-e9d99ea99743@meta.com> (raw)
In-Reply-To: <CAJD7tkb-tvLbJm2-Zq4YKFZ37ZW==sbHTHOxtdeSzQpDcTcTtw@mail.gmail.com>
On 10/24/22 10:44 PM, Yosry Ahmed wrote:
> On Mon, Oct 24, 2022 at 6:36 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> On 10/24/22 5:48 PM, Yosry Ahmed wrote:
>>> On Mon, Oct 24, 2022 at 5:32 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>>>
>>>> On 10/24/22 5:21 PM, Yosry Ahmed wrote:
>>>>> On Mon, Oct 24, 2022 at 2:15 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>>>>>
>>>>>> On 10/23/22 11:05 AM, Yonghong Song wrote:
>>>>>>> +void bpf_cgrp_storage_free(struct cgroup *cgroup)
>>>>>>> +{
>>>>>>> + struct bpf_local_storage *local_storage;
>>>>>>> + struct bpf_local_storage_elem *selem;
>>>>>>> + bool free_cgroup_storage = false;
>>>>>>> + struct hlist_node *n;
>>>>>>> + unsigned long flags;
>>>>>>> +
>>>>>>> + rcu_read_lock();
>>>>>>> + local_storage = rcu_dereference(cgroup->bpf_cgrp_storage);
>>>>>>> + if (!local_storage) {
>>>>>>> + rcu_read_unlock();
>>>>>>> + return;
>>>>>>> + }
>>>>>>> +
>>>>>>> + /* Neither the bpf_prog nor the bpf_map's syscall
>>>>>>> + * could be modifying the local_storage->list now.
>>>>>>> + * Thus, no elem can be added to or deleted from the
>>>>>>> + * local_storage->list by the bpf_prog or by the bpf_map's syscall.
>>>>>>> + *
>>>>>>> + * It is racing with __bpf_local_storage_map_free() alone
>>>>>>> + * when unlinking elem from the local_storage->list and
>>>>>>> + * the map's bucket->list.
>>>>>>> + */
>>>>>>> + bpf_cgrp_storage_lock();
>>>>>>> + raw_spin_lock_irqsave(&local_storage->lock, flags);
>>>>>>> + hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) {
>>>>>>> + bpf_selem_unlink_map(selem);
>>>>>>> + /* If local_storage list has only one element, the
>>>>>>> + * bpf_selem_unlink_storage_nolock() will return true.
>>>>>>> + * Otherwise, it will return false. The current loop iteration
>>>>>>> + * intends to remove all local storage. So the last iteration
>>>>>>> + * of the loop will set the free_cgroup_storage to true.
>>>>>>> + */
>>>>>>> + free_cgroup_storage =
>>>>>>> + bpf_selem_unlink_storage_nolock(local_storage, selem, false, false);
>>>>>>> + }
>>>>>>> + raw_spin_unlock_irqrestore(&local_storage->lock, flags);
>>>>>>> + bpf_cgrp_storage_unlock();
>>>>>>> + rcu_read_unlock();
>>>>>>> +
>>>>>>> + if (free_cgroup_storage)
>>>>>>> + kfree_rcu(local_storage, rcu);
>>>>>>> +}
>>>>>>
>>>>>> [ ... ]
>>>>>>
>>>>>>> +/* *gfp_flags* is a hidden argument provided by the verifier */
>>>>>>> +BPF_CALL_5(bpf_cgrp_storage_get, struct bpf_map *, map, struct cgroup *, cgroup,
>>>>>>> + void *, value, u64, flags, gfp_t, gfp_flags)
>>>>>>> +{
>>>>>>> + struct bpf_local_storage_data *sdata;
>>>>>>> +
>>>>>>> + WARN_ON_ONCE(!bpf_rcu_lock_held());
>>>>>>> + if (flags & ~(BPF_LOCAL_STORAGE_GET_F_CREATE))
>>>>>>> + return (unsigned long)NULL;
>>>>>>> +
>>>>>>> + if (!cgroup)
>>>>>>> + return (unsigned long)NULL;
>>>>>>> +
>>>>>>> + if (!bpf_cgrp_storage_trylock())
>>>>>>> + return (unsigned long)NULL;
>>>>>>> +
>>>>>>> + sdata = cgroup_storage_lookup(cgroup, map, true);
>>>>>>> + if (sdata)
>>>>>>> + goto unlock;
>>>>>>> +
>>>>>>> + /* only allocate new storage, when the cgroup is refcounted */
>>>>>>> + if (!percpu_ref_is_dying(&cgroup->self.refcnt) &&
>>>>>>> + (flags & BPF_LOCAL_STORAGE_GET_F_CREATE))
>>>>>>> + sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map *)map,
>>>>>>> + value, BPF_NOEXIST, gfp_flags);
>>>>>>> +
>>>>>>> +unlock:
>>>>>>> + bpf_cgrp_storage_unlock();
>>>>>>> + return IS_ERR_OR_NULL(sdata) ? (unsigned long)NULL : (unsigned long)sdata->data;
>>>>>>> +}
>>>>>>
>>>>>> [ ... ]
>>>>>>
>>>>>>> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
>>>>>>> index 764bdd5fd8d1..32145d066a09 100644
>>>>>>> --- a/kernel/cgroup/cgroup.c
>>>>>>> +++ b/kernel/cgroup/cgroup.c
>>>>>>> @@ -5227,6 +5227,10 @@ static void css_free_rwork_fn(struct work_struct *work)
>>>>>>> struct cgroup_subsys *ss = css->ss;
>>>>>>> struct cgroup *cgrp = css->cgroup;
>>>>>>>
>>>>>>> +#ifdef CONFIG_BPF_SYSCALL
>>>>>>> + bpf_cgrp_storage_free(cgrp);
>>>>>>> +#endif
>>>>>>
>>>>>>
>>>>>> After revisiting comment 4bfc0bb2c60e, some of the commit message came to my mind:
>>>>>>
>>>>>> " ...... it blocks a possibility to implement
>>>>>> the memcg-based memory accounting for bpf objects, because a circular
>>>>>> reference dependency will occur. Charged memory pages are pinning the
>>>>>> corresponding memory cgroup, and if the memory cgroup is pinning
>>>>>> the attached bpf program, nothing will be ever released."
>>>>>>
>>>>>> Considering the bpf_map_kzalloc() is used in bpf_local_storage_map.c and it can
>>>>>> charge the memcg, I wonder if the cgrp_local_storage will have similar refcnt
>>>>>> loop issue here.
>>>>>>
>>>>>> If here is the right place to free the cgrp_local_storage() and enough to break
>>>>>> this refcnt loop, it will be useful to add some explanation and its
>>>>>> consideration in the commit message.
>>>>>>
>>>>>
>>>>> I think a similar refcount loop issue can happen here as well. IIUC,
>>>>> this function will only be run when the css is released after all
>>>>> references are dropped. So if memcg charging is enabled the cgroup
>>>>> will never be removed. This one might be trickier to handle though..
>>>>
>>>> How about removing all storage from cgrp->bpf_cgrp_storage in
>>>> cgroup_destroy_locked()?
>>>
>>> The problem here is that you lose information for cgroups that went
>>> offline but still exist in the kernel (i.e offline cgroups). The
>>> commit log 4bfc0bb2c60e mentions that such cgroups can have live
>>> sockets attached, so this might be a problem?
>>
>> Keeping the cgrp_storage around is useful for the cgroup-bpf prog that will be
>> called upon some sk events (eg ingress/egress). The cgrp_storage cleanup could
>> be done in cgroup_bpf_release_fn() also such that it will wait till all sk is done.
>>
>>> From a memory perspective, offline memcgs can still undergo memory operations like
>>> reclaim. If we are using BPF to collect cgroup statistics for memory
>>> reclaim, we can't do so for offline memcgs, which is not the end of
>>> the world, but the cgroup storages become slightly less powerful. We
>>> might also lose some data that we have already stored for such offline
>>> memcgs. Also BPF programs now need to handle the case where they have
>>> a valid cgroup pointer but they cannot retrieve a cgroup storage for
>>> it because it went offline.
>>
>> iiuc, the use case is to be able to use the cgrp_storage at some earlier stage
>> of the cgroup destruction.
>
> Yes, exactly. The cgroup gets "offlined" when the user removes the
> directory, but is not actually freed until all references are dropped.
> An offline memcg can still undergo some operations.
>
>> A noob question, I wonder if there is a cgroup that
>> it will never go away, the root cgrp? Then the cgrp_storage cleanup could be
>> more selective and avoid cleaning up the cgrp storage charged to the root cgroup.
>
> Yes, root cgrp doesn't go away, but I am not sure I understand how
> this fixes the problem.
>
> In all cases, Roman confirmed my theory. BPF maps charges are now
> charged through objcg, not directly to memcgs, which means that when
> the cgroup is offline, those charges are moved to the parent. IIUC,
> this means there should not be a refcount loop. When the cgroup
> directory is removed and the cgroup is offlined, the memory charges of
> the cgrp_storage will move to the parent. The cgrp_storage will remain
> accessible until all refs to it are dropped and it is actually freed
> by css_free_rwork_fn(), at that point the cgrp_storage will be freed.
>
> I just realized, I *think* the call to bpf_cgrp_storage_free(cgrp) is
> actually misplaced within css_free_rwork_fn(). Reading the code, it
> seems like css_free_rwork_fn() can be called in two cases:
> (a) We are freeing a css (cgroup_subsys_state), but not the cgroup
> itself (e.g. we are disabling a subsystem controller on a cgroup).
> (b) We are freeing the cgroup itself.
>
> I think we want to free the cgrp_storage only in (b), which would
> correspond to the else statement (but before the nested if).
> Basically, something like this *I think*:
>
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 2319946715e0..f1e6058089f5 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -5349,6 +5349,7 @@ static void css_free_rwork_fn(struct work_struct *work)
> atomic_dec(&cgrp->root->nr_cgrps);
> cgroup1_pidlist_destroy_all(cgrp);
> cancel_work_sync(&cgrp->release_agent_work);
> + bpf_cgrp_storage_free(cgrp);
>
> if (cgroup_parent(cgrp)) {
> /*
>
> Tejun would know better, so please correct me if I am wrong.
Good suggestion. calling bpf_cgrp_storage_free() after
cancel_work_sync() seems a good idea to ensure all pending
works are done.
>
> (FWIW I think it would be nicer if we have an empty stub for
> bpf_cgrp_storage_free() when !CONFIG_BPF_SYSCALL instead of the
> #ifdef, similar to cgroup_bpf_offline()).
Okay, let me give a try.
>
>>
>>> We ideally want to be able to charge the memory to the cgroup without
>>> holding a ref to it, which is against the cgroup memory charging
>>> model.
>>
next prev parent reply other threads:[~2022-10-25 19:54 UTC|newest]
Thread overview: 31+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-10-23 18:05 [PATCH bpf-next v4 0/7] bpf: Implement cgroup local storage available to non-cgroup-attached bpf progs Yonghong Song
2022-10-23 18:05 ` [PATCH bpf-next v4 1/7] bpf: Make struct cgroup btf id global Yonghong Song
2022-10-23 19:59 ` David Vernet
2022-10-23 18:05 ` [PATCH bpf-next v4 2/7] bpf: Refactor inode/task/sk storage map_{alloc,free}() for reuse Yonghong Song
2022-10-24 18:02 ` sdf
2022-10-24 19:08 ` Yonghong Song
2022-10-24 20:34 ` Martin KaFai Lau
2022-10-25 2:28 ` Yonghong Song
2022-10-23 18:05 ` [PATCH bpf-next v4 3/7] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs Yonghong Song
2022-10-23 20:02 ` David Vernet
2022-10-24 19:19 ` Martin KaFai Lau
2022-10-25 0:21 ` Yosry Ahmed
2022-10-25 0:32 ` Martin KaFai Lau
2022-10-25 0:48 ` Yosry Ahmed
2022-10-25 0:55 ` Yosry Ahmed
2022-10-25 2:38 ` Roman Gushchin
2022-10-25 5:46 ` Yosry Ahmed
2022-10-25 1:36 ` Martin KaFai Lau
2022-10-25 5:44 ` Yosry Ahmed
2022-10-25 19:53 ` Yonghong Song [this message]
2022-10-23 18:05 ` [PATCH bpf-next v4 4/7] libbpf: Support new cgroup local storage Yonghong Song
2022-10-23 20:03 ` David Vernet
2022-10-23 18:05 ` [PATCH bpf-next v4 5/7] bpftool: " Yonghong Song
2022-10-23 18:05 ` [PATCH bpf-next v4 6/7] selftests/bpf: Add selftests for " Yonghong Song
2022-10-23 20:14 ` David Vernet
2022-10-24 19:03 ` Yonghong Song
2022-10-24 20:30 ` Martin KaFai Lau
2022-10-25 2:26 ` Yonghong Song
2022-10-23 18:05 ` [PATCH bpf-next v4 7/7] docs/bpf: Add documentation " Yonghong Song
2022-10-23 20:26 ` David Vernet
2022-10-24 19:05 ` Yonghong Song
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=5fc082a3-1aff-edd6-ca50-e9d99ea99743@meta.com \
--to=yhs@meta.com \
--cc=andrii@kernel.org \
--cc=ast@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=daniel@iogearbox.net \
--cc=kernel-team@fb.com \
--cc=kpsingh@kernel.org \
--cc=martin.lau@kernel.org \
--cc=martin.lau@linux.dev \
--cc=roman.gushchin@linux.dev \
--cc=tj@kernel.org \
--cc=yhs@fb.com \
--cc=yosryahmed@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox