From: Yonghong Song <yonghong.song@linux.dev>
To: David Vernet <void@manifault.com>,
Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Yafang Shao <laoar.shao@gmail.com>,
Alexei Starovoitov <ast@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
John Fastabend <john.fastabend@gmail.com>,
Andrii Nakryiko <andrii@kernel.org>,
Martin KaFai Lau <martin.lau@linux.dev>,
Song Liu <song@kernel.org>, Yonghong Song <yhs@fb.com>,
KP Singh <kpsingh@kernel.org>,
Stanislav Fomichev <sdf@google.com>, Hao Luo <haoluo@google.com>,
Jiri Olsa <jolsa@kernel.org>, bpf <bpf@vger.kernel.org>
Subject: Re: [RFC PATCH bpf-next 0/3] bpf: Add new bpf helper bpf_for_each_cpu
Date: Tue, 1 Aug 2023 23:54:18 -0700 [thread overview]
Message-ID: <8b6b0703-4ed6-c0cb-c61a-9ebcfb5fe668@linux.dev> (raw)
In-Reply-To: <20230802032958.GB472124@maniforge>
On 8/1/23 8:29 PM, David Vernet wrote:
> On Tue, Aug 01, 2023 at 07:45:57PM -0700, Alexei Starovoitov wrote:
>> On Tue, Aug 1, 2023 at 7:34 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>>>
>>>>
>>>> In kernel, we have a global variable
>>>> nr_cpu_ids (also in kernel/bpf/helpers.c)
>>>> which is used in numerous places for per cpu data struct access.
>>>>
>>>> I am wondering whether we could have bpf code like
>>>> int nr_cpu_ids __ksym;
>
> I think this would be useful in general, though any __ksym variable like
> this would have to be const and mapped in .rodata, right? But yeah,
> being able to R/O map global variables like this which have static
> lifetimes would be nice.
No. There is no map here. __ksym symbol will have a ld_imm64 insn
to load the value in the bpf code. The address will be the kernel
address patched by libbpf.
>
> It's not quite the same thing as nr_cpu_ids, but FWIW, you could
> accomplish something close to this by doing something like this in your
> BPF prog:
>
> /* Set in user space to libbpf_num_possible_cpus() */
> const volatile __u32 nr_cpus;
This is another approach. In this case, nr_cpus will be
stored in a map.
I don't know the difference between kernel nr_cpu_ids vs.
libbpf_num_possible_cpus(). I am choosing nr_cpu_ids
because it is the one used inside the kernel. If
libbpf_num_possible_cpus() effectively nr_cpu_ids,
then happy to use libbpf_num_possible_cpus() which
is more user/libbpf friendly.
>
> ...
> __u32 i;
>
> bpf_for(i, 0, nr_cpus)
> bpf_printk("Iterating over cpu %u", i);
>
> ...
>
>>>> struct bpf_iter_num it;
>>>> int i = 0;
>>>>
>>>> // nr_cpu_ids is special, we can give it a range [1, CONFIG_NR_CPUS].
>>>> bpf_iter_num_new(&it, 1, nr_cpu_ids);
>>>> while ((v = bpf_iter_num_next(&it))) {
>>>> /* access cpu i data */
>>>> i++;
>>>> }
>>>> bpf_iter_num_destroy(&it);
>>>>
>>>> From all existing open coded iterator loops, looks like
>>>> upper bound has to be a constant. We might need to extend support
>>>> to bounded scalar upper bound if not there.
>>>
>>> Currently the upper bound is required by both the open-coded for-loop
>>> and the bpf_loop. I think we can extend it.
>>>
>>> It can't handle the cpumask case either.
>>>
>>> for_each_cpu(cpu, mask)
>>>
>>> In the 'mask', the CPU IDs might not be continuous. In our container
>>> environment, we always use the cpuset cgroup for some critical tasks,
>>> but it is not so convenient to traverse the percpu data of this cpuset
>>> cgroup. We have to do it as follows for this case :
>>>
>>> That's why we prefer to introduce a bpf_for_each_cpu helper. It is
>>> fine if it can be implemented as a kfunc.
>>
>> I think open-coded-iterators is the only acceptable path forward here.
>> Since existing bpf_iter_num doesn't fit due to sparse cpumask,
>> let's introduce bpf_iter_cpumask and few additional kfuncs
>> that return cpu_possible_mask and others.
>
> I agree that this is the correct way to generalize this. The only thing
> that we'll have to figure out is how to generalize treating const struct
> cpumask * objects as kptrs. In sched_ext [0] we export
> scx_bpf_get_idle_cpumask() and scx_bpf_get_idle_smtmask() kfuncs to
> return trusted global cpumask kptrs that can then be "released" in
> scx_bpf_put_idle_cpumask(). scx_bpf_put_idle_cpumask() is empty and
> exists only to appease the verifier that the trusted cpumask kptrs
> aren't being leaked and are having their references "dropped".
>
> [0]: https://lore.kernel.org/all/20230711011412.100319-13-tj@kernel.org/
>
> I'd imagine that we have 2 ways forward if we want to enable progs to
> fetch other global cpumasks with static lifetimes (e.g.
> __cpu_possible_mask or nohz.idle_cpus_mask):
>
> 1. The most straightforward thing to do would be to add a new kfunc in
> kernel/bpf/cpumask.c that's a drop-in replacment for
> scx_bpf_put_idle_cpumask():
>
> void bpf_global_cpumask_drop(const struct cpumask *cpumask)
> {}
>
> 2. Another would be to implement something resembling what Yonghong
> suggested in [1], where progs can link against global allocated kptrs
> like:
>
> const struct cpumask *__cpu_possible_mask __ksym;
>
> [1]: https://lore.kernel.org/all/3f56b3b3-9b71-f0d3-ace1-406a8eeb64c0@linux.dev/#t
>
> In my opinion (1) is more straightforward, (2) is a better UX.
>
> Note again that both approaches only works for cpumasks with static
> lifetimes. I can't think of a way to treat dynamically allocated struct
> cpumask *objects as kptrs as there's nowhere to put a reference. If
> someone wants to track a dynamically allocated cpumask, they'd have to
> create a kptr out of its container object, and then pass that object's
> cpumask as a const struct cpumask * to other BPF cpumask kfuncs
> (including e.g. the proposed iterator).
>
>> We already have some cpumask support in kernel/bpf/cpumask.c
>> bpf_iter_cpumask will be a natural follow up.
>
> Yes, this should be easy to add.
>
> - David
next prev parent reply other threads:[~2023-08-02 6:54 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-08-01 14:29 [RFC PATCH bpf-next 0/3] bpf: Add new bpf helper bpf_for_each_cpu Yafang Shao
2023-08-01 14:29 ` [RFC PATCH bpf-next 1/3] bpf: Add bpf_for_each_cpu helper Yafang Shao
2023-08-01 14:29 ` [RFC PATCH bpf-next 2/3] cgroup, psi: Init root cgroup psi to psi_system Yafang Shao
2023-08-01 14:29 ` [RFC PATCH bpf-next 3/3] selftests/bpf: Add selftest for for_each_cpu Yafang Shao
2023-08-01 17:53 ` [RFC PATCH bpf-next 0/3] bpf: Add new bpf helper bpf_for_each_cpu Yonghong Song
2023-08-02 2:33 ` Yafang Shao
2023-08-02 2:45 ` Alexei Starovoitov
2023-08-02 2:57 ` Yafang Shao
2023-08-02 3:29 ` David Vernet
2023-08-02 6:54 ` Yonghong Song [this message]
2023-08-02 15:46 ` David Vernet
2023-08-02 16:23 ` Alexei Starovoitov
2023-08-02 16:33 ` Alexei Starovoitov
2023-08-02 17:06 ` David Vernet
2023-08-02 18:13 ` Alexei Starovoitov
2023-08-03 8:21 ` Alan Maguire
2023-08-03 15:22 ` Yonghong Song
2023-08-03 16:10 ` Alan Maguire
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=8b6b0703-4ed6-c0cb-c61a-9ebcfb5fe668@linux.dev \
--to=yonghong.song@linux.dev \
--cc=alexei.starovoitov@gmail.com \
--cc=andrii@kernel.org \
--cc=ast@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=daniel@iogearbox.net \
--cc=haoluo@google.com \
--cc=john.fastabend@gmail.com \
--cc=jolsa@kernel.org \
--cc=kpsingh@kernel.org \
--cc=laoar.shao@gmail.com \
--cc=martin.lau@linux.dev \
--cc=sdf@google.com \
--cc=song@kernel.org \
--cc=void@manifault.com \
--cc=yhs@fb.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.