From: Yonghong Song <yhs@fb.com>
To: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: bpf <bpf@vger.kernel.org>, Alexei Starovoitov <ast@kernel.org>,
Cong Wang <xiyou.wangcong@gmail.com>,
Daniel Borkmann <daniel@iogearbox.net>,
Kernel Team <kernel-team@fb.com>
Subject: Re: [PATCH bpf-next v3 04/11] bpf: add bpf_for_each_map_elem() helper
Date: Thu, 25 Feb 2021 18:16:09 -0800 [thread overview]
Message-ID: <c042d2c7-a15f-c9b3-be7e-c729c1cf7184@fb.com> (raw)
In-Reply-To: <CAEf4BzZMCOi__1Y86AbQDD_=kgT22G10pJqzEVwF5r37M2CB6A@mail.gmail.com>
On 2/25/21 2:41 PM, Andrii Nakryiko wrote:
> On Thu, Feb 25, 2021 at 1:35 AM Yonghong Song <yhs@fb.com> wrote:
>>
>> The bpf_for_each_map_elem() helper is introduced which
>> iterates all map elements with a callback function. The
>> helper signature looks like
>> long bpf_for_each_map_elem(map, callback_fn, callback_ctx, flags)
>> and for each map element, the callback_fn will be called. For
>> a hashmap, for example, the callback signature may look like
>> long callback_fn(map, key, val, callback_ctx)
>>
>> There are two known use cases for this. One is from upstream ([1]) where
>> a for_each_map_elem helper may help implement a timeout mechanism
>> in a more generic way. Another is from our internal discussion
>> for a firewall use case where a map contains all the rules. The packet
>> data can be compared against all these rules to decide whether to
>> allow or deny the packet.
>>
>> For array maps, users can already use a bounded loop to traverse
>> elements. Using this helper avoids the need for a bounded loop. For
>> other types of maps (e.g., hash maps) where a bounded loop is hard or
>> impossible to use, this helper provides a convenient way to
>> operate on all elements.
>>
>> For callback_fn, besides map and map element, a callback_ctx,
>> allocated on caller stack, is also passed to the callback
>> function. This callback_ctx argument can provide additional
>> input and allows writing to the caller stack for output.
>>
>> If the callback_fn returns 0, the helper will continue to the next
>> element if available. If the callback_fn returns 1, the helper
>> will stop iterating and return to the bpf program. Other return
>> values are not used for now.
>>
>> Currently, this helper is only available with jit. It is possible
>> to make it work with the interpreter with some effort, but I leave
>> that as future work.
>>
>> [1]: https://lore.kernel.org/bpf/20210122205415.113822-1-xiyou.wangcong@gmail.com/
>>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>> ---
>
> It looks good from the perspective of implementation (though I
> admittedly lost track of all the insn[0|1].imm transformations). But
> see some suggestions below (I hope you can incorporate them).
>
> Overall, though:
>
> Acked-by: Andrii Nakryiko <andrii@kernel.org>
>
>> include/linux/bpf.h | 13 +++
>> include/linux/bpf_verifier.h | 3 +
>> include/uapi/linux/bpf.h | 29 ++++-
>> kernel/bpf/bpf_iter.c | 16 +++
>> kernel/bpf/helpers.c | 2 +
>> kernel/bpf/verifier.c | 208 ++++++++++++++++++++++++++++++---
>> kernel/trace/bpf_trace.c | 2 +
>> tools/include/uapi/linux/bpf.h | 29 ++++-
>> 8 files changed, 287 insertions(+), 15 deletions(-)
>>
>
> [...]
>
>> @@ -3850,7 +3859,6 @@ union bpf_attr {
>> *
>> * long bpf_check_mtu(void *ctx, u32 ifindex, u32 *mtu_len, s32 len_diff, u64 flags)
>> * Description
>> -
>> * Check ctx packet size against exceeding MTU of net device (based
>> * on *ifindex*). This helper will likely be used in combination
>> * with helpers that adjust/change the packet size.
>> @@ -3910,6 +3918,24 @@ union bpf_attr {
>> * * **BPF_MTU_CHK_RET_FRAG_NEEDED**
>> * * **BPF_MTU_CHK_RET_SEGS_TOOBIG**
>> *
>> + * long bpf_for_each_map_elem(struct bpf_map *map, void *callback_fn, void *callback_ctx, u64 flags)
>> + * Description
>> + * For each element in **map**, call **callback_fn** function with
>> + * **map**, **callback_ctx** and other map-specific parameters.
>> + * For example, for hash and array maps, the callback signature can
>> + * be `long callback_fn(map, map_key, map_value, callback_ctx)`.
>
> I think this is the place to describe all supported maps and
> respective callback signatures. Otherwise users would have to dig into
> kernel sources quite deeply to understand what signature is expected.
>
> How about something like this.
>
> Here's a list of supported map types and their respective expected
> callback signatures:
>
> BPF_MAP_TYPE_A, BPF_MAP_TYPE_B:
> long (*callback_fn)(struct bpf_map *map, const void *key, void
> *value, void *ctx);
>
> BPF_MAP_TYPE_C:
> long (*callback_fn)(struct bpf_map *map, int cpu, const void *key,
> void *value, void *ctx);
>
> (whatever the right signature for per-cpu iteration is)
>
> This probably is the right place to also leave notes like below:
>
> "For per_cpu maps, the map_value is the value on the cpu where the
> bpf_prog is running." (I'd move it out from below to be more visible).
>
> If we don't leave such documentation, it is going to be a major pain
> for users (and people like us helping them).
Good idea. Will write detailed documentation here.
>
>> + * The **callback_fn** should be a static function and
>> + * the **callback_ctx** should be a pointer to the stack.
>> + *		**flags** is used to control certain aspects of the helper.
>> + *		Currently, **flags** must be 0. For per_cpu maps,
>> + *		the map_value is the value on the cpu where the bpf_prog is running.
>> + *
>> + *		If **callback_fn** returns 0, the helper will continue to the next
>> + *		element. If the return value is 1, the helper will skip the rest of
>> + *		the elements and return. Other return values are not used now.
>> + * Return
>> + *		The number of traversed map elements on success, **-EINVAL** for
>> + * invalid **flags**.
>> */
>
> [...]
>
>> @@ -1556,6 +1568,19 @@ static int check_subprogs(struct bpf_verifier_env *env)
>>
>> /* determine subprog starts. The end is one before the next starts */
>> for (i = 0; i < insn_cnt; i++) {
>> + if (bpf_pseudo_func(insn + i)) {
>> + if (!env->bpf_capable) {
>> + verbose(env,
>> + "function pointers are allowed for CAP_BPF and CAP_SYS_ADMIN\n");
>> + return -EPERM;
>> + }
>> + ret = add_subprog(env, i + insn[i].imm + 1);
>> + if (ret < 0)
>> + return ret;
>> + /* remember subprog */
>> + insn[i + 1].imm = find_subprog(env, i + insn[i].imm + 1);
>
> hm... my expectation would be that add_subprog returns >= 0 on
> success, which is an index of the subprog, so that no one needs to
> call find_subprog yet again (it's already called internally in
> add_subprog). Do you think it would be terrible to do that? It doesn't
> seem like anything would break with that convention.
Will change add_subprog() to return the subprogno. It does not break
existing cases.
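A simplified model of that convention (names, the flat array, and the
linear lookup are illustrative; the verifier keeps a sorted subprog
table):

```c
#include <errno.h>

/* Model of the suggested convention: add_subprog() returns the
 * subprog's index (>= 0) on success, so callers don't need a second
 * find_subprog() lookup afterwards.
 */
#define MAX_SUBPROGS 4

static int subprog_starts[MAX_SUBPROGS];
static int nr_subprogs;

static int find_subprog(int insn_off)
{
	for (int i = 0; i < nr_subprogs; i++)
		if (subprog_starts[i] == insn_off)
			return i;
	return -ENOENT;
}

static int add_subprog(int insn_off)
{
	int idx = find_subprog(insn_off);

	if (idx >= 0)
		return idx;           /* already present: return its index */
	if (nr_subprogs >= MAX_SUBPROGS)
		return -E2BIG;
	subprog_starts[nr_subprogs] = insn_off;
	return nr_subprogs++;         /* new entry: return the new index */
}
```

With this, `insn[i + 1].imm = add_subprog(...)` works directly, and
negative return values still signal failure as before.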
>
>> + continue;
>> + }
>> if (!bpf_pseudo_call(insn + i))
>> continue;
>> if (!env->bpf_capable) {
>
> [...]
>
>> static int prepare_func_exit(struct bpf_verifier_env *env, int *insn_idx)
>> {
>> struct bpf_verifier_state *state = env->cur_state;
>> @@ -5400,8 +5487,22 @@ static int prepare_func_exit(struct bpf_verifier_env *env, int *insn_idx)
>>
>> state->curframe--;
>> caller = state->frame[state->curframe];
>> - /* return to the caller whatever r0 had in the callee */
>> - caller->regs[BPF_REG_0] = *r0;
>> + if (callee->in_callback_fn) {
>> + /* enforce R0 return value range [0, 1]. */
>> + struct tnum range = tnum_range(0, 1);
>> +
>> + if (r0->type != SCALAR_VALUE) {
>> + verbose(env, "R0 not a scalar value\n");
>> + return -EACCES;
>> + }
>> + if (!tnum_in(range, r0->var_off)) {
>
> if (!tnum_in(tnum_range(0, 1), r0->var_off)) should work as well,
> unless you find it less readable (I don't but no strong feeling here)
Will give it a try.
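For readers unfamiliar with tnums, here is a trimmed userspace model of
what this check does (mirroring the semantics of kernel/bpf/tnum.c, but
simplified for illustration; not the kernel source):

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified model of the verifier's tristate numbers: `value` holds
 * the known-one bits, `mask` holds the unknown bits.
 */
struct tnum {
	uint64_t value;
	uint64_t mask;
};

/* Smallest tnum covering every value in [min, max]. */
static struct tnum tnum_range(uint64_t min, uint64_t max)
{
	uint64_t chi = min ^ max, delta;
	int bits = chi ? 64 - __builtin_clzll(chi) : 0;

	if (bits > 63) /* 1ULL << 64 would be undefined */
		return (struct tnum){ 0, ~0ULL };
	delta = (1ULL << bits) - 1;
	return (struct tnum){ min & ~delta, delta };
}

/* Is tnum b a subset of tnum a? This is what enforces R0 in [0, 1]. */
static bool tnum_in(struct tnum a, struct tnum b)
{
	if (b.mask & ~a.mask)
		return false;
	b.value &= ~a.mask;
	return a.value == b.value;
}
```

So `tnum_in(tnum_range(0, 1), r0->var_off)` accepts a callback whose R0
is provably 0 or 1 and rejects anything else.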
>
>
>> + verbose_invalid_scalar(env, r0, &range, "callback return", "R0");
>> + return -EINVAL;
>> + }
>> + } else {
>> + /* return to the caller whatever r0 had in the callee */
>> + caller->regs[BPF_REG_0] = *r0;
>> + }
>>
>> /* Transfer references to the caller */
>> err = transfer_reference_state(caller, callee);
>> @@ -5456,7 +5557,8 @@ record_func_map(struct bpf_verifier_env *env, struct bpf_call_arg_meta *meta,
>> func_id != BPF_FUNC_map_delete_elem &&
>> func_id != BPF_FUNC_map_push_elem &&
>> func_id != BPF_FUNC_map_pop_elem &&
>> - func_id != BPF_FUNC_map_peek_elem)
>> + func_id != BPF_FUNC_map_peek_elem &&
>> + func_id != BPF_FUNC_for_each_map_elem)
>> return 0;
>>
>> if (map == NULL) {
>> @@ -5537,15 +5639,18 @@ static int check_reference_leak(struct bpf_verifier_env *env)
>> return state->acquired_refs ? -EINVAL : 0;
>> }
>>
>> -static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn_idx)
>> +static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
>> + int *insn_idx_p)
>> {
>> const struct bpf_func_proto *fn = NULL;
>> struct bpf_reg_state *regs;
>> struct bpf_call_arg_meta meta;
>> + int insn_idx = *insn_idx_p;
>> bool changes_data;
>> - int i, err;
>> + int i, err, func_id;
>>
>> /* find function prototype */
>> + func_id = insn->imm;
>> if (func_id < 0 || func_id >= __BPF_FUNC_MAX_ID) {
>> verbose(env, "invalid func %s#%d\n", func_id_name(func_id),
>> func_id);
>> @@ -5641,6 +5746,13 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
>> return -EINVAL;
>> }
>>
>> + if (func_id == BPF_FUNC_for_each_map_elem) {
>> + err = __check_func_call(env, insn, insn_idx_p, meta.subprogno,
>
> so here __check_func_call never updates *insn_idx_p, which means
> check_helper_call() doesn't need int * for instruction index. This
> pointer is just adding to confusion, because it's not used to pass
> value back. So unless I missed something, let's please drop the
> pointer and pass the index by value.
As mentioned in one of my previous emails, *insn_idx_p is indeed
changed: the next insn to be checked will be in the callee.
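A toy model of why the index is passed by pointer (names and numbers
are illustrative, not verifier code):

```c
#include <stdbool.h>

/* Toy model of why check_helper_call() takes `int *insn_idx_p`: for a
 * helper carrying a callback argument, verification continues at the
 * callback's first instruction, so the helper check writes the new
 * index back to the do_check() loop instead of only reading it.
 */
static int check_helper_call_model(int *insn_idx_p, bool has_callback,
				   int callee_start)
{
	if (has_callback)
		*insn_idx_p = callee_start; /* next insn checked is in callee */
	else
		*insn_idx_p += 1;           /* fall through to next insn */
	return 0;
}
```

For a plain helper the index just advances; only the callback case
redirects it, which is exactly the bpf_for_each_map_elem() path.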
>
>> + set_map_elem_callback_state);
>> + if (err < 0)
>> + return -EINVAL;
>> + }
>> +
>> /* reset caller saved regs */
>> for (i = 0; i < CALLER_SAVED_REGS; i++) {
>> mark_reg_not_init(env, regs, caller_saved[i]);
>
> [...]
>
>> + case PTR_TO_MAP_KEY:
>> case PTR_TO_MAP_VALUE:
>> /* If the new min/max/var_off satisfy the old ones and
>> * everything else matches, we are OK.
>> @@ -10126,10 +10274,9 @@ static int do_check(struct bpf_verifier_env *env)
>> if (insn->src_reg == BPF_PSEUDO_CALL)
>> err = check_func_call(env, insn, &env->insn_idx);
>> else
>> - err = check_helper_call(env, insn->imm, env->insn_idx);
>> + err = check_helper_call(env, insn, &env->insn_idx);
>
> see, here again. Will env->insn_idx change here? What would that mean?
> Just lots of unnecessary worries.
>
>> if (err)
>> return err;
>> -
>> } else if (opcode == BPF_JA) {
>> if (BPF_SRC(insn->code) != BPF_K ||
>> insn->imm != 0 ||
>
> [...]
>