From: Yonghong Song <yhs@fb.com>
To: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: bpf <bpf@vger.kernel.org>, Alexei Starovoitov <ast@kernel.org>,
Cong Wang <xiyou.wangcong@gmail.com>,
Daniel Borkmann <daniel@iogearbox.net>,
Kernel Team <kernel-team@fb.com>
Subject: Re: [PATCH bpf-next v3 04/11] bpf: add bpf_for_each_map_elem() helper
Date: Thu, 25 Feb 2021 18:16:09 -0800 [thread overview]
Message-ID: <c042d2c7-a15f-c9b3-be7e-c729c1cf7184@fb.com> (raw)
In-Reply-To: <CAEf4BzZMCOi__1Y86AbQDD_=kgT22G10pJqzEVwF5r37M2CB6A@mail.gmail.com>
On 2/25/21 2:41 PM, Andrii Nakryiko wrote:
> On Thu, Feb 25, 2021 at 1:35 AM Yonghong Song <yhs@fb.com> wrote:
>>
>> The bpf_for_each_map_elem() helper is introduced which
>> iterates all map elements with a callback function. The
>> helper signature looks like
>> long bpf_for_each_map_elem(map, callback_fn, callback_ctx, flags)
>> and for each map element, the callback_fn will be called. For
>> a hashmap, for example, the callback signature may look like
>> long callback_fn(map, key, val, callback_ctx)
>>
>> There are two known use cases for this. One is from upstream ([1]) where
>> a for_each_map_elem helper may help implement a timeout mechanism
>> in a more generic way. Another is from our internal discussion
>> for a firewall use case where a map contains all the rules. The packet
>> data can be compared against all these rules to decide whether to
>> allow or deny the packet.
>>
>> For array maps, users can already use a bounded loop to traverse
>> elements. Using this helper avoids the need for a bounded loop. For
>> other types of maps (e.g., hash maps) where a bounded loop is hard or
>> impossible to use, this helper provides a convenient way to
>> operate on all elements.
>>
>> For callback_fn, besides map and map element, a callback_ctx,
>> allocated on caller stack, is also passed to the callback
>> function. This callback_ctx argument can provide additional
>> input and allows writing to the caller stack for output.
>>
>> If the callback_fn returns 0, the helper will continue to the next
>> element if available. If the callback_fn returns 1, the helper
>> will stop iterating and return to the bpf program. Other return
>> values are not used for now.
>>
>> Currently, this helper is only available with jit. It is possible
>> to make it work with the interpreter with some effort, but I leave
>> that as future work.
>>
>> [1]: https://lore.kernel.org/bpf/20210122205415.113822-1-xiyou.wangcong@gmail.com/
>>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>> ---
>
> It looks good from the perspective of implementation (though I
> admittedly lost track of all the insn[0|1].imm transformations). But
> see some suggestions below (I hope you can incorporate them).
>
> Overall, though:
>
> Acked-by: Andrii Nakryiko <andrii@kernel.org>
>
>> include/linux/bpf.h | 13 +++
>> include/linux/bpf_verifier.h | 3 +
>> include/uapi/linux/bpf.h | 29 ++++-
>> kernel/bpf/bpf_iter.c | 16 +++
>> kernel/bpf/helpers.c | 2 +
>> kernel/bpf/verifier.c | 208 ++++++++++++++++++++++++++++++---
>> kernel/trace/bpf_trace.c | 2 +
>> tools/include/uapi/linux/bpf.h | 29 ++++-
>> 8 files changed, 287 insertions(+), 15 deletions(-)
>>
>
> [...]
>
>> @@ -3850,7 +3859,6 @@ union bpf_attr {
>> *
>> * long bpf_check_mtu(void *ctx, u32 ifindex, u32 *mtu_len, s32 len_diff, u64 flags)
>> * Description
>> -
>> * Check ctx packet size against exceeding MTU of net device (based
>> * on *ifindex*). This helper will likely be used in combination
>> * with helpers that adjust/change the packet size.
>> @@ -3910,6 +3918,24 @@ union bpf_attr {
>> * * **BPF_MTU_CHK_RET_FRAG_NEEDED**
>> * * **BPF_MTU_CHK_RET_SEGS_TOOBIG**
>> *
>> + * long bpf_for_each_map_elem(struct bpf_map *map, void *callback_fn, void *callback_ctx, u64 flags)
>> + * Description
>> + * For each element in **map**, call **callback_fn** function with
>> + * **map**, **callback_ctx** and other map-specific parameters.
>> + * For example, for hash and array maps, the callback signature can
>> + * be `long callback_fn(map, map_key, map_value, callback_ctx)`.
>
> I think this is the place to describe all supported maps and
> respective callback signatures. Otherwise users would have to dig into
> kernel sources quite deeply to understand what signature is expected.
>
> How about something like this.
>
> Here's a list of supported map types and their respective expected
> callback signatures:
>
> BPF_MAP_TYPE_A, BPF_MAP_TYPE_B:
> long (*callback_fn)(struct bpf_map *map, const void *key, void
> *value, void *ctx);
>
> BPF_MAP_TYPE_C:
> long (*callback_fn)(struct bpf_map *map, int cpu, const void *key,
> void *value, void *ctx);
>
> (whatever the right signature for per-cpu iteration is)
>
> This probably is the right place to also leave notes like below:
>
> "For per_cpu maps, the map_value is the value on the cpu where the
> bpf_prog is running." (I'd move it out from below to be more visible).
>
> If we don't leave such documentation, it is going to be a major pain
> for users (and people like us helping them).
Good idea. Will write detailed documentation here.
>
>> + * The **callback_fn** should be a static function and
>> + * the **callback_ctx** should be a pointer to the stack.
>> + *		**flags** is used to control certain aspects of the helper.
>> + *		Currently, **flags** must be 0. For per_cpu maps,
>> + *		the map_value is the value on the cpu where the bpf_prog is running.
>> + *
>> + *		If **callback_fn** returns 0, the helper will continue to the next
>> + *		element. If the return value is 1, the helper will skip the rest of
>> + *		the elements and return. Other return values are not used now.
>> + * Return
>> + *		The number of traversed map elements on success, **-EINVAL** for
>> + * invalid **flags**.
>> */
>
> [...]
>
>> @@ -1556,6 +1568,19 @@ static int check_subprogs(struct bpf_verifier_env *env)
>>
>> /* determine subprog starts. The end is one before the next starts */
>> for (i = 0; i < insn_cnt; i++) {
>> + if (bpf_pseudo_func(insn + i)) {
>> + if (!env->bpf_capable) {
>> + verbose(env,
>> + "function pointers are allowed for CAP_BPF and CAP_SYS_ADMIN\n");
>> + return -EPERM;
>> + }
>> + ret = add_subprog(env, i + insn[i].imm + 1);
>> + if (ret < 0)
>> + return ret;
>> + /* remember subprog */
>> + insn[i + 1].imm = find_subprog(env, i + insn[i].imm + 1);
>
> hm... my expectation would be that add_subprog returns >= 0 on
> success, which is an index of the subprog, so that no one needs to
> call find_subprog yet again (it's already called internally in
> add_subprog). Do you think it would be terrible to do that? It doesn't
> seem like anything would break with that convention.
Will change add_subprog() to return the subprogno. It does not break
existing cases.
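A simplified model of that convention (names, the flat array, and the
linear lookup are illustrative; the verifier keeps a sorted subprog
table):

```c
#include <errno.h>

/* Model of the suggested convention: add_subprog() returns the
 * subprog's index (>= 0) on success, so callers don't need a second
 * find_subprog() lookup afterwards.
 */
#define MAX_SUBPROGS 4

static int subprog_starts[MAX_SUBPROGS];
static int nr_subprogs;

static int find_subprog(int insn_off)
{
	for (int i = 0; i < nr_subprogs; i++)
		if (subprog_starts[i] == insn_off)
			return i;
	return -ENOENT;
}

static int add_subprog(int insn_off)
{
	int idx = find_subprog(insn_off);

	if (idx >= 0)
		return idx;           /* already present: return its index */
	if (nr_subprogs >= MAX_SUBPROGS)
		return -E2BIG;
	subprog_starts[nr_subprogs] = insn_off;
	return nr_subprogs++;         /* new entry: return the new index */
}
```

With this, `insn[i + 1].imm = add_subprog(...)` works directly, and
negative return values still signal failure as before.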
>
>> + continue;
>> + }
>> if (!bpf_pseudo_call(insn + i))
>> continue;
>> if (!env->bpf_capable) {
>
> [...]
>
>> static int prepare_func_exit(struct bpf_verifier_env *env, int *insn_idx)
>> {
>> struct bpf_verifier_state *state = env->cur_state;
>> @@ -5400,8 +5487,22 @@ static int prepare_func_exit(struct bpf_verifier_env *env, int *insn_idx)
>>
>> state->curframe--;
>> caller = state->frame[state->curframe];
>> - /* return to the caller whatever r0 had in the callee */
>> - caller->regs[BPF_REG_0] = *r0;
>> + if (callee->in_callback_fn) {
>> + /* enforce R0 return value range [0, 1]. */
>> + struct tnum range = tnum_range(0, 1);
>> +
>> + if (r0->type != SCALAR_VALUE) {
>> + verbose(env, "R0 not a scalar value\n");
>> + return -EACCES;
>> + }
>> + if (!tnum_in(range, r0->var_off)) {
>
> if (!tnum_in(tnum_range(0, 1), r0->var_off)) should work as well,
> unless you find it less readable (I don't but no strong feeling here)
Will give it a try.
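For readers unfamiliar with tnums, here is a trimmed userspace model of
what this check does (mirroring the semantics of kernel/bpf/tnum.c, but
simplified for illustration; not the kernel source):

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified model of the verifier's tristate numbers: `value` holds
 * the known-one bits, `mask` holds the unknown bits.
 */
struct tnum {
	uint64_t value;
	uint64_t mask;
};

/* Smallest tnum covering every value in [min, max]. */
static struct tnum tnum_range(uint64_t min, uint64_t max)
{
	uint64_t chi = min ^ max, delta;
	int bits = chi ? 64 - __builtin_clzll(chi) : 0;

	if (bits > 63) /* 1ULL << 64 would be undefined */
		return (struct tnum){ 0, ~0ULL };
	delta = (1ULL << bits) - 1;
	return (struct tnum){ min & ~delta, delta };
}

/* Is tnum b a subset of tnum a? This is what enforces R0 in [0, 1]. */
static bool tnum_in(struct tnum a, struct tnum b)
{
	if (b.mask & ~a.mask)
		return false;
	b.value &= ~a.mask;
	return a.value == b.value;
}
```

So `tnum_in(tnum_range(0, 1), r0->var_off)` accepts a callback whose R0
is provably 0 or 1 and rejects anything else.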
>
>
>> + verbose_invalid_scalar(env, r0, &range, "callback return", "R0");
>> + return -EINVAL;
>> + }
>> + } else {
>> + /* return to the caller whatever r0 had in the callee */
>> + caller->regs[BPF_REG_0] = *r0;
>> + }
>>
>> /* Transfer references to the caller */
>> err = transfer_reference_state(caller, callee);
>> @@ -5456,7 +5557,8 @@ record_func_map(struct bpf_verifier_env *env, struct bpf_call_arg_meta *meta,
>> func_id != BPF_FUNC_map_delete_elem &&
>> func_id != BPF_FUNC_map_push_elem &&
>> func_id != BPF_FUNC_map_pop_elem &&
>> - func_id != BPF_FUNC_map_peek_elem)
>> + func_id != BPF_FUNC_map_peek_elem &&
>> + func_id != BPF_FUNC_for_each_map_elem)
>> return 0;
>>
>> if (map == NULL) {
>> @@ -5537,15 +5639,18 @@ static int check_reference_leak(struct bpf_verifier_env *env)
>> return state->acquired_refs ? -EINVAL : 0;
>> }
>>
>> -static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn_idx)
>> +static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
>> + int *insn_idx_p)
>> {
>> const struct bpf_func_proto *fn = NULL;
>> struct bpf_reg_state *regs;
>> struct bpf_call_arg_meta meta;
>> + int insn_idx = *insn_idx_p;
>> bool changes_data;
>> - int i, err;
>> + int i, err, func_id;
>>
>> /* find function prototype */
>> + func_id = insn->imm;
>> if (func_id < 0 || func_id >= __BPF_FUNC_MAX_ID) {
>> verbose(env, "invalid func %s#%d\n", func_id_name(func_id),
>> func_id);
>> @@ -5641,6 +5746,13 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
>> return -EINVAL;
>> }
>>
>> + if (func_id == BPF_FUNC_for_each_map_elem) {
>> + err = __check_func_call(env, insn, insn_idx_p, meta.subprogno,
>
> so here __check_func_call never updates *insn_idx_p, which means
> check_helper_call() doesn't need int * for instruction index. This
> pointer is just adding to confusion, because it's not used to pass
> value back. So unless I missed something, let's please drop the
> pointer and pass the index by value.
As mentioned in one of my previous emails, *insn_idx_p is indeed
changed: the next insn to be checked will be in the callee.
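A toy model of why the index is passed by pointer (names and numbers
are illustrative, not verifier code):

```c
#include <stdbool.h>

/* Toy model of why check_helper_call() takes `int *insn_idx_p`: for a
 * helper carrying a callback argument, verification continues at the
 * callback's first instruction, so the helper check writes the new
 * index back to the do_check() loop instead of only reading it.
 */
static int check_helper_call_model(int *insn_idx_p, bool has_callback,
				   int callee_start)
{
	if (has_callback)
		*insn_idx_p = callee_start; /* next insn checked is in callee */
	else
		*insn_idx_p += 1;           /* fall through to next insn */
	return 0;
}
```

For a plain helper the index just advances; only the callback case
redirects it, which is exactly the bpf_for_each_map_elem() path.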
>
>> + set_map_elem_callback_state);
>> + if (err < 0)
>> + return -EINVAL;
>> + }
>> +
>> /* reset caller saved regs */
>> for (i = 0; i < CALLER_SAVED_REGS; i++) {
>> mark_reg_not_init(env, regs, caller_saved[i]);
>
> [...]
>
>> + case PTR_TO_MAP_KEY:
>> case PTR_TO_MAP_VALUE:
>> /* If the new min/max/var_off satisfy the old ones and
>> * everything else matches, we are OK.
>> @@ -10126,10 +10274,9 @@ static int do_check(struct bpf_verifier_env *env)
>> if (insn->src_reg == BPF_PSEUDO_CALL)
>> err = check_func_call(env, insn, &env->insn_idx);
>> else
>> - err = check_helper_call(env, insn->imm, env->insn_idx);
>> + err = check_helper_call(env, insn, &env->insn_idx);
>
> see, here again. Will env->insn_idx change here? What would that mean?
> Just lots of unnecessary worries.
>
>> if (err)
>> return err;
>> -
>> } else if (opcode == BPF_JA) {
>> if (BPF_SRC(insn->code) != BPF_K ||
>> insn->imm != 0 ||
>
> [...]
>