Re: [PATCH v2 bpf-next 1/4] bpf: Add bloom filter map implementation

BPF List
 help / color / mirror / Atom feed

From: Joanne Koong <joannekoong@fb.com>
To: Andrii Nakryiko <andrii.nakryiko@gmail.com>,
	Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: bpf <bpf@vger.kernel.org>, Kernel Team <Kernel-team@fb.com>
Subject: Re: [PATCH v2 bpf-next 1/4] bpf: Add bloom filter map implementation
Date: Mon, 20 Sep 2021 15:52:50 -0700	[thread overview]
Message-ID: <17d7b319-01d0-163e-57b6-c385d38cc9ad@fb.com> (raw)
In-Reply-To: <CAEf4BzaA2QCmcc0nZqNbAqMdabqhjE5X_Nh59QjP8kd=iGH5GA@mail.gmail.com>

My previous email replied to Alexei's email before I saw Andrii's new 
email, so please
feel free to disregard my previous email.

On 9/20/21 1:58 PM, Andrii Nakryiko wrote:

> On Fri, Sep 17, 2021 at 6:08 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>> On Mon, Sep 13, 2021 at 09:04:30PM -0700, Joanne Koong wrote:
>>> +
>>> +/* For bloom filter maps, the next 4 bits represent how many hashes to use.
>>> + * The maximum number of hash functions supported is 15. If this is not set,
>>> + * the default number of hash functions used will be 5.
>>> + */
>>> +     BPF_F_BLOOM_FILTER_HASH_BIT_1 = (1U << 13),
>>> +     BPF_F_BLOOM_FILTER_HASH_BIT_2 = (1U << 14),
>>> +     BPF_F_BLOOM_FILTER_HASH_BIT_3 = (1U << 15),
>>> +     BPF_F_BLOOM_FILTER_HASH_BIT_4 = (1U << 16),
>> The bit selection is unintuitive.
>> Since key_size has to be zero may be used that instead to indicate the number of hash
>> functions in the rare case when 5 is not good enough?
> Hm... I was initially thinking about proposing something like that,
> but it felt a bit ugly at the time. But now thinking about this a bit
> more, I think this would be a bit more meaningful if we change the
> terminology a bit. Instead of saying that Bloom filter has values and
> no keys, we actually have keys and no values. So all those bytes that
> are hashed are treated as keys (which is actually how sets are
> implemented on top of maps, where you have keys and no values, or at
> least the value is always true).
>
> So with that we'll have key/key_size to specify number of bytes that
> needs to be hashed (and it's type info). And then we can squint a bit
> and say that number of hashes are specified by value_size, as in
> values are those nr_hash bits that we set in Bloom filter.
>
> Still a bit of terminology stretch, but won't necessitate those
> specialized fields just for Bloom filter map. But if default value is
> going to be good enough for most cases and most cases won't need to
> adjust number of hashes, this seems to be pretty clean to me.

With having bloom filter map keys instead of values,  I think this would
lead to messier code in the kernel for handling map_lookup_elem
and map_update_elem calls, due to the fact that the bloom filter map
is a non-associative map and the current APIs for non-associative map types
(peek_elem/push_elem/pop_elem) all have the map data as the value and
not the key.

For example, for map_update_elem, the API from the eBPF program side is

int (*map_update_elem)(struct bpf_map *map, void *key, void *value, u64 
flags);

This would require us to either

a) Add some custom logic in syscall.c so that we bypass the 
copy_from_bpfptr call on
bloom filter map values (necessary because memcpying 0 bytes still 
requires the src pointer
to be valid), which would allow us to pass in a NULL value

b) Add a new function like

int (*map_push_key)(struct bpf_map *map, void *key, u64 flags)

that eBPF programs can call instead of map_update_elem.

or

c) Try to repurpose the existing map_push_elem API by passing in the key 
instead of the value,
which would lead to inconsistent use of the API

I think if we could change the non-associative map types (currently only 
stack maps and queue
maps, I believe) to have their data be a key instead of a value, and 
have the pop/peek APIs use
keys instead of values, then this would be cleaner, since we could then 
just use the existing peek/pop
APIs.

>> Or use inner_map_fd since there is no possibility of having an inner map in bloomfilter.
>> It could be a union:
>>      __u32   max_entries;    /* max number of entries in a map */
>>      __u32   map_flags;      /* BPF_MAP_CREATE related
>>                               * flags defined above.
>>                               */
>>      union {
>>         __u32  inner_map_fd;   /* fd pointing to the inner map */
>>         __u32  nr_hash_funcs;  /* or number of hash functions */
>>      };
> This works as well. A bit more Bloom filter-only terminology
> throughout UAPI and libbpf, but I'd be fine with that as well.
>
Great, it looks like this is the consensus - I will go with this option 
for v3!
>>      __u32   numa_node;      /* numa node */
>>
>>> +struct bpf_bloom_filter {
>>> +     struct bpf_map map;
>>> +     u32 bit_array_mask;
>>> +     u32 hash_seed;
>>> +     /* If the size of the values in the bloom filter is u32 aligned,
>>> +      * then it is more performant to use jhash2 as the underlying hash
>>> +      * function, else we use jhash. This tracks the number of u32s
>>> +      * in an u32-aligned value size. If the value size is not u32 aligned,
>>> +      * this will be 0.
>>> +      */
>>> +     u32 aligned_u32_count;
>> what is the performance difference?
>> May be we enforce 4-byte sized value for simplicity?
> This might be a bit too surprising, especially if keys are just some
> strings, where people might not expect that it has to 4-byte multiple
> size. And debugging this without extra tooling (like retsnoop) is
> going to be nightmarish.
>
> If the performance diff is huge and that if/else logic is
> unacceptable, we can also internally pad with up to 3 zero bytes and
> include those into the hash.
I think the if/else logic is unavoidable if we support non 4-byte 
aligned value sizes,
unless we are okay with truncating any remainder bytes of non 4-byte 
aligned values
and stipulating that a bloom filter map value size has to be greater 
than 4 bytes (these
conditions would allow us to use jhash2 for every value without an 
if/else check). If we
internally pad, we will have to pad on every update and lookup, which 
would also
require an if/else.
Thanks for the comments and reviews, Alexei and Andrii. They are much 
appreciated!

next prev parent reply	other threads:[~2021-09-20 22:54 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-09-14  4:04 [PATCH v2 bpf-next 0/4] Implement bloom filter map Joanne Koong
2021-09-14  4:04 ` [PATCH v2 bpf-next 1/4] bpf: Add bloom filter map implementation Joanne Koong
2021-09-17 17:01   ` Alexei Starovoitov
2021-09-20 20:58     ` Andrii Nakryiko
2021-09-20 22:52       ` Joanne Koong [this message]
2021-09-20 23:21         ` Andrii Nakryiko
2021-09-20 21:03     ` Joanne Koong
2021-09-17 21:48   ` Andrii Nakryiko
2021-09-14  4:04 ` [PATCH v2 bpf-next 2/4] selftests/bpf: Add bloom filter map test cases Joanne Koong
2021-09-14  4:04 ` [PATCH v2 bpf-next 3/4] bpf/benchs: Add benchmark test for bloom filter maps Joanne Koong
2021-09-14  4:04 ` [PATCH v2 bpf-next 4/4] bpf/benchs: Add benchmarks for comparing hashmap lookups with vs. without bloom filter Joanne Koong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=17d7b319-01d0-163e-57b6-c385d38cc9ad@fb.com \
    --to=joannekoong@fb.com \
    --cc=Kernel-team@fb.com \
    --cc=alexei.starovoitov@gmail.com \
    --cc=andrii.nakryiko@gmail.com \
    --cc=bpf@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox