From: Stefano Brivio <sbrivio@redhat.com>
To: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: netfilter-devel@vger.kernel.org,
"Florian Westphal" <fw@strlen.de>,
"Kadlecsik József" <kadlec@blackhole.kfki.hu>,
"Eric Garver" <eric@garver.life>, "Phil Sutter" <phil@nwl.cc>
Subject: Re: [PATCH nf-next v4 5/9] nf_tables: Add set type for arbitrary concatenation of ranges
Date: Mon, 10 Feb 2020 16:10:47 +0100 [thread overview]
Message-ID: <20200210161047.370582c5@redhat.com> (raw)
In-Reply-To: <20200207112308.sqtlvbluujlftqz2@salvia>
On Fri, 7 Feb 2020 12:23:08 +0100
Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> Hi Stefano,
>
> A bit of late feedback on your new datastructure.
>
> Did you tests with 8-bits grouping instead of 4-bits?
Yes, at the very beginning, not with the final implementation. It was
somewhat faster (on x86_64, I don't remember how much) for small
numbers of rules, then I thought we would use too much memory, because:
> Assuming a concatenation of 12 bytes (each field 4 bytes, hence 3
> fields):
>
> * Using 4-bits groups: the number of buckets is 2^4 = 16 multiplied
> by the bucket word (assuming one long word, 8 bytes, 64 pipapo
> rules) is 16 * 8 = 128 bytes per group-row in the looking table. Then,
> the number of group-rows is 8 given that 32 bits, then 32 / 4 = 8
> group-rows.
>
> 8 * 128 bytes = 1024 bytes per lookup table.
>
> Assuming 3 fields, then this is 1024 * 3 = 3072 bytes.
>
> * Using 8-bits groups: 2^8 = 256, then 256 * 8 = 2048 bytes per
> group-row. Then, 32 / 8 = 4 group-rows in total.
>
> 4 * 2048 bytes = 8192 bytes per lookup table.
>
> Therefore, 3 * 8192 = 24576 bytes. Still rather small.
>
> This is missing the mapping table that links the lookup tables in the
> memory counting. And I'm assuming that the number of pipapo rules in
> the lookup table fits into 64-bits bucket long word.
...the (reasonable?) worst case I wanted to cover was two IPv6
addresses, one port, one MAC address (in ipset terms
"net,net,port,mac"), with 2 ^ 16 unique, non-overlapping entries each
(or ranges expanding to that amount of rules), because that's what
(single, non-concatenated) ipset "bitmap" types can do.
Also ignoring the mapping table (it's "small"), with 4-bit buckets:
- for the IPv6 addresses, we have 16 buckets, each 2 ^ 16
bits wide, and 32 groups (128 bits / 4 bits), that is, 8MiB in
total
- for the MAC address, 16 buckets, each 2 ^ 16 bits wide, and 12
groups, 1.5MiB
- for the port, 16 buckets, each 2 ^ 12 bits wide, 2 groups, 0.25MiB
that is, 9.75MiB.
With 8-bit buckets: we can just multiply everything by 8 (that is,
2 ^ 8 / 2 ^ 4 / 2, because we have 2 ^ (8 - 4) times the buckets, with
half the groups), 78MiB.
And that started feeling like "a lot". However, I'm probably overdoing
with the worst case -- this was just to explain what brought me to the
4-bit choice, now I start doubting about it.
> Anyway, my understanding is that the more bits you use for grouping,
> the larger the lookup table becomes.
>
> Still, both look very small in terms of memory consumption for these
> days.
>
> I'm just telling this because the C implementation can probably get
> better numbers at the cost of consuming more memory? Probably do this
> at some point?
Another topic is the additional amount of cachelines we would use. I
don't expect that effect to be visible, but I might be wrong.
So yes, I think it's definitely worth a try, thanks for the hint! I'll
try to look into this soon and test it on a few archs (something with
small cachelines, say MIPS r2k, would be worth checking, too).
We could even consider to dynamically adjust group size depending on
the set size, I don't know yet if that gets too convoluted.
> BTW, with a few more knobs it should be possible to integrate better
> this datastructure into the transaction infrastructure, this can be
> done incrementally.
>
> More questions below.
>
> On Wed, Jan 22, 2020 at 12:17:55AM +0100, Stefano Brivio wrote:
> [...]
> > diff --git a/net/netfilter/nft_set_pipapo.c b/net/netfilter/nft_set_pipapo.c
> > new file mode 100644
> > index 000000000000..f0cb1e13af50
> > --- /dev/null
> > +++ b/net/netfilter/nft_set_pipapo.c
> [...]
> > + *
> > + * rule indices in current field: 0 1 2
> > + * map to rules in next field: 0 1 1
> > + *
> > + * the new result bitmap will be 0x02: rule 1 was set, and rule 1 will be
> > + * set.
> > + *
> > + * We can now extend this example to cover the second iteration of the step
> > + * above (lookup and AND bitmap): assuming the port field is
> > + * 2048 < 0 0 5 0 >, with starting result bitmap 0x2, and lookup table
> > + * for "port" field from pre-computation example:
> > + *
> > + * ::
> > + *
> > + * bucket
> > + * group 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> > + * 0 0,1
> > + * 1 0,1
> > + * 2 0 1
> > + * 3 0,1
> > + *
> > + * operations are: 0x2 & 0x3 [bucket 0] & 0x3 [bucket 0] & 0x2 [bucket 5]
> > + * & 0x3 [bucket 0], resulting bitmap is 0x2.
> > + *
> > + * - if this is the last field in the set, look up the value from the mapping
> > + * array corresponding to the final result bitmap
> > + *
> > + * Example: 0x2 resulting bitmap from 192.168.1.5:2048, mapping array for
> > + * last field from insertion example:
> > + *
> > + * ::
> > + *
> > + * rule indices in last field: 0 1
> > + * map to elements: 0x42 0x66
>
> Should this be instead?
>
> rule indices in last field: 0 1
> map to elements: 0x66 0x42
>
> If the resulting bitmap is 0x2, then this is actually pointing to
> rule index 1 in this lookup table, that is the 2048.
Right! Good catch, thanks.
I swapped the values also in the "insertion" example above. For some
reason, this was correct in the equivalent example of the stand-alone
implementation:
https://pipapo.lameexcu.se/pipapo/tree/pipapo.c#n162
> > + * the matching element is at 0x42.
>
> Hence, the matching 0x42 element.
>
> Otherwise, I don't understand how to interpret the "result bitmap". I
> thought this contains the matching pipapo rule index that is expressed
> as a bitmask.
Yes, that's correct. Let me know if you want me to send a patch or if
you'd rather fix it.
> [...]
> > +static int pipapo_refill(unsigned long *map, int len, int rules,
> > + unsigned long *dst, union nft_pipapo_map_bucket *mt,
> > + bool match_only)
> > +{
> > + unsigned long bitset;
> > + int k, ret = -1;
> > +
> > + for (k = 0; k < len; k++) {
> > + bitset = map[k];
> > + while (bitset) {
> > + unsigned long t = bitset & -bitset;
> > + int r = __builtin_ctzl(bitset);
> > + int i = k * BITS_PER_LONG + r;
> > +
> > + if (unlikely(i >= rules)) {
> > + map[k] = 0;
> > + return -1;
> > + }
> > +
> > + if (unlikely(match_only)) {
>
> Not a big issue, but this branch holds true for the last field, my
> understanding for unlikely() is that it should be used for paths where
> 99.99...% is not going to happen. Not a big issue, just that when
> reading the code I got confused this is actually likely to happen.
You're right, I wanted to make sure we avoid branching for the "common"
case (this early return happens just once), but this is probably an
abuse. I'll look into a more acceptable way to achieve this.
> > + bitmap_clear(map, i, 1);
> > + return i;
> > + }
> > +
> > + ret = 0;
> > +
> > + bitmap_set(dst, mt[i].to, mt[i].n);
> > +
> > + bitset ^= t;
> > + }
> > + map[k] = 0;
> > + }
> > +
> > + return ret;
> > +}>
--
Stefano
next prev parent reply other threads:[~2020-02-10 15:11 UTC|newest]
Thread overview: 21+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-01-21 23:17 [PATCH nf-next v4 0/9] nftables: Set implementation for arbitrary concatenation of ranges Stefano Brivio
2020-01-21 23:17 ` [PATCH nf-next v4 1/9] netfilter: nf_tables: add nft_setelem_parse_key() Stefano Brivio
2020-01-21 23:17 ` [PATCH nf-next v4 2/9] netfilter: nf_tables: add NFTA_SET_ELEM_KEY_END attribute Stefano Brivio
2020-01-21 23:17 ` [PATCH nf-next v4 3/9] netfilter: nf_tables: Support for sets with multiple ranged fields Stefano Brivio
2020-01-21 23:17 ` [PATCH nf-next v4 4/9] bitmap: Introduce bitmap_cut(): cut bits and shift remaining Stefano Brivio
2020-01-21 23:17 ` [PATCH nf-next v4 5/9] nf_tables: Add set type for arbitrary concatenation of ranges Stefano Brivio
2020-02-07 11:23 ` Pablo Neira Ayuso
2020-02-10 15:10 ` Stefano Brivio [this message]
2020-02-14 18:16 ` Pablo Neira Ayuso
2020-02-14 19:42 ` Stefano Brivio
2020-02-14 20:42 ` Pablo Neira Ayuso
2020-02-14 23:06 ` Stefano Brivio
2020-01-21 23:17 ` [PATCH nf-next v4 6/9] selftests: netfilter: Introduce tests for sets with range concatenation Stefano Brivio
2020-01-21 23:17 ` [PATCH nf-next v4 7/9] nft_set_pipapo: Prepare for vectorised implementation: alignment Stefano Brivio
2020-01-21 23:17 ` [PATCH nf-next v4 8/9] nft_set_pipapo: Prepare for vectorised implementation: helpers Stefano Brivio
2020-01-21 23:17 ` [PATCH nf-next v4 9/9] nft_set_pipapo: Introduce AVX2-based lookup implementation Stefano Brivio
2020-01-27 6:41 ` kbuild test robot
2020-01-27 6:41 ` kbuild test robot
2020-01-27 8:20 ` [PATCH nf-next v4 0/9] nftables: Set implementation for arbitrary concatenation of ranges Pablo Neira Ayuso
2020-02-20 10:52 ` Phil Sutter
2020-02-20 11:04 ` Stefano Brivio
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20200210161047.370582c5@redhat.com \
--to=sbrivio@redhat.com \
--cc=eric@garver.life \
--cc=fw@strlen.de \
--cc=kadlec@blackhole.kfki.hu \
--cc=netfilter-devel@vger.kernel.org \
--cc=pablo@netfilter.org \
--cc=phil@nwl.cc \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.