Re: [PATCH nf-next v4 5/9] nf_tables: Add set type for arbitrary concatenation of ranges

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Stefano Brivio <sbrivio@redhat.com>
To: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: netfilter-devel@vger.kernel.org,
	"Florian Westphal" <fw@strlen.de>,
	"Kadlecsik József" <kadlec@blackhole.kfki.hu>,
	"Eric Garver" <eric@garver.life>, "Phil Sutter" <phil@nwl.cc>
Subject: Re: [PATCH nf-next v4 5/9] nf_tables: Add set type for arbitrary concatenation of ranges
Date: Sat, 15 Feb 2020 00:06:01 +0100	[thread overview]
Message-ID: <20200215000601.12ab6626@elisabeth> (raw)
In-Reply-To: <20200214204225.dh3ubs67vorh2ail@salvia>

On Fri, 14 Feb 2020 21:42:25 +0100
Pablo Neira Ayuso <pablo@netfilter.org> wrote:

> On Fri, Feb 14, 2020 at 08:42:13PM +0100, Stefano Brivio wrote:
> > On Fri, 14 Feb 2020 19:16:34 +0100
> > Pablo Neira Ayuso <pablo@netfilter.org> wrote:  
> [...]
> > > You refer to a property that says that you can split a range into a
> > > 2*n netmasks IIRC. Do you know what is the worst case when splitting
> > > ranges?  
> > 
> > I'm not sure I got your question: that is exactly the worst case, i.e.
> > we can have _up to_ 2 * n netmasks (hence rules) given a range of n
> > bits. There's an additional upper bound on this, given by the address
> > space, but single fields in a concatenation can overlap.
> > 
> > For example, we can have up to 128 rules for an IPv6 range where at
> > least 64 bits differ between the endpoints, and which would contain
> > 2 ^ 64 addresses. Or, say, the IPv4 range 1.2.3.4 - 255.255.0.2 is
> > expressed by 42 rules.
> > 
> > By the way, 0.0.0.1 - 255.255.255.254 takes 62 rules, so we can
> > *probably* say it's 2 * n - 2, but I don't have a formal proof for that.  
> 
> By "splitting" I was actually refering to "expanding", so you're
> replying here to my worst-case range-to-rules expansion question.
> 
> > I have a couple of ways in mind to get that down to n / 2, but it's not
> > straightforward and it will take me some time (assuming it makes
> > sense). For the n bound, we can introduce negations (proof in
> > literature), and I have some kind of ugly prototype. For the n / 2
> > bound, I'd need some auxiliary data structure to keep insertion
> > invertible.  
> 
> OK, so there is room to improve the "rule expansion" logic. I didn't
> spend much time on that front yet.

I wrote some (barely understandable if at all) notes here:
	https://pipapo.lameexcu.se/pipapo/tree/pipapo.c#n271

but I can't guarantee at this stage that it can all be implemented, or
not necessarily in that way. Should you ever (try to) read it, take it
as a rough, buggy draft.

> > In practice, the "average" case is much less, but to define it we would
> > first need to agree on what are the actual components of the
> > multivariate distribution... size and start? Is it a Poisson
> > distribution then? After spending some time on this and disagreeing
> > with myself I'd shyly recommend to skip the topic. :)  
> 
> Yes, I agree to stick to something relatively simple and good is just
> fine.
> 
> > > There is no ipset set like this, but I agree usecase might happen.  
> > 
> > Actually, for ipset, a "net,port,net,port" type was proposed
> > (netfilter-devel <20181216213039.399-1-oliver@uptheinter.net>), but when
> > József enquired about the intended use case, none was given. So maybe
> > this whole "net,net,port,mac" story makes even less sense.  
> 
> Would it make sense to you to restrict pipapo to 3 fields until there
> is someone with a usecase for this?

I wouldn't. I'd rather implement the dynamic switch of bucket size if
this is the alternative. I think the current version is rather generic,
doesn't use too much memory, and is also reasonably tested, so I don't
feel like taking working functionality away.

> [...]
> > > The per-cpu scratch index is only required if we cannot fit in the
> > > "result bitmap" into the stack, right?  
> > 
> > Right.
> >   
> > > Probably up to 256 bytes result bitmap in the stack is reasonable?
> > > That makes 8192 pipapo rules. There will be no need to disable bh and
> > > make use of the percpu scratchpad area in that case.  
> > 
> > Right -- the question is whether that would mean yet another
> > implementation for the lookup function.  
> 
> This would need another lookup function that can be selected from
> control plane path. The set size and the range-to-rule expansion
> worst-case can tell us if it would fit into the stack. It's would be
> just one extra lookup function for this case, ~80-100 LOC.

Ah, you're right, it's not much. On the other hand, advantages seems to
be:

- dropping the local_bh_{disable,enable}() pair: I can't find numbers
  right now, but the difference in pps between v2 (when Florian pointed
  out it was needed) and v1 wasn't statistically relevant

- avoiding the access to per-cpu allocated areas, here I have
  before/after numbers just for net,mac with 1000 entries on the AVX2
  implementation (AMD 7351, one thread), it didn't look significant,
  3.99Mpps instead of 3.90Mpps. Maybe it would make a difference when
  tested on more threads though (as a reference, same CPU, 10 threads,
  as of v2: 38.02Mpps) or with smaller L2

- memory usage (and how that affects locality on some archs), but if we
  do this for cases with small numbers of entries, it's probably not a
  big difference either

Anyway, I guess it's worth a try -- I'll try to find some time for it.

> > > If adjusting the code to deal with variable length "pipapo word" size
> > > is not too convoluted, then you could just deal with the variable word
> > > size from the insert / delete / get (slow) path and register one
> > > lookup function for the version that is optimized for this pipapo word
> > > size.  
> > 
> > Yes, I like this a lot -- we would also need one function to rebuild
> > tables when the word size changes, but that sounds almost trivial.
> > Changes for the slow path are actually rather simple.
> >
> > Still, I start doubting quite heavily that my original worst case is
> > reasonable. If we stick to the one you mentioned, or even something in
> > between, it makes no sense to keep 4-bit buckets.  
> 
> OK, then moving to 8-bits will probably remove a bit of code which is
> dealing with "nibbles".

Right now I have:
	 1 file changed, 16 insertions(+), 37 deletions(-)

but okay, I'll give the dynamic switch a try, first.

> > By the way, I went ahead and tried the 8-bit bucket version of the C
> > implementation only, on my usual x86_64 box (one thread, AMD Epyc 7351).
> > I think it's worth it:
> > 
> >                 4-bit       8-bit
> > net,port
> >  1000 entries   2304165pps  2901299pps
> > port,net
> >  100 entries    4131471pps  4751247pps
> > net6,port
> >  1000 entries   1092557pps  1651037pps
> > port,proto
> >  30000 entries   284147pps   449665pps
> > net6,port,mac
> >  10 entries     2082880pps  2762291pps
> > net6,port,mac,proto
> >  1000 entries    783810pps  1195823pps
> > net,mac
> >  1000 entries   1279122pps  1934003pps  
> 
> Assuming the same concatenation type, larger bucket size makes pps
> drop in the C implementation?

Uh, no, they go up with larger buckets, in the figures above we have
+26%, +15%, +51%, +58%, +33%, +53%, +51%.

> > I would now proceed extending this to the AVX2 implementation and (once
> > I finish it) to the NEON one, I actually expect bigger gains there.  
> 
> Good. BTW, probably you can add a new NFT_SET_CLASS_JIT class that
> comes becomes NFT_SET_CLASS_O_1 to make the set routine that selects
> the set pick the jit version instead.

It's not exactly JIT though :) it's hand-written assembly. It's coming
up a bit nicer for NEON as I can use compiler intrinsics there.

Maybe I didn't understand your point, but I don't think this is needed.
In 9/9 of v4 I simply had, in nft_pipapo_avx2_estimate():

	if (!boot_cpu_has(X86_FEATURE_AVX2) || !boot_cpu_has(X86_FEATURE_AVX))
		return false;

> > > Probably adding helper function to deal with pipapo words would help
> > > to prepare for such update in the future. There is the ->estimate
> > > function that allows to calculate for the best word size depending on
> > > all the information this gets from the set definition.  
> > 
> > Hm, I really think it should be kind of painless to make this dynamic
> > on insertion/deletion.  
> 
> OK, good. How would you like to proceed?

Let me give a shot at the dynamic bucket size switching. If that looks
reasonable, I'd proceed extending that to the AVX2 implementation,
finish the NEON implementation including it right away, and prepare a
series. If it doesn't, I'll report back :)

-- 
Stefano

next prev parent reply	other threads:[~2020-02-14 23:06 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-01-21 23:17 [PATCH nf-next v4 0/9] nftables: Set implementation for arbitrary concatenation of ranges Stefano Brivio
2020-01-21 23:17 ` [PATCH nf-next v4 1/9] netfilter: nf_tables: add nft_setelem_parse_key() Stefano Brivio
2020-01-21 23:17 ` [PATCH nf-next v4 2/9] netfilter: nf_tables: add NFTA_SET_ELEM_KEY_END attribute Stefano Brivio
2020-01-21 23:17 ` [PATCH nf-next v4 3/9] netfilter: nf_tables: Support for sets with multiple ranged fields Stefano Brivio
2020-01-21 23:17 ` [PATCH nf-next v4 4/9] bitmap: Introduce bitmap_cut(): cut bits and shift remaining Stefano Brivio
2020-01-21 23:17 ` [PATCH nf-next v4 5/9] nf_tables: Add set type for arbitrary concatenation of ranges Stefano Brivio
2020-02-07 11:23   ` Pablo Neira Ayuso
2020-02-10 15:10     ` Stefano Brivio
2020-02-14 18:16       ` Pablo Neira Ayuso
2020-02-14 19:42         ` Stefano Brivio
2020-02-14 20:42           ` Pablo Neira Ayuso
2020-02-14 23:06             ` Stefano Brivio [this message]
2020-01-21 23:17 ` [PATCH nf-next v4 6/9] selftests: netfilter: Introduce tests for sets with range concatenation Stefano Brivio
2020-01-21 23:17 ` [PATCH nf-next v4 7/9] nft_set_pipapo: Prepare for vectorised implementation: alignment Stefano Brivio
2020-01-21 23:17 ` [PATCH nf-next v4 8/9] nft_set_pipapo: Prepare for vectorised implementation: helpers Stefano Brivio
2020-01-21 23:17 ` [PATCH nf-next v4 9/9] nft_set_pipapo: Introduce AVX2-based lookup implementation Stefano Brivio
2020-01-27  6:41   ` kbuild test robot
2020-01-27  6:41     ` kbuild test robot
2020-01-27  8:20 ` [PATCH nf-next v4 0/9] nftables: Set implementation for arbitrary concatenation of ranges Pablo Neira Ayuso
2020-02-20 10:52 ` Phil Sutter
2020-02-20 11:04   ` Stefano Brivio

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200215000601.12ab6626@elisabeth \
    --to=sbrivio@redhat.com \
    --cc=eric@garver.life \
    --cc=fw@strlen.de \
    --cc=kadlec@blackhole.kfki.hu \
    --cc=netfilter-devel@vger.kernel.org \
    --cc=pablo@netfilter.org \
    --cc=phil@nwl.cc \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.