* RFC: nftables set selection
From: Patrick McHardy @ 2014-01-21 17:07 UTC
To: netfilter-devel; +Cc: pablo
Hola,
as promised I've started looking into implementing a better mechanism for
automatic set selection by the kernel.
We have multiple different set implementations in the kernel, which provide
different memory and performance characteristics. Userspace should not
know or hard code these characteristics since they may change and
implementations may get added or removed from the kernel.
So the kernel needs a way to select the best-suited implementation
according to the characteristics of the data contained in the set and
user preferences, such as lookup performance, memory usage and update times.
Wrt. this problem, we have two fundamentally different situations:
constant sets, where the entire content is known at the time the set is
created, and non-constant sets, about which in the worst case we know
very little at the time the set is created.
From what I can tell, there are two possibilities for how to do this:
- Dave brought up the idea of simply trying each possible implementation
and comparing the results. The advantage of this is that we have definite
information for both constant and non-constant sets; the downside is
that the update costs may increase significantly. It's not possible in
all cases to give an estimate of memory usage based just on the number
of elements; instead the entire set actually has to be built. This is f.i.
the case for the upcoming array mapped trie, where memory usage is
dependent on the number of clusters in the set.
Possible refinements of this idea are to only try every N changes or
when we notice we have to change something anyway, f.i. because of
too long hash chains.
- My initial idea was to have an abstract description of the data's
characteristics and have the set implementations return an estimate
based on this description. This poses two problems: for non-constant
sets we may not have any data to analyze, so the best we could do is
either have the user manually specify the expected characteristics,
or try to deduce them from the ruleset. For both constant and non-
constant sets we need a notation to describe the data.
Assuming we'd go with the abstract descriptions, the hard part is
getting the information we include right so future implementations
will hopefully have the information they need.
We currently only have two very non-optimized (for our purposes) set
implementations, nft_rbtree and nft_hash. In both cases the number of
elements is pretty much all they need.
I also have two upcoming new set implementations:
- array mapped trie, which is basically a trie with 8 bits per level.
In this case memory usage doesn't depend on the total amount of keys,
but on the amount of clusters requiring new leaf nodes.
- offset bitmap: a fixed size bitmap, where the beginning of the range
is subtracted from the key before the lookup. The costs are constant,
but depend on the range of the keys.
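To illustrate the obmap idea, a minimal lookup sketch (the struct layout
and names are made up for illustration, not actual code):

struct nft_obmap {
	u32		base;				/* start of the key range */
	unsigned long	bits[BITS_TO_LONGS(256)];	/* fixed-size bitmap */
};

static bool nft_obmap_lookup(const struct nft_obmap *priv, u32 key)
{
	u32 idx = key - priv->base;	/* subtract the range start */

	return idx < 256 && test_bit(idx, priv->bits);
}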
So, so far we could make a good decision based on the following information:
- number of keys
- key size
- key range
- number of clusters
The number of keys is of course simply the number of clusters for
distance 0. To avoid tying this to the implementation, we could simply
compute the clusters for each possible distance in powers of two up to
the maximum key length.
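As a sketch of that computation (a hypothetical helper over a sorted
array of unique keys, not actual code; shift 0 yields the number of
keys, shift 8 the number of clusters for distance 256, and so on):

static unsigned int count_clusters(const u32 *keys, unsigned int n,
				   unsigned int shift)
{
	unsigned int i, clusters = 1;

	if (n == 0)
		return 0;
	for (i = 1; i < n; i++)
		if (keys[i] >> shift != keys[i - 1] >> shift)
			clusters++;
	return clusters;
}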
For non-constant keys, we can simplify this by having the user specify f.i.
"dense" to indicate that many keys are expected to be in proximity.
The number of keys would be an expectation specified by the user. If we
have no information at all, we might simply choose rbtrees, which have
higher costs, but don't degrade as badly as f.i. a badly sized fixed hash.
So the questions are:
- how should we proceed?
- if we go for the set descriptions, what other criteria might be required?
I'd expect most set implementations to require rather similar information
to that listed above. What might be missing?
Comments welcome.
* Re: RFC: nftables set selection
From: Pablo Neira Ayuso @ 2014-01-22 10:12 UTC
To: Patrick McHardy; +Cc: netfilter-devel
Hola Patrick,
On Tue, Jan 21, 2014 at 05:07:42PM +0000, Patrick McHardy wrote:
> [...]
> So, so far we could make a good decision based on the following information:
>
> - number of keys
> - key size
> - key range
> - number of clusters
>
> [...]
>
> So the questions are:
>
> - how should we proceed?
> - if we go for the set descriptions, what other criteria might be required?
> I'd expect most set implementations to require rather similar information
> to that listed above. What might be missing?
I also think we need that set description, otherwise empty set
handling is going to get more complicated, since we'll have to
re-evaluate the selected set every time we add a new element. If we
don't have that abstract description, it would have to be inferred
from the elements, which can be expensive for large sets.
More specifically, regarding the set description information fields
that you propose, could you give an example of how the selection
would happen based on them?
Another possibility that I've been considering is that nft makes the
set type selection and tuning from userspace. The idea is that the
kernel provides a listing via netlink of the existing set types,
including a description of the available set type parameters and
performance information (memory consumption per element, lookup time
performance metrics). Based on that information and the abstract set
description, userspace can decide which set type best suits its needs.
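Just to illustrate the kind of listing I have in mind (a sketch; all
attribute names below are invented for this example, none of them
exist):

enum hypothetical_set_type_attributes {
	NFTA_SET_TYPE_UNSPEC,
	NFTA_SET_TYPE_NAME,		/* NLA_STRING: e.g. "hash", "rbtree" */
	NFTA_SET_TYPE_FEATURES,		/* NLA_U32: constant/interval/map flags */
	NFTA_SET_TYPE_MEM_PER_ELEM,	/* NLA_U32: approx. bytes per element */
	NFTA_SET_TYPE_LOOKUP_CLASS,	/* NLA_U32: O(1), O(log n) or O(n) */
	__NFTA_SET_TYPE_MAX
};

Userspace would dump these once and cache them.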
One of the things that I like the most in nftables is that the
"intelligence" resides in userspace, so, once we have the generic
kernel infrastructure in place, you only need to update your userspace
tool to get new features. Moreover, if we improve the set selection
logic in userspace, users won't need to wait for distributors to
upgrade their kernel to get better set selection/tuning. I also think
we'll end up having quite some debate on the set selection algorithm,
as it will be the key to improving performance.
So, why don't we perform this set selection/tuning from userspace?
* Re: RFC: nftables set selection
From: Patrick McHardy @ 2014-01-22 10:51 UTC
To: Pablo Neira Ayuso; +Cc: netfilter-devel
On Wed, Jan 22, 2014 at 11:12:59AM +0100, Pablo Neira Ayuso wrote:
> [...]
>
> I also think we need that set description, otherwise empty set
> handling is going to get more complicated, since we'll have to
> re-evaluate the selected set every time we add a new element. If we
> don't have that abstract description, it would have to be inferred
> from the elements, which can be expensive for large sets.
> More specifically, regarding the set description information fields
> that you propose, could you give an example of how the selection
> would happen based on them?
Well, assuming (for a constant set) userspace provides the information:
- key size 4 bytes
- number of keys 64
- key range 192.168.0.0-192.168.1.200
- number of clusters (just for distance 256 included here) 2
So the kernel would ask each set implementation for an approximation of
expected memory use and performance:
- rbtree:
  size = sizeof(struct nft_rbtree) + 64 * sizeof(struct nft_rbtree_elem)
       = 3080b
  lookup = O(log N)
- hash:
  size = sizeof(struct nft_hash) + 4/3 * 64 * sizeof(struct hlist_head) +
         64 * sizeof(struct nft_hash_elem)
       = 2746b
  lookup = O(1)
- amtrie:
  size = sizeof(struct nft_amtrie) +
         sizeof(struct nft_amtrie_node) +
         num_clusters * (key_size - 2) * (sizeof(struct nft_amtrie_node) +
                                          sizeof(struct nft_amtrie_node *)) +
         num_clusters * sizeof(struct nft_amtrie_node)
       = 280b
  lookup = O(1)
Some explanation here: the calculation is a worst case calculation since
it can't determine exactly how much memory will be used. Each node on
each level can store up to 256 elements or 8 bits of the key.
Intermediate nodes have one pointer for each child node, leaf nodes only
contain the bitmap. Since we don't know whether the leaf nodes
(clusters of size 256) can share intermediate nodes, we assume they can't.
So there is the tree root, pointing to a root node, with paths to
two levels of intermediate nodes, each one pointing to a leaf node.
- obmap:
not usable since range too large
So based on both lookup speed and memory usage the array mapped trie would win.
This doesn't factor in constant costs, which are also smaller for the
array mapped trie than for hashes. For this I'll probably just use some
constant factor.
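To sketch how this could look as an interface (everything below is an
illustrative assumption, not actual code: the structures, the estimate
callback and the selection helper are made up, with nft_amtrie and
nft_amtrie_node standing in for the unposted amtrie structures):

struct nft_set_desc {
	unsigned int	klen;		/* key size in bytes */
	unsigned int	size;		/* (expected) number of keys */
	u64		range;		/* size of the key range */
	unsigned int	clusters256;	/* clusters for distance 256 */
};

struct nft_set_estimate {
	u64		size;		/* approximate memory usage in bytes */
	unsigned int	lookup;		/* relative lookup cost class */
};

/* Worst case amtrie estimate following the calculation above: the root
 * node, then per cluster one node plus one child pointer for each of
 * the (klen - 2) intermediate levels, plus one leaf node per cluster. */
static bool nft_amtrie_estimate(const struct nft_set_desc *desc,
				struct nft_set_estimate *est)
{
	unsigned int levels = desc->klen - 2;

	est->size = sizeof(struct nft_amtrie) +
		    sizeof(struct nft_amtrie_node) +
		    desc->clusters256 * levels *
		    (sizeof(struct nft_amtrie_node) +
		     sizeof(struct nft_amtrie_node *)) +
		    desc->clusters256 * sizeof(struct nft_amtrie_node);
	est->lookup = 0;	/* O(1) */
	return true;
}

/* Selection then asks every registered implementation (assuming an
 * ->estimate() hook in struct nft_set_ops) and keeps the cheapest. */
static const struct nft_set_ops *
nft_select_set_ops(const struct nft_set_desc *desc)
{
	struct nft_set_ops *ops, *bops = NULL;
	struct nft_set_estimate est, best;

	list_for_each_entry(ops, &nf_tables_set_ops, list) {
		if (!ops->estimate(desc, &est))
			continue;	/* e.g. obmap: range too large */
		if (bops == NULL || est.size < best.size ||
		    (est.size == best.size && est.lookup < best.lookup)) {
			bops = ops;
			best = est;
		}
	}
	return bops;
}

With the numbers above, this would pick the amtrie.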
> Another possibility that I've been considering is that nft makes the
> set type selection and tuning from userspace. The idea is that the
> kernel provides a listing via netlink of the existing set types,
> including a description of the available set type parameters and
> performance information (memory consumption per element, lookup time
> performance metrics). Based on that information and the abstract set
> description, userspace can decide which set type best suits its needs.
That's probably even harder than the other way around. As you can see
in the amtrie example (and it applies to the obmap even more), there is
not always a constant cost per element. The obmap has completely constant
costs, but is only usable for small ranges; the amtrie has memory costs
which depend on the number of clusters of size 256.
I think it's not unreasonable to expect that most algorithms we would
use require similar information. Compressed tries would probably differ
slightly, but we'd be using either trees, tries or bitmaps, I suppose.
> [...]
>
> So, why don't we perform this set selection/tuning from userspace?
I believe the statistical properties of the data can be described more
easily than the properties of the algorithms, especially wrt. memory usage.
* Re: RFC: nftables set selection
From: Pablo Neira Ayuso @ 2014-01-23 10:14 UTC
To: Patrick McHardy; +Cc: netfilter-devel
On Wed, Jan 22, 2014 at 10:51:37AM +0000, Patrick McHardy wrote:
> [...]
Thanks for the detailed example.
> [...]
>
> - obmap:
> not usable since range too large
What information does the obmap use to decide whether the range fits?
> So based on both lookup speed and memory usage the array mapped trie would win.
> This doesn't factor in constant costs, which are also smaller for the
> array mapped trie than for hashes. For this I'll probably just use some
> constant factor.
I see, with the new amtrie we need the memory consumption
approximation; otherwise I guess we would need to simulate the
addition of that element to know the precise memory consumption that
results from inserting it.
> > [...]
>
> That's probably even harder than the other way around. As you can see
> in the amtrie example (and it applies to the obmap even more), there is
> not always a constant cost per element. The obmap has completely constant
> costs, but is only usable for small ranges; the amtrie has memory costs
> which depend on the number of clusters of size 256.
>
> I think it's not unreasonable to expect that most algorithms we would
> use require similar information. Compressed tries would probably differ
> slightly, but we'd be using either trees, tries or bitmaps, I suppose.
>
> > [...]
> >
> > So, why don't we perform this set selection/tuning from userspace?
>
> I believe the statistical properties of the data can be described more
> easily than the properties of the algorithms, especially wrt. memory usage.
Wouldn't the approximate memory consumption and lookup cost be enough to
perform a good set selection from userspace?
* Re: RFC: nftables set selection
From: Patrick McHardy @ 2014-01-23 11:16 UTC
To: Pablo Neira Ayuso; +Cc: netfilter-devel
On Thu, Jan 23, 2014 at 11:14:29AM +0100, Pablo Neira Ayuso wrote:
> On Wed, Jan 22, 2014 at 10:51:37AM +0000, Patrick McHardy wrote:
> > [...]
> > - obmap:
> > not usable since range too large
>
> What information does the obmap use to decide whether the range fits?
Simply the size of the range vs. the size of the bitmap (currently
256 elements). The size of the key doesn't matter, since we subtract
the beginning of the range before the lookup (hence offset bitmap).
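As a sketch, assuming the current 256-element bitmap (the helper name
is hypothetical):

static bool nft_obmap_range_ok(u32 range_start, u32 range_end)
{
	/* the inclusive range must span at most 256 values */
	return range_end - range_start < 256;
}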
> > [...]
>
> I see, with the new amtrie we need the memory consumption
> approximation; otherwise I guess we would need to simulate the
> addition of that element to know the precise memory consumption that
> results from inserting it.
Well, actually it's worse than that. If the element falls into an existing
range of 256 elements, it won't consume *any* memory. If not, it might
require up to three new nodes and two pointers. So it's not dependent on
the number of elements, but on their proximity.
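As a sketch of that upper bound (a hypothetical helper, with
nft_amtrie_node as in the earlier example; klen is the key size in
bytes):

static size_t nft_amtrie_insert_max(unsigned int klen)
{
	/* one leaf plus (klen - 2) intermediate nodes, and one child
	 * pointer per new intermediate node; zero bytes if the key
	 * falls into an existing leaf */
	return (klen - 1) * sizeof(struct nft_amtrie_node) +
	       (klen - 2) * sizeof(struct nft_amtrie_node *);
}

For 4-byte keys this gives the three nodes and two pointers mentioned
above.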
> [...]
>
> Wouldn't the approximate memory consumption and lookup cost be enough to
> perform a good set selection from userspace?
Well, if we give the approximated data to userspace, we already need
the description. If userspace is to do the approximation itself, we need
a notation to describe the memory and performance characteristics.
I'm having a hard time coming up with something fitting. How would you
describe the amtrie or obmap? Or even worse, a compressed trie?