Ottawa and slow hash-table resize

netfilter-devel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Ottawa and slow hash-table resize
@ 2015-02-23 18:49 Paul E. McKenney
  2015-02-23 19:12 ` josh
                   ` (2 more replies)
  0 siblings, 3 replies; 33+ messages in thread
From: Paul E. McKenney @ 2015-02-23 18:49 UTC (permalink / raw)
  To: alexei.starovoitov, herbert, tgraf, kaber, davem, ying.xue
  Cc: netdev, netfilter-devel, josh

Hello!

Alexei mentioned that there was some excitement a couple of weeks ago in
Ottawa, something about the resizing taking forever when there were large
numbers of concurrent additions.  One approach comes to mind:

o	Currently, the hash table does not allow additions concurrently
	with resize operations.  One way to allow this would be to
	have the addition operations add to the new hash table at the
	head of the lists.  This would clearly require also updating the
	pointers used to control the unzip operation.

o	Count the number of entries added during the resize operation.
	Then, at the end of the resize operation, if enough entries have
	been added, do a resize, but by multiple factors of two if
	need be.

This should allow the table to take arbitrarily large numbers of updates
during a resize operation.  There are some other possibilities if this
approach does not work out.

Thoughts?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Ottawa and slow hash-table resize
  2015-02-23 18:49 Ottawa and slow hash-table resize Paul E. McKenney
@ 2015-02-23 19:12 ` josh
  2015-02-23 21:03   ` Thomas Graf
  2015-02-23 21:00 ` Thomas Graf
  2015-02-24  8:59 ` Thomas Graf
  2 siblings, 1 reply; 33+ messages in thread
From: josh @ 2015-02-23 19:12 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: alexei.starovoitov, herbert, tgraf, kaber, davem, ying.xue,
	netdev, netfilter-devel

On Mon, Feb 23, 2015 at 10:49:04AM -0800, Paul E. McKenney wrote:
> Hello!
> 
> Alexei mentioned that there was some excitement a couple of weeks ago in
> Ottawa, something about the resizing taking forever when there were large
> numbers of concurrent additions.  One approach comes to mind:
> 
> o	Currently, the hash table does not allow additions concurrently
> 	with resize operations.  One way to allow this would be to
> 	have the addition operations add to the new hash table at the
> 	head of the lists.  This would clearly require also updating the
> 	pointers used to control the unzip operation.

In theory, resizes should only take the locks for the buckets they're
currently unzipping, and adds should take those same locks.  Neither one
should take a whole-table lock, other than resize excluding concurrent
resizes.  Is that still insufficient?

> o	Count the number of entries added during the resize operation.
> 	Then, at the end of the resize operation, if enough entries have
> 	been added, do a resize, but by multiple factors of two if
> 	need be.

Yeah, the add/remove statistics used for tracking would need some
special handling to avoid being a table-wide bottleneck.

- Josh Triplett

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Ottawa and slow hash-table resize
  2015-02-23 19:12 ` josh
@ 2015-02-23 21:03   ` Thomas Graf
  2015-02-23 21:52     ` Paul E. McKenney
  0 siblings, 1 reply; 33+ messages in thread
From: Thomas Graf @ 2015-02-23 21:03 UTC (permalink / raw)
  To: josh
  Cc: Paul E. McKenney, alexei.starovoitov, herbert, kaber, davem,
	ying.xue, netdev, netfilter-devel

On 02/23/15 at 11:12am, josh@joshtriplett.org wrote:
> In theory, resizes should only take the locks for the buckets they're
> currently unzipping, and adds should take those same locks.  Neither one
> should take a whole-table lock, other than resize excluding concurrent
> resizes.  Is that still insufficient?

Correct, this is what happens. The problem is basically that
if we insert from atomic context we cannot slow down inserts
and the table may not grow quickly enough.

> Yeah, the add/remove statistics used for tracking would need some
> special handling to avoid being a table-wide bottleneck.

Daniel is working on a patch to do per-cpu element counting
with a batched update cycle.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Ottawa and slow hash-table resize
  2015-02-23 21:03   ` Thomas Graf
@ 2015-02-23 21:52     ` Paul E. McKenney
  2015-02-23 22:32       ` David Miller
  0 siblings, 1 reply; 33+ messages in thread
From: Paul E. McKenney @ 2015-02-23 21:52 UTC (permalink / raw)
  To: Thomas Graf
  Cc: josh, alexei.starovoitov, herbert, kaber, davem, ying.xue, netdev,
	netfilter-devel

On Mon, Feb 23, 2015 at 09:03:58PM +0000, Thomas Graf wrote:
> On 02/23/15 at 11:12am, josh@joshtriplett.org wrote:
> > In theory, resizes should only take the locks for the buckets they're
> > currently unzipping, and adds should take those same locks.  Neither one
> > should take a whole-table lock, other than resize excluding concurrent
> > resizes.  Is that still insufficient?
> 
> Correct, this is what happens. The problem is basically that
> if we insert from atomic context we cannot slow down inserts
> and the table may not grow quickly enough.
> 
> > Yeah, the add/remove statistics used for tracking would need some
> > special handling to avoid being a table-wide bottleneck.
> 
> Daniel is working on a patch to do per-cpu element counting
> with a batched update cycle.

One approach is simply to count only when a resize operation is in
flight.  Another is to keep a per-bucket count, which can be summed
at the beginning of the next resize operation.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Ottawa and slow hash-table resize
  2015-02-23 21:52     ` Paul E. McKenney
@ 2015-02-23 22:32       ` David Miller
  2015-02-23 23:06         ` Paul E. McKenney
  0 siblings, 1 reply; 33+ messages in thread
From: David Miller @ 2015-02-23 22:32 UTC (permalink / raw)
  To: paulmck
  Cc: tgraf, josh, alexei.starovoitov, herbert, kaber, ying.xue, netdev,
	netfilter-devel

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Date: Mon, 23 Feb 2015 13:52:49 -0800

> On Mon, Feb 23, 2015 at 09:03:58PM +0000, Thomas Graf wrote:
>> On 02/23/15 at 11:12am, josh@joshtriplett.org wrote:
>> > In theory, resizes should only take the locks for the buckets they're
>> > currently unzipping, and adds should take those same locks.  Neither one
>> > should take a whole-table lock, other than resize excluding concurrent
>> > resizes.  Is that still insufficient?
>> 
>> Correct, this is what happens. The problem is basically that
>> if we insert from atomic context we cannot slow down inserts
>> and the table may not grow quickly enough.
>> 
>> > Yeah, the add/remove statistics used for tracking would need some
>> > special handling to avoid being a table-wide bottleneck.
>> 
>> Daniel is working on a patch to do per-cpu element counting
>> with a batched update cycle.
> 
> One approach is simply to count only when a resize operation is in
> flight.  Another is to keep a per-bucket count, which can be summed
> at the beginning of the next resize operation.

I think we should think exactly about what we should do when someone
loops non-stop adding 1 million entries to the hash table and the
initial table size is very small.

This is a common use case for at least one of the current rhashtable
users (nft_hash).  When you load an nftables rule with a large set
of IP addresses attached, this is what happens.

Yes I understand that nftables could give a hint and start with a
larger hash size from the start when it knows this is going to happen,
but I still believe that we should behave reasonably when starting
from a small table.

I'd say that with the way things work right now, in this situation it
actually hurts to allow asynchronous inserts during a resize.  Because
we end up with extremely long hash table chains, and thus make the
resize work and the lookups both take an excruciatingly long amount of
time to complete.

I just did a quick scan of all code paths that do inserts into an
rhashtable, and it seems like all of them can easily block.  So why
don't we do that?  Make inserts sleep on an rhashtable expansion
waitq.

There could even be a counter of pending inserts, so the expander can
decide to expand further before waking the inserting threads up.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Ottawa and slow hash-table resize
  2015-02-23 22:32       ` David Miller
@ 2015-02-23 23:06         ` Paul E. McKenney
  2015-02-24  8:37           ` Thomas Graf
  0 siblings, 1 reply; 33+ messages in thread
From: Paul E. McKenney @ 2015-02-23 23:06 UTC (permalink / raw)
  To: David Miller
  Cc: tgraf, josh, alexei.starovoitov, herbert, kaber, ying.xue, netdev,
	netfilter-devel

On Mon, Feb 23, 2015 at 05:32:52PM -0500, David Miller wrote:
> From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> Date: Mon, 23 Feb 2015 13:52:49 -0800
> 
> > On Mon, Feb 23, 2015 at 09:03:58PM +0000, Thomas Graf wrote:
> >> On 02/23/15 at 11:12am, josh@joshtriplett.org wrote:
> >> > In theory, resizes should only take the locks for the buckets they're
> >> > currently unzipping, and adds should take those same locks.  Neither one
> >> > should take a whole-table lock, other than resize excluding concurrent
> >> > resizes.  Is that still insufficient?
> >> 
> >> Correct, this is what happens. The problem is basically that
> >> if we insert from atomic context we cannot slow down inserts
> >> and the table may not grow quickly enough.
> >> 
> >> > Yeah, the add/remove statistics used for tracking would need some
> >> > special handling to avoid being a table-wide bottleneck.
> >> 
> >> Daniel is working on a patch to do per-cpu element counting
> >> with a batched update cycle.
> > 
> > One approach is simply to count only when a resize operation is in
> > flight.  Another is to keep a per-bucket count, which can be summed
> > at the beginning of the next resize operation.
> 
> I think we should think exactly about what we should do when someone
> loops non-stop adding 1 million entries to the hash table and the
> initial table size is very small.
> 
> This is a common use case for at least one of the current rhashtable
> users (nft_hash).  When you load an nftables rule with a large set
> of IP addresses attached, this is what happens.
> 
> Yes I understand that nftables could give a hint and start with a
> larger hash size from the start when it knows this is going to happen,
> but I still believe that we should behave reasonably when starting
> from a small table.
> 
> I'd say that with the way things work right now, in this situation it
> actually hurts to allow asynchronous inserts during a resize.  Because
> we end up with extremely long hash table chains, and thus make the
> resize work and the lookups both take an excruciatingly long amount of
> time to complete.
> 
> I just did a quick scan of all code paths that do inserts into an
> rhashtable, and it seems like all of them can easily block.  So why
> don't we do that?  Make inserts sleep on an rhashtable expansion
> waitq.
> 
> There could even be a counter of pending inserts, so the expander can
> decide to expand further before waking the inserting threads up.

Should be reasonably simple, and certainly seems worth a try!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Ottawa and slow hash-table resize
  2015-02-23 23:06         ` Paul E. McKenney
@ 2015-02-24  8:37           ` Thomas Graf
  2015-02-24 10:39             ` Patrick McHardy
  0 siblings, 1 reply; 33+ messages in thread
From: Thomas Graf @ 2015-02-24  8:37 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: David Miller, josh, alexei.starovoitov, herbert, kaber, ying.xue,
	netdev, netfilter-devel

On 02/23/15 at 03:06pm, Paul E. McKenney wrote:
> On Mon, Feb 23, 2015 at 05:32:52PM -0500, David Miller wrote:
> > I just did a quick scan of all code paths that do inserts into an
> > rhashtable, and it seems like all of them can easily block.  So why
> > don't we do that?  Make inserts sleep on an rhashtable expansion
> > waitq.
> > 
> > There could even be a counter of pending inserts, so the expander can
> > decide to expand further before waking the inserting threads up.
> 
> Should be reasonably simple, and certainly seems worth a try!

Agreed. Definitely desirable for nft_hash. I like the pending counter
idea. I'm experimenting with various ideas on blocking inserts for
Netlink. Blocking too long might open DoS vectors as one app could
easily delay the creation of sockets for other applications.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Ottawa and slow hash-table resize
  2015-02-24  8:37           ` Thomas Graf
@ 2015-02-24 10:39             ` Patrick McHardy
  2015-02-24 10:46               ` David Laight
  2015-02-24 17:09               ` David Miller
  0 siblings, 2 replies; 33+ messages in thread
From: Patrick McHardy @ 2015-02-24 10:39 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Paul E. McKenney, David Miller, josh, alexei.starovoitov, herbert,
	ying.xue, netdev, netfilter-devel

On 24.02, Thomas Graf wrote:
> On 02/23/15 at 03:06pm, Paul E. McKenney wrote:
> > On Mon, Feb 23, 2015 at 05:32:52PM -0500, David Miller wrote:
> > > I just did a quick scan of all code paths that do inserts into an
> > > rhashtable, and it seems like all of them can easily block.  So why
> > > don't we do that?  Make inserts sleep on an rhashtable expansion
> > > waitq.
> > > 
> > > There could even be a counter of pending inserts, so the expander can
> > > decide to expand further before waking the inserting threads up.
> > 
> > Should be reasonably simple, and certainly seems worth a try!
> 
> Agreed. Definitely desirable for nft_hash. I like the pending counter
> idea. I'm experimenting with various ideas on blocking inserts for
> Netlink. Blocking too long might open DoS vectors as one app could
> easily delay the creation of sockets for other applications.

Regarding nft_hash, blocking in the netlink path certainly seems fine,
but we will soon also have inserts from the packet processing path,
where we obviously can't block.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* RE: Ottawa and slow hash-table resize
  2015-02-24 10:39             ` Patrick McHardy
@ 2015-02-24 10:46               ` David Laight
  2015-02-24 10:48                 ` Patrick McHardy
  2015-02-24 17:09               ` David Miller
  1 sibling, 1 reply; 33+ messages in thread
From: David Laight @ 2015-02-24 10:46 UTC (permalink / raw)
  To: 'Patrick McHardy', Thomas Graf
  Cc: Paul E. McKenney, David Miller, josh@joshtriplett.org,
	alexei.starovoitov@gmail.com, herbert@gondor.apana.org.au,
	ying.xue@windriver.com, netdev@vger.kernel.org,
	netfilter-devel@vger.kernel.org

From: Patrick McHardy
> On 24.02, Thomas Graf wrote:
> > On 02/23/15 at 03:06pm, Paul E. McKenney wrote:
> > > On Mon, Feb 23, 2015 at 05:32:52PM -0500, David Miller wrote:
> > > > I just did a quick scan of all code paths that do inserts into an
> > > > rhashtable, and it seems like all of them can easily block.  So why
> > > > don't we do that?  Make inserts sleep on an rhashtable expansion
> > > > waitq.
> > > >
> > > > There could even be a counter of pending inserts, so the expander can
> > > > decide to expand further before waking the inserting threads up.
> > >
> > > Should be reasonably simple, and certainly seems worth a try!
> >
> > Agreed. Definitely desirable for nft_hash. I like the pending counter
> > idea. I'm experimenting with various ideas on blocking inserts for
> > Netlink. Blocking too long might open DoS vectors as one app could
> > easily delay the creation of sockets for other applications.
> 
> Regarding nft_hash, blocking in the netlink path certainly seems fine,
> but we will soon also have inserts from the packet processing path,
> where we obviously can't block.

Why not an option to do a synchronous 'expand on insert' for
codepaths that can block?

	David

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Ottawa and slow hash-table resize
  2015-02-24 10:46               ` David Laight
@ 2015-02-24 10:48                 ` Patrick McHardy
  0 siblings, 0 replies; 33+ messages in thread
From: Patrick McHardy @ 2015-02-24 10:48 UTC (permalink / raw)
  To: David Laight
  Cc: Thomas Graf, Paul E. McKenney, David Miller,
	josh@joshtriplett.org, alexei.starovoitov@gmail.com,
	herbert@gondor.apana.org.au, ying.xue@windriver.com,
	netdev@vger.kernel.org, netfilter-devel@vger.kernel.org

On 24.02, David Laight wrote:
> From: Patrick McHardy
> > On 24.02, Thomas Graf wrote:
> > > On 02/23/15 at 03:06pm, Paul E. McKenney wrote:
> > > > On Mon, Feb 23, 2015 at 05:32:52PM -0500, David Miller wrote:
> > > > > I just did a quick scan of all code paths that do inserts into an
> > > > > rhashtable, and it seems like all of them can easily block.  So why
> > > > > don't we do that?  Make inserts sleep on an rhashtable expansion
> > > > > waitq.
> > > > >
> > > > > There could even be a counter of pending inserts, so the expander can
> > > > > decide to expand further before waking the inserting threads up.
> > > >
> > > > Should be reasonably simple, and certainly seems worth a try!
> > >
> > > Agreed. Definitely desirable for nft_hash. I like the pending counter
> > > idea. I'm experimenting with various ideas on blocking inserts for
> > > Netlink. Blocking too long might open DoS vectors as one app could
> > > easily delay the creation of sockets for other applications.
> > 
> > Regarding nft_hash, blocking in the netlink path certainly seems fine,
> > but we will soon also have inserts from the packet processing path,
> > where we obviously can't block.
> 
> Why not an option to do a synchronous 'expand on insert' for
> codepaths that can block?

Sounds perfectly fine to me, just wanted to point out that we also need
to handle the non-synchronous case properly.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Ottawa and slow hash-table resize
  2015-02-24 10:39             ` Patrick McHardy
  2015-02-24 10:46               ` David Laight
@ 2015-02-24 17:09               ` David Miller
  2015-02-24 17:50                 ` Thomas Graf
  2015-02-25  8:55                 ` Herbert Xu
  1 sibling, 2 replies; 33+ messages in thread
From: David Miller @ 2015-02-24 17:09 UTC (permalink / raw)
  To: kaber
  Cc: tgraf, paulmck, josh, alexei.starovoitov, herbert, ying.xue,
	netdev, netfilter-devel

From: Patrick McHardy <kaber@trash.net>
Date: Tue, 24 Feb 2015 10:39:18 +0000

> On 24.02, Thomas Graf wrote:
>> On 02/23/15 at 03:06pm, Paul E. McKenney wrote:
>> > On Mon, Feb 23, 2015 at 05:32:52PM -0500, David Miller wrote:
>> > > I just did a quick scan of all code paths that do inserts into an
>> > > rhashtable, and it seems like all of them can easily block.  So why
>> > > don't we do that?  Make inserts sleep on an rhashtable expansion
>> > > waitq.
>> > > 
>> > > There could even be a counter of pending inserts, so the expander can
>> > > decide to expand further before waking the inserting threads up.
>> > 
>> > Should be reasonably simple, and certainly seems worth a try!
>> 
>> Agreed. Definitely desirable for nft_hash. I like the pending counter
>> idea. I'm experimenting with various ideas on blocking inserts for
>> Netlink. Blocking too long might open DoS vectors as one app could
>> easily delay the creation of sockets for other applications.
> 
> Regarding nft_hash, blocking in the netlink path certainly seems fine,
> but we will soon also have inserts from the packet processing path,
> where we obviously can't block.

Indeed, I remembered last night that for TCP sockets blocking will not
work at all.

And having a flood of 1 million new TCP connections all at once
shouldn't knock us over.

Therefore, we will need to find a way to handle this problem without
being able to block on insert.

Thinking about this, if inserts occur during a pending resize, if the
nelems of the table has exceeded even the grow threshold for the new
table, it makes no sense to allow these async inserts as they are
going to make the resize take longer and prolong the pain.

On one hand I like the async resize because it means that an insert
that triggers the resize doesn't incur a huge latency spike since
it was simply unlucky to be the resize trigger event.  The async
resize smoothes out the cost of the resize across the system.

This scheme works really well if, on average, the resize operation
completes before enough subsequent inserts occur to exceed even
the resized tables resize threshold.

So I think what I'm getting at is that we can allow parallel inserts
but only up until the point where the resized tables thresholds are
exceeded.

Looking at how to implement this, I think that there is too much
configurability to this code.  There is no reason to have indirect
calls for the grow decision.  This should be a quick test, but it's
not because we go through ->grow_decision.  It should just be
rht_grow_above_75 or whatever, and inline this crap!

Nobody even uses this indirection capability, it's therefore over
engineered :-)

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Ottawa and slow hash-table resize
  2015-02-24 17:09               ` David Miller
@ 2015-02-24 17:50                 ` Thomas Graf
  2015-02-24 18:26                   ` David Miller
  2015-02-24 18:33                   ` josh
  2015-02-25  8:55                 ` Herbert Xu
  1 sibling, 2 replies; 33+ messages in thread
From: Thomas Graf @ 2015-02-24 17:50 UTC (permalink / raw)
  To: David Miller
  Cc: kaber, paulmck, josh, alexei.starovoitov, herbert, ying.xue,
	netdev, netfilter-devel

On 02/24/15 at 12:09pm, David Miller wrote:
> And having a flood of 1 million new TCP connections all at once
> shouldn't knock us over.
> 
> Therefore, we will need to find a way to handle this problem without
> being able to block on insert.

One possible way to handle this is to have users like TCP grow
quicker than 2x. Maybe start with 16x and grow slower and slower
using a log function. (No, we do not want rhashtable congestion
control algos ;-)

> Thinking about this, if inserts occur during a pending resize, if the
> nelems of the table has exceeded even the grow threshold for the new
> table, it makes no sense to allow these async inserts as they are
> going to make the resize take longer and prolong the pain.

Let's say we start with an initial table size of 16K (we can make
this system memory depenend) and we grow by 8x. New inserts go
into the new table immediately so as soon as we have 12K entries
we'll grow right to 128K buckets. As we grow above 75K we'll start
growing to 1024K buckets. New entries already go to the 1024K
buckets at this point given that the first grow cycle should be
fast. The 2nd grow cycle would take an est 6 RCU grace periods.
This would also still give us a max of 8K bucket locks which
should be good enough as well.

Just thinking this out loud. Still working on this.

> On one hand I like the async resize because it means that an insert
> that triggers the resize doesn't incur a huge latency spike since
> it was simply unlucky to be the resize trigger event.  The async
> resize smoothes out the cost of the resize across the system.
> 
> This scheme works really well if, on average, the resize operation
> completes before enough subsequent inserts occur to exceed even
> the resized tables resize threshold.
> 
> So I think what I'm getting at is that we can allow parallel inserts
> but only up until the point where the resized tables thresholds are
> exceeded.
> 
> Looking at how to implement this, I think that there is too much
> configurability to this code.  There is no reason to have indirect
> calls for the grow decision.  This should be a quick test, but it's
> not because we go through ->grow_decision.  It should just be
> rht_grow_above_75 or whatever, and inline this crap!
> 
> Nobody even uses this indirection capability, it's therefore over
> engineered :-)

Another option is to only call the grow_decision once every N inserts
or removals (32? 64?) and handle updates as batches. No objection
to ditching the grow/shrink function for now though. Not sure we
anyone actually needs different growth semantics.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Ottawa and slow hash-table resize
  2015-02-24 17:50                 ` Thomas Graf
@ 2015-02-24 18:26                   ` David Miller
  2015-02-24 18:45                     ` josh
  2015-02-24 18:33                   ` josh
  1 sibling, 1 reply; 33+ messages in thread
From: David Miller @ 2015-02-24 18:26 UTC (permalink / raw)
  To: tgraf
  Cc: kaber, paulmck, josh, alexei.starovoitov, herbert, ying.xue,
	netdev, netfilter-devel

From: Thomas Graf <tgraf@suug.ch>
Date: Tue, 24 Feb 2015 17:50:14 +0000

> On 02/24/15 at 12:09pm, David Miller wrote:
>> Thinking about this, if inserts occur during a pending resize, if the
>> nelems of the table has exceeded even the grow threshold for the new
>> table, it makes no sense to allow these async inserts as they are
>> going to make the resize take longer and prolong the pain.
> 
> Let's say we start with an initial table size of 16K (we can make
> this system memory depenend) and we grow by 8x. New inserts go
> into the new table immediately so as soon as we have 12K entries
> we'll grow right to 128K buckets. As we grow above 75K we'll start
> growing to 1024K buckets. New entries already go to the 1024K
> buckets at this point given that the first grow cycle should be
> fast. The 2nd grow cycle would take an est 6 RCU grace periods.
> This would also still give us a max of 8K bucket locks which
> should be good enough as well.

Actually, first of all, let's not start with larger tables.

The network namespace folks showed clearly that hash tables
are detrimental to per-ns memory costs.  So they definitely
want us to start with extremely small tables.

But once we know something is actively used, sure, increase
the table grow rate as a response to demand.

So how feasible is it to grow by 4x, 8x, or other powers of
two in one resize operation?

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Ottawa and slow hash-table resize
  2015-02-24 18:26                   ` David Miller
@ 2015-02-24 18:45                     ` josh
  2015-02-24 22:34                       ` Thomas Graf
  0 siblings, 1 reply; 33+ messages in thread
From: josh @ 2015-02-24 18:45 UTC (permalink / raw)
  To: David Miller
  Cc: tgraf, kaber, paulmck, alexei.starovoitov, herbert, ying.xue,
	netdev, netfilter-devel

On Tue, Feb 24, 2015 at 01:26:03PM -0500, David Miller wrote:
> From: Thomas Graf <tgraf@suug.ch>
> Date: Tue, 24 Feb 2015 17:50:14 +0000
> 
> > On 02/24/15 at 12:09pm, David Miller wrote:
> >> Thinking about this, if inserts occur during a pending resize, if the
> >> nelems of the table has exceeded even the grow threshold for the new
> >> table, it makes no sense to allow these async inserts as they are
> >> going to make the resize take longer and prolong the pain.
> > 
> > Let's say we start with an initial table size of 16K (we can make
> > this system memory depenend) and we grow by 8x. New inserts go
> > into the new table immediately so as soon as we have 12K entries
> > we'll grow right to 128K buckets. As we grow above 75K we'll start
> > growing to 1024K buckets. New entries already go to the 1024K
> > buckets at this point given that the first grow cycle should be
> > fast. The 2nd grow cycle would take an est 6 RCU grace periods.
> > This would also still give us a max of 8K bucket locks which
> > should be good enough as well.
> 
> Actually, first of all, let's not start with larger tables.
> 
> The network namespace folks showed clearly that hash tables
> are detrimental to per-ns memory costs.  So they definitely
> want us to start with extremely small tables.

Agreed; ideally, the initial table size would just use a single page for
the array of bucket heads, which would give 1024 buckets on 32-bit
systems or 512 on 64-bit systems.  That's more than enough for many
client systems, and for many single-application network namespaces.

> But once we know something is actively used, sure, increase
> the table grow rate as a response to demand.
> 
> So how feasible is it to grow by 4x, 8x, or other powers of
> two in one resize operation?

Quite feasible.  Actually, any integer multiple works fine, though I
think a power of two makes sense.  I'd suggest trying 4x with the same
workloads that had an issue at 2x, and seeing how that goes.

- Josh Triplett

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Ottawa and slow hash-table resize
  2015-02-24 18:45                     ` josh
@ 2015-02-24 22:34                       ` Thomas Graf
  2015-02-25  8:56                         ` Herbert Xu
  0 siblings, 1 reply; 33+ messages in thread
From: Thomas Graf @ 2015-02-24 22:34 UTC (permalink / raw)
  To: josh
  Cc: David Miller, kaber, paulmck, alexei.starovoitov, herbert,
	ying.xue, netdev, netfilter-devel

On 02/24/15 at 10:45am, josh@joshtriplett.org wrote:
> On Tue, Feb 24, 2015 at 01:26:03PM -0500, David Miller wrote:
> > Actually, first of all, let's not start with larger tables.
> > 
> > The network namespace folks showed clearly that hash tables
> > are detrimental to per-ns memory costs.  So they definitely
> > want us to start with extremely small tables.
> 
> Agreed; ideally, the initial table size would just use a single page for
> the array of bucket heads, which would give 1024 buckets on 32-bit
> systems or 512 on 64-bit systems.  That's more than enough for many
> client systems, and for many single-application network namespaces.

No objection at all. I certainly understand the implications to
netns. After all this is the reason why rhashtable exists. However,
the initial tabe size plus number of growth cycles has some
implications on max number of bucket locks (see below). So it's
a matter of balance that needs some thought and experimentation.

> > But once we know something is actively used, sure, increase
> > the table grow rate as a response to demand.
> > 
> > So how feasible is it to grow by 4x, 8x, or other powers of
> > two in one resize operation?
> 
> Quite feasible.  Actually, any integer multiple works fine, though I
> think a power of two makes sense.  I'd suggest trying 4x with the same
> workloads that had an issue at 2x, and seeing how that goes.

There is a side effect. We can't grow the number of bucket locks more
than 2x if we grow the table itself faster than 2x. So if we start
out with a table size of 512 and grow 4 times in a row we will end
up with a theoretical max bucket locks of 4K. Probably enough though.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Ottawa and slow hash-table resize
  2015-02-24 22:34                       ` Thomas Graf
@ 2015-02-25  8:56                         ` Herbert Xu
  2015-02-25 17:38                           ` Thomas Graf
  0 siblings, 1 reply; 33+ messages in thread
From: Herbert Xu @ 2015-02-25  8:56 UTC (permalink / raw)
  To: Thomas Graf
  Cc: josh, David Miller, kaber, paulmck, alexei.starovoitov, ying.xue,
	netdev, netfilter-devel

On Tue, Feb 24, 2015 at 10:34:09PM +0000, Thomas Graf wrote:
>
> There is a side effect. We can't grow the number of bucket locks more
> than 2x if we grow the table itself faster than 2x. So if we start
> out with a table size of 512 and grow 4 times in a row we will end
> up with a theoretical max bucket locks of 4K. Probably enough though.

Does this limit on bucket locks apply to my rehash scheme?

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Ottawa and slow hash-table resize
  2015-02-25  8:56                         ` Herbert Xu
@ 2015-02-25 17:38                           ` Thomas Graf
  0 siblings, 0 replies; 33+ messages in thread
From: Thomas Graf @ 2015-02-25 17:38 UTC (permalink / raw)
  To: Herbert Xu
  Cc: josh, David Miller, kaber, paulmck, alexei.starovoitov, ying.xue,
	netdev, netfilter-devel

On 02/25/15 at 09:56pm, Herbert Xu wrote:
> On Tue, Feb 24, 2015 at 10:34:09PM +0000, Thomas Graf wrote:
> >
> > There is a side effect. We can't grow the number of bucket locks more
> > than 2x if we grow the table itself faster than 2x. So if we start
> > out with a table size of 512 and grow 4 times in a row we will end
> > up with a theoretical max bucket locks of 4K. Probably enough though.
> 
> Does this limit on bucket locks apply to my rehash scheme?

I don't think so. I assume you get rid of all the old bucket
== new bucket assumptions and introduce some form of nested
locking to account for the fact that entries in an old bucket
may map to an arbitrary number of new buckets.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Ottawa and slow hash-table resize
  2015-02-24 17:50                 ` Thomas Graf
  2015-02-24 18:26                   ` David Miller
@ 2015-02-24 18:33                   ` josh
  1 sibling, 0 replies; 33+ messages in thread
From: josh @ 2015-02-24 18:33 UTC (permalink / raw)
  To: Thomas Graf
  Cc: David Miller, kaber, paulmck, alexei.starovoitov, herbert,
	ying.xue, netdev, netfilter-devel

On Tue, Feb 24, 2015 at 05:50:14PM +0000, Thomas Graf wrote:
> On 02/24/15 at 12:09pm, David Miller wrote:
> > And having a flood of 1 million new TCP connections all at once
> > shouldn't knock us over.
> > 
> > Therefore, we will need to find a way to handle this problem without
> > being able to block on insert.
> 
> One possible way to handle this is to have users like TCP grow
> quicker than 2x. Maybe start with 16x and grow slower and slower
> using a log function. (No, we do not want rhashtable congestion
> control algos ;-)
> 
> > Thinking about this, if inserts occur during a pending resize, if the
> > nelems of the table has exceeded even the grow threshold for the new
> > table, it makes no sense to allow these async inserts as they are
> > going to make the resize take longer and prolong the pain.
> 
> Let's say we start with an initial table size of 16K (we can make
> this system memory depenend) and we grow by 8x. New inserts go
> into the new table immediately so as soon as we have 12K entries
> we'll grow right to 128K buckets. As we grow above 75K we'll start
> growing to 1024K buckets. New entries already go to the 1024K
> buckets at this point given that the first grow cycle should be
> fast. The 2nd grow cycle would take an est 6 RCU grace periods.
> This would also still give us a max of 8K bucket locks which
> should be good enough as well.
> 
> Just thinking this out loud. Still working on this.

I agree.  Client systems should start with the smallest possible table
size and memory usage (just enough for dozens or hundreds of
connections), and possibly never grow past that.  Any system processing
a non-trivial number of connections, however, wants to very quickly grow
to a substantial number of buckets.  The unzipping algorithm works just
fine for any integer growth factor; it just gets a bit more complicated.

One nice thing is that the resize algorithm very quickly allocates the
new buckets and sets up the head pointers such that the new table can be
used for inserts almost immediately, *without* a synchronize_rcu.  Only
the bucket unzipping process takes a non-trivial amount of time
(including one or more synchronize_rcu calls).  And the newly inserted
entries will go directly to the appropriate buckets, so they'll take
advantage of the larger table size.

> > On one hand I like the async resize because it means that an insert
> > that triggers the resize doesn't incur a huge latency spike since
> > it was simply unlucky to be the resize trigger event.  The async
> > resize smoothes out the cost of the resize across the system.
> > 
> > This scheme works really well if, on average, the resize operation
> > completes before enough subsequent inserts occur to exceed even
> > the resized tables resize threshold.
> > 
> > So I think what I'm getting at is that we can allow parallel inserts
> > but only up until the point where the resized tables thresholds are
> > exceeded.
> > 
> > Looking at how to implement this, I think that there is too much
> > configurability to this code.  There is no reason to have indirect
> > calls for the grow decision.  This should be a quick test, but it's
> > not because we go through ->grow_decision.  It should just be
> > rht_grow_above_75 or whatever, and inline this crap!
> > 
> > Nobody even uses this indirection capability, it's therefore over
> > engineered :-)
> 
> Another option is to only call the grow_decision once every N inserts
> or removals (32? 64?) and handle updates as batches.

If we have a means of tracking the number of inserts, we already have
the ability to make the decision, which is just a single comparison.  No
need to batch, since the decision of whether to check would *also*
require a comparison.

I do think this should just use the same growth function everywhere
until a user comes along that needs something different.

- Josh Triplett

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Ottawa and slow hash-table resize
  2015-02-24 17:09               ` David Miller
  2015-02-24 17:50                 ` Thomas Graf
@ 2015-02-25  8:55                 ` Herbert Xu
  2015-02-25 17:38                   ` Thomas Graf
  1 sibling, 1 reply; 33+ messages in thread
From: Herbert Xu @ 2015-02-25  8:55 UTC (permalink / raw)
  To: David Miller
  Cc: kaber, tgraf, paulmck, josh, alexei.starovoitov, ying.xue, netdev,
	netfilter-devel

On Tue, Feb 24, 2015 at 12:09:44PM -0500, David Miller wrote:
>
> So I think what I'm getting at is that we can allow parallel inserts
> but only up until the point where the resized tables thresholds are
> exceeded.

Agreed.  At some point you have to either resize immediately
or fail the insertion.

Note that once my rehash work is complete there is another option
of allowing multiple resizes.  That is, if we reach the limit
while a resize is still ongoing we simply start a new one
immediately.

I'm just coming back form vacation so please be patient and give
me a chance to finish this off.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Ottawa and slow hash-table resize
  2015-02-25  8:55                 ` Herbert Xu
@ 2015-02-25 17:38                   ` Thomas Graf
  0 siblings, 0 replies; 33+ messages in thread
From: Thomas Graf @ 2015-02-25 17:38 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, kaber, paulmck, josh, alexei.starovoitov, ying.xue,
	netdev, netfilter-devel

On 02/25/15 at 09:55pm, Herbert Xu wrote:
> I'm just coming back form vacation so please be patient and give
> me a chance to finish this off.

I'm waiting for your rehash work to complete before I touch
anything else at this point.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Ottawa and slow hash-table resize
  2015-02-23 18:49 Ottawa and slow hash-table resize Paul E. McKenney
  2015-02-23 19:12 ` josh
@ 2015-02-23 21:00 ` Thomas Graf
  2015-02-23 22:35   ` Paul E. McKenney
  2015-02-24  8:59 ` Thomas Graf
  2 siblings, 1 reply; 33+ messages in thread
From: Thomas Graf @ 2015-02-23 21:00 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: alexei.starovoitov, herbert, kaber, davem, ying.xue, netdev,
	netfilter-devel, josh

On 02/23/15 at 10:49am, Paul E. McKenney wrote:
> Hello!
> 
> Alexei mentioned that there was some excitement a couple of weeks ago in
> Ottawa, something about the resizing taking forever when there were large
> numbers of concurrent additions.  One approach comes to mind:
> 
> o	Currently, the hash table does not allow additions concurrently
> 	with resize operations.  One way to allow this would be to
> 	have the addition operations add to the new hash table at the
> 	head of the lists.  This would clearly require also updating the
> 	pointers used to control the unzip operation.

I've already added this. Additions and removals can occur in
parallel to the resize and will go to the head of the new chain.

> o	Count the number of entries added during the resize operation.
> 	Then, at the end of the resize operation, if enough entries have
> 	been added, do a resize, but by multiple factors of two if
> 	need be.
> 
> This should allow the table to take arbitrarily large numbers of updates
> during a resize operation.  There are some other possibilities if this
> approach does not work out.

The main problem is rapid growth of the table on small tables,
e.g. shift 4-6. Going through multiple grow cycles while
thousands of entries are being added will lead to long chains
which will require multiple RCU grace periods per growth and
thus slowing things down.

The bucket locking is designed to ignore the highest order bit
of the hash to make sure that a single bucket lock in the new
double sized table protectes both buckets which map to the 
same bucket in the old table. This simplifies locking a lot and
does not require nested locking. Growing by more than a factor
of two would require to manually lock all buckets to which
entries in the old bucket may map to.

However, we do not want to grow the bucket lock mask
indefinitely so we could for example growth quicker if the
lock mask allows. Needs some more thought but it's definitely
doable and we need to provide users of the hash table with
ways to find a balance according to their needs.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Ottawa and slow hash-table resize
  2015-02-23 21:00 ` Thomas Graf
@ 2015-02-23 22:35   ` Paul E. McKenney
  0 siblings, 0 replies; 33+ messages in thread
From: Paul E. McKenney @ 2015-02-23 22:35 UTC (permalink / raw)
  To: Thomas Graf
  Cc: alexei.starovoitov, herbert, kaber, davem, ying.xue, netdev,
	netfilter-devel, josh

On Mon, Feb 23, 2015 at 09:00:37PM +0000, Thomas Graf wrote:
> On 02/23/15 at 10:49am, Paul E. McKenney wrote:
> > Hello!
> > 
> > Alexei mentioned that there was some excitement a couple of weeks ago in
> > Ottawa, something about the resizing taking forever when there were large
> > numbers of concurrent additions.  One approach comes to mind:
> > 
> > o	Currently, the hash table does not allow additions concurrently
> > 	with resize operations.  One way to allow this would be to
> > 	have the addition operations add to the new hash table at the
> > 	head of the lists.  This would clearly require also updating the
> > 	pointers used to control the unzip operation.
> 
> I've already added this. Additions and removals can occur in
> parallel to the resize and will go to the head of the new chain.

Good!  (I guess I got confused by one of the comments.  Then again,
I was looking at 3.19.)

> > o	Count the number of entries added during the resize operation.
> > 	Then, at the end of the resize operation, if enough entries have
> > 	been added, do a resize, but by multiple factors of two if
> > 	need be.
> > 
> > This should allow the table to take arbitrarily large numbers of updates
> > during a resize operation.  There are some other possibilities if this
> > approach does not work out.
> 
> The main problem is rapid growth of the table on small tables,
> e.g. shift 4-6. Going through multiple grow cycles while
> thousands of entries are being added will lead to long chains
> which will require multiple RCU grace periods per growth and
> thus slowing things down.
> 
> The bucket locking is designed to ignore the highest order bit
> of the hash to make sure that a single bucket lock in the new
> double sized table protectes both buckets which map to the 
> same bucket in the old table. This simplifies locking a lot and
> does not require nested locking. Growing by more than a factor
> of two would require to manually lock all buckets to which
> entries in the old bucket may map to.

Or just ignore the (say) two upper bits if growing by (say) a factor
of four.  (If I understand what you are doing here, anyway.)

> However, we do not want to grow the bucket lock mask
> indefinitely so we could for example growth quicker if the
> lock mask allows. Needs some more thought but it's definitely
> doable and we need to provide users of the hash table with
> ways to find a balance according to their needs.

Indeed, finding the right balance can be tricky!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Ottawa and slow hash-table resize
  2015-02-23 18:49 Ottawa and slow hash-table resize Paul E. McKenney
  2015-02-23 19:12 ` josh
  2015-02-23 21:00 ` Thomas Graf
@ 2015-02-24  8:59 ` Thomas Graf
  2015-02-24  9:38   ` Daniel Borkmann
  2 siblings, 1 reply; 33+ messages in thread
From: Thomas Graf @ 2015-02-24  8:59 UTC (permalink / raw)
  To: Paul E. McKenney, davem
  Cc: alexei.starovoitov, herbert, kaber, ying.xue, netdev,
	netfilter-devel, josh

On 02/23/15 at 10:49am, Paul E. McKenney wrote:
> Hello!
> 
> Alexei mentioned that there was some excitement a couple of weeks ago in
> Ottawa, something about the resizing taking forever when there were large
> numbers of concurrent additions.  One approach comes to mind:

@Dave et al,
Just want to make sure we have the same level of understanding of
urgency for this. The only practical problem experienced so far is
loading n*1M entries into an nft_hash set which takes 3s for 4M
entries upon restore. A regular add is actually fine as it provides
a hint and sizes the table accordingly.

I agree that rhashtable should be improved to better handle many
inserts on small tables but wanted to make sure whether anyone thinks
this needs urgent fixing in 3.20 or whether we can take some time to
fully understand all scenarios and experiment with ideas and then
propose solutions for the next merge window. I also have the TCP hash
tables on my radar while improving this.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Ottawa and slow hash-table resize
  2015-02-24  8:59 ` Thomas Graf
@ 2015-02-24  9:38   ` Daniel Borkmann
  2015-02-24 10:42     ` Patrick McHardy
  0 siblings, 1 reply; 33+ messages in thread
From: Daniel Borkmann @ 2015-02-24  9:38 UTC (permalink / raw)
  To: Thomas Graf, Paul E. McKenney, davem
  Cc: alexei.starovoitov, herbert, kaber, ying.xue, netdev,
	netfilter-devel, josh, johunt

On 02/24/2015 09:59 AM, Thomas Graf wrote:
> On 02/23/15 at 10:49am, Paul E. McKenney wrote:
>> Hello!
>>
>> Alexei mentioned that there was some excitement a couple of weeks ago in
>> Ottawa, something about the resizing taking forever when there were large
>> numbers of concurrent additions.  One approach comes to mind:
>
> @Dave et al,
> Just want to make sure we have the same level of understanding of
> urgency for this. The only practical problem experienced so far is
> loading n*1M entries into an nft_hash set which takes 3s for 4M
> entries upon restore. A regular add is actually fine as it provides
> a hint and sizes the table accordingly.

Btw, there is one remaining bug in nft that Josh Hunt found which
should go into 3.20/4.0, it's not in -net tree, so it could have
affected testing with nft. We're currently missing max_shift
parameter in nft's rhashtable initialization.

This means that there will be _no_ expansions as rht_grow_above_75()
will end up always returning false. It's not that dramatic when you
have a proper hint provided, but when you start out small
(NFT_HASH_ELEMENT_HINT = 3) and try to squeeze 1M entries into it,
it might take a while.

Simplest fix would be, similarly as in other users:

diff --git a/net/netfilter/nft_hash.c b/net/netfilter/nft_hash.c
index 61e6c40..47abdca 100644
--- a/net/netfilter/nft_hash.c
+++ b/net/netfilter/nft_hash.c
@@ -192,6 +192,7 @@ static int nft_hash_init(const struct nft_set *set,
  		.key_offset = offsetof(struct nft_hash_elem, key),
  		.key_len = set->klen,
  		.hashfn = jhash,
+		.max_shift = 20, /* 1M */
  		.grow_decision = rht_grow_above_75,
  		.shrink_decision = rht_shrink_below_30,
  	};

But I presume Josh wanted to resend his code ... or wait for nft
folks to further review?

> I agree that rhashtable should be improved to better handle many
> inserts on small tables but wanted to make sure whether anyone thinks
> this needs urgent fixing in 3.20 or whether we can take some time to
> fully understand all scenarios and experiment with ideas and then
> propose solutions for the next merge window. I also have the TCP hash
> tables on my radar while improving this.

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: Ottawa and slow hash-table resize
  2015-02-24  9:38   ` Daniel Borkmann
@ 2015-02-24 10:42     ` Patrick McHardy
  2015-02-24 16:14       ` Josh Hunt
  0 siblings, 1 reply; 33+ messages in thread
From: Patrick McHardy @ 2015-02-24 10:42 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Thomas Graf, Paul E. McKenney, davem, alexei.starovoitov, herbert,
	ying.xue, netdev, netfilter-devel, josh, johunt

On 24.02, Daniel Borkmann wrote:
> On 02/24/2015 09:59 AM, Thomas Graf wrote:
> >On 02/23/15 at 10:49am, Paul E. McKenney wrote:
> >>Hello!
> >>
> >>Alexei mentioned that there was some excitement a couple of weeks ago in
> >>Ottawa, something about the resizing taking forever when there were large
> >>numbers of concurrent additions.  One approach comes to mind:
> >
> >@Dave et al,
> >Just want to make sure we have the same level of understanding of
> >urgency for this. The only practical problem experienced so far is
> >loading n*1M entries into an nft_hash set which takes 3s for 4M
> >entries upon restore. A regular add is actually fine as it provides
> >a hint and sizes the table accordingly.
> 
> Btw, there is one remaining bug in nft that Josh Hunt found which
> should go into 3.20/4.0, it's not in -net tree, so it could have
> affected testing with nft. We're currently missing max_shift
> parameter in nft's rhashtable initialization.
> 
> This means that there will be _no_ expansions as rht_grow_above_75()
> will end up always returning false. It's not that dramatic when you
> have a proper hint provided, but when you start out small
> (NFT_HASH_ELEMENT_HINT = 3) and try to squeeze 1M entries into it,
> it might take a while.
> 
> Simplest fix would be, similarly as in other users:
> 
> diff --git a/net/netfilter/nft_hash.c b/net/netfilter/nft_hash.c
> index 61e6c40..47abdca 100644
> --- a/net/netfilter/nft_hash.c
> +++ b/net/netfilter/nft_hash.c
> @@ -192,6 +192,7 @@ static int nft_hash_init(const struct nft_set *set,
>  		.key_offset = offsetof(struct nft_hash_elem, key),
>  		.key_len = set->klen,
>  		.hashfn = jhash,
> +		.max_shift = 20, /* 1M */
>  		.grow_decision = rht_grow_above_75,
>  		.shrink_decision = rht_shrink_below_30,
>  	};
> 
> But I presume Josh wanted to resend his code ... or wait for nft
> folks to further review?

We're perfectly fine with that patch, although I'd say lets use a
slightly larger value (24) to cover what we know people are doing
using ipset.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Ottawa and slow hash-table resize
  2015-02-24 10:42     ` Patrick McHardy
@ 2015-02-24 16:14       ` Josh Hunt
  2015-02-24 16:25         ` Patrick McHardy
  0 siblings, 1 reply; 33+ messages in thread
From: Josh Hunt @ 2015-02-24 16:14 UTC (permalink / raw)
  To: Patrick McHardy, Daniel Borkmann
  Cc: Thomas Graf, Paul E. McKenney, davem, alexei.starovoitov, herbert,
	ying.xue, netdev, netfilter-devel, josh

On 02/24/2015 04:42 AM, Patrick McHardy wrote:
> On 24.02, Daniel Borkmann wrote:
>> On 02/24/2015 09:59 AM, Thomas Graf wrote:
>>> On 02/23/15 at 10:49am, Paul E. McKenney wrote:
>>>> Hello!
>>>>
>>>> Alexei mentioned that there was some excitement a couple of weeks ago in
>>>> Ottawa, something about the resizing taking forever when there were large
>>>> numbers of concurrent additions.  One approach comes to mind:
>>>
>>> @Dave et al,
>>> Just want to make sure we have the same level of understanding of
>>> urgency for this. The only practical problem experienced so far is
>>> loading n*1M entries into an nft_hash set which takes 3s for 4M
>>> entries upon restore. A regular add is actually fine as it provides
>>> a hint and sizes the table accordingly.
>>
>> Btw, there is one remaining bug in nft that Josh Hunt found which
>> should go into 3.20/4.0, it's not in -net tree, so it could have
>> affected testing with nft. We're currently missing max_shift
>> parameter in nft's rhashtable initialization.
>>
>> This means that there will be _no_ expansions as rht_grow_above_75()
>> will end up always returning false. It's not that dramatic when you
>> have a proper hint provided, but when you start out small
>> (NFT_HASH_ELEMENT_HINT = 3) and try to squeeze 1M entries into it,
>> it might take a while.
>>
>> Simplest fix would be, similarly as in other users:
>>
>> diff --git a/net/netfilter/nft_hash.c b/net/netfilter/nft_hash.c
>> index 61e6c40..47abdca 100644
>> --- a/net/netfilter/nft_hash.c
>> +++ b/net/netfilter/nft_hash.c
>> @@ -192,6 +192,7 @@ static int nft_hash_init(const struct nft_set *set,
>>   		.key_offset = offsetof(struct nft_hash_elem, key),
>>   		.key_len = set->klen,
>>   		.hashfn = jhash,
>> +		.max_shift = 20, /* 1M */
>>   		.grow_decision = rht_grow_above_75,
>>   		.shrink_decision = rht_shrink_below_30,
>>   	};
>>
>> But I presume Josh wanted to resend his code ... or wait for nft
>> folks to further review?
>
> We're perfectly fine with that patch, although I'd say lets use a
> slightly larger value (24) to cover what we know people are doing
> using ipset.
>

I just sent a patch similar to Daniel's in the set 'nft hash resize 
fixes' using a max_shift value of 24. I still think this value should be 
tunable, but sent the patch to fix the immediate expansion problem for now.

Josh

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Ottawa and slow hash-table resize
  2015-02-24 16:14       ` Josh Hunt
@ 2015-02-24 16:25         ` Patrick McHardy
  2015-02-24 16:57           ` David Miller
  0 siblings, 1 reply; 33+ messages in thread
From: Patrick McHardy @ 2015-02-24 16:25 UTC (permalink / raw)
  To: Josh Hunt
  Cc: Daniel Borkmann, Thomas Graf, Paul E. McKenney, davem,
	alexei.starovoitov, herbert, ying.xue, netdev, netfilter-devel,
	josh

On 24.02, Josh Hunt wrote:
> On 02/24/2015 04:42 AM, Patrick McHardy wrote:
> >On 24.02, Daniel Borkmann wrote:
> >>
> >>Simplest fix would be, similarly as in other users:
> >>
> >>diff --git a/net/netfilter/nft_hash.c b/net/netfilter/nft_hash.c
> >>index 61e6c40..47abdca 100644
> >>--- a/net/netfilter/nft_hash.c
> >>+++ b/net/netfilter/nft_hash.c
> >>@@ -192,6 +192,7 @@ static int nft_hash_init(const struct nft_set *set,
> >>  		.key_offset = offsetof(struct nft_hash_elem, key),
> >>  		.key_len = set->klen,
> >>  		.hashfn = jhash,
> >>+		.max_shift = 20, /* 1M */
> >>  		.grow_decision = rht_grow_above_75,
> >>  		.shrink_decision = rht_shrink_below_30,
> >>  	};
> >>
> >>But I presume Josh wanted to resend his code ... or wait for nft
> >>folks to further review?
> >
> >We're perfectly fine with that patch, although I'd say lets use a
> >slightly larger value (24) to cover what we know people are doing
> >using ipset.
> 
> I just sent a patch similar to Daniel's in the set 'nft hash resize fixes'
> using a max_shift value of 24. I still think this value should be tunable,
> but sent the patch to fix the immediate expansion problem for now.

Thanks. I actually don't think we should require that parameter at
all, any limits shouldn't be imposed by the data structure but by
the code using it. We're perfectly fine using the available memory
as a limit if the users wants this, provided that hash expansion
succeeds and the lookups stay fast.

But for now let's just fix the immediate problem, I agree.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Ottawa and slow hash-table resize
  2015-02-24 16:25         ` Patrick McHardy
@ 2015-02-24 16:57           ` David Miller
  0 siblings, 0 replies; 33+ messages in thread
From: David Miller @ 2015-02-24 16:57 UTC (permalink / raw)
  To: kaber
  Cc: johunt, daniel, tgraf, paulmck, alexei.starovoitov, herbert,
	ying.xue, netdev, netfilter-devel, josh

From: Patrick McHardy <kaber@trash.net>
Date: Tue, 24 Feb 2015 16:25:22 +0000

> Thanks. I actually don't think we should require that parameter at
> all, any limits shouldn't be imposed by the data structure but by
> the code using it. We're perfectly fine using the available memory
> as a limit if the users wants this, provided that hash expansion
> succeeds and the lookups stay fast.
> 
> But for now let's just fix the immediate problem, I agree.

I agree, the behavior when the value is unspecified should be to grow
as needed without any limits.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Ottawa and slow hash-table resize
@ 2015-02-23 22:17 Alexei Starovoitov
  2015-02-23 22:34 ` David Miller
  2015-02-23 22:37 ` Paul E. McKenney
  0 siblings, 2 replies; 33+ messages in thread
From: Alexei Starovoitov @ 2015-02-23 22:17 UTC (permalink / raw)
  To: Paul McKenney
  Cc: Thomas Graf, Josh Triplett, Herbert Xu, Patrick McHardy,
	David S. Miller, ying.xue, netdev@vger.kernel.org,
	netfilter-devel

On Mon, Feb 23, 2015 at 1:52 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
> On Mon, Feb 23, 2015 at 09:03:58PM +0000, Thomas Graf wrote:
>> On 02/23/15 at 11:12am, josh@joshtriplett.org wrote:
>> > In theory, resizes should only take the locks for the buckets they're
>> > currently unzipping, and adds should take those same locks.  Neither one
>> > should take a whole-table lock, other than resize excluding concurrent
>> > resizes.  Is that still insufficient?
>>
>> Correct, this is what happens. The problem is basically that
>> if we insert from atomic context we cannot slow down inserts
>> and the table may not grow quickly enough.
>>
>> > Yeah, the add/remove statistics used for tracking would need some
>> > special handling to avoid being a table-wide bottleneck.
>>
>> Daniel is working on a patch to do per-cpu element counting
>> with a batched update cycle.
>
> One approach is simply to count only when a resize operation is in
> flight.  Another is to keep a per-bucket count, which can be summed
> at the beginning of the next resize operation.

I'm not sure all of these counting optimizations will help at the end.
Say we have a small table. A lot of inserts are coming in at the same
time. rhashtable_expand kicks in and all new inserts are going
into future table, while expansion is happening.
Since expand will kick in quickly the old table will not have long
chains per bucket, so few unzips and corresponding
synchronize_rcu and we're done with expand.
Now future table becomes the only table, but it still has a lot
of entries, since insertions were happening and this table has
long per bucket chains, so next expand will have a lot of
synchronize_rcu and will take very long time.
So whether we count while inserting or not and whether
we grow by 2x or grow by 8x we still have an underlying
problem of very large number of synchronize_rcu.
Malicious user that knows this can stall the whole system.
Please tell me I'm missing something.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Ottawa and slow hash-table resize
  2015-02-23 22:17 Alexei Starovoitov
@ 2015-02-23 22:34 ` David Miller
  2015-02-23 22:37 ` Paul E. McKenney
  1 sibling, 0 replies; 33+ messages in thread
From: David Miller @ 2015-02-23 22:34 UTC (permalink / raw)
  To: alexei.starovoitov
  Cc: paulmck, tgraf, josh, herbert, kaber, ying.xue, netdev,
	netfilter-devel

From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Date: Mon, 23 Feb 2015 14:17:06 -0800

> I'm not sure all of these counting optimizations will help at the end.
> Say we have a small table. A lot of inserts are coming in at the same
> time. rhashtable_expand kicks in and all new inserts are going
> into future table, while expansion is happening.
> Since expand will kick in quickly the old table will not have long
> chains per bucket, so few unzips and corresponding
> synchronize_rcu and we're done with expand.
> Now future table becomes the only table, but it still has a lot
> of entries, since insertions were happening and this table has
> long per bucket chains, so next expand will have a lot of
> synchronize_rcu and will take very long time.
> So whether we count while inserting or not and whether
> we grow by 2x or grow by 8x we still have an underlying
> problem of very large number of synchronize_rcu.
> Malicious user that knows this can stall the whole system.
> Please tell me I'm missing something.

This is why I have just suggested that we make inserts block,
and the expander looks at the count of pending inserts in an
effort to keep expanding the table further if necessary before
releasing the blocked insertion threads.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Ottawa and slow hash-table resize
  2015-02-23 22:17 Alexei Starovoitov
  2015-02-23 22:34 ` David Miller
@ 2015-02-23 22:37 ` Paul E. McKenney
  1 sibling, 0 replies; 33+ messages in thread
From: Paul E. McKenney @ 2015-02-23 22:37 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Thomas Graf, Josh Triplett, Herbert Xu, Patrick McHardy,
	David S. Miller, ying.xue, netdev@vger.kernel.org,
	netfilter-devel

On Mon, Feb 23, 2015 at 02:17:06PM -0800, Alexei Starovoitov wrote:
> On Mon, Feb 23, 2015 at 1:52 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> > On Mon, Feb 23, 2015 at 09:03:58PM +0000, Thomas Graf wrote:
> >> On 02/23/15 at 11:12am, josh@joshtriplett.org wrote:
> >> > In theory, resizes should only take the locks for the buckets they're
> >> > currently unzipping, and adds should take those same locks.  Neither one
> >> > should take a whole-table lock, other than resize excluding concurrent
> >> > resizes.  Is that still insufficient?
> >>
> >> Correct, this is what happens. The problem is basically that
> >> if we insert from atomic context we cannot slow down inserts
> >> and the table may not grow quickly enough.
> >>
> >> > Yeah, the add/remove statistics used for tracking would need some
> >> > special handling to avoid being a table-wide bottleneck.
> >>
> >> Daniel is working on a patch to do per-cpu element counting
> >> with a batched update cycle.
> >
> > One approach is simply to count only when a resize operation is in
> > flight.  Another is to keep a per-bucket count, which can be summed
> > at the beginning of the next resize operation.
> 
> I'm not sure all of these counting optimizations will help at the end.
> Say we have a small table. A lot of inserts are coming in at the same
> time. rhashtable_expand kicks in and all new inserts are going
> into future table, while expansion is happening.
> Since expand will kick in quickly the old table will not have long
> chains per bucket, so few unzips and corresponding
> synchronize_rcu and we're done with expand.
> Now future table becomes the only table, but it still has a lot
> of entries, since insertions were happening and this table has
> long per bucket chains, so next expand will have a lot of
> synchronize_rcu and will take very long time.
> So whether we count while inserting or not and whether
> we grow by 2x or grow by 8x we still have an underlying
> problem of very large number of synchronize_rcu.
> Malicious user that knows this can stall the whole system.
> Please tell me I'm missing something.

It is quite possible that you are missing nothing, but the hope is
that by growing by larger factors, we reduce the lengths of the chains
in the new table, thus limiting the slowdown on subsequent resizes.

If worst comes to worst, of course, there are a couple of other
RCU protected resizable hash tables.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Ottawa and slow hash-table resize
@ 2015-02-23 23:07 Alexei Starovoitov
  2015-02-23 23:15 ` David Miller
  0 siblings, 1 reply; 33+ messages in thread
From: Alexei Starovoitov @ 2015-02-23 23:07 UTC (permalink / raw)
  To: David Miller
  Cc: Paul McKenney, Thomas Graf, Josh Triplett, Herbert Xu,
	Patrick McHardy, ying.xue, netdev@vger.kernel.org,
	netfilter-devel

On Mon, Feb 23, 2015 at 2:34 PM, David Miller <davem@davemloft.net> wrote:
> From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
> Date: Mon, 23 Feb 2015 14:17:06 -0800
>
>> I'm not sure all of these counting optimizations will help at the end.
>> Say we have a small table. A lot of inserts are coming in at the same
>> time. rhashtable_expand kicks in and all new inserts are going
>> into future table, while expansion is happening.
>> Since expand will kick in quickly the old table will not have long
>> chains per bucket, so few unzips and corresponding
>> synchronize_rcu and we're done with expand.
>> Now future table becomes the only table, but it still has a lot
>> of entries, since insertions were happening and this table has
>> long per bucket chains, so next expand will have a lot of
>> synchronize_rcu and will take very long time.
>> So whether we count while inserting or not and whether
>> we grow by 2x or grow by 8x we still have an underlying
>> problem of very large number of synchronize_rcu.
>> Malicious user that knows this can stall the whole system.
>> Please tell me I'm missing something.
>
> This is why I have just suggested that we make inserts block,
> and the expander looks at the count of pending inserts in an
> effort to keep expanding the table further if necessary before
> releasing the blocked insertion threads.

yes. blocking inserts would solve the problem, but it will make
rhashtable applicability to be limited.
Also 'count of pending inserts' is going to be very small and
meaningless, since it will count the number of threads that
are sleeping on insert and not very useful to predict future
expansions.
As an alternative we can store two chains of pointers
per element (one chain for old table and another chain
for future table). That will avoid all unzipping logic at the cost
of extra memory. May be there are other solutions.
I think as a minimum something like 'blocking inserts' are
needed for stable. imo the current situation with nft_hash
is not just annoying, it's a bug.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Ottawa and slow hash-table resize
  2015-02-23 23:07 Alexei Starovoitov
@ 2015-02-23 23:15 ` David Miller
  0 siblings, 0 replies; 33+ messages in thread
From: David Miller @ 2015-02-23 23:15 UTC (permalink / raw)
  To: alexei.starovoitov
  Cc: paulmck, tgraf, josh, herbert, kaber, ying.xue, netdev,
	netfilter-devel

From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Date: Mon, 23 Feb 2015 15:07:20 -0800

> Also 'count of pending inserts' is going to be very small and
> meaningless, since it will count the number of threads that
> are sleeping on insert and not very useful to predict future
> expansions.

Good point.

A way around this would be to allow batching.  So netfilter would
do something like:

	set->ops->batch_insert_start();
	nla_for_each_nested(attr, nla[NFTA_SET_ELEM_LIST_ELEMENTS], rem) {
		err = nft_add_set_elem(&ctx, set, attr);
		if (err < 0)
			break;

		set->nelems++;
	}
	set->ops->batch_insert_end();

And batch_insert_start and batch_insert_end would point to something
that calls new routines rhashtable_batch_insert_start() and
rhashtable_batch_insert_end().

Inside of this sequence, rhashtable_insert() calls queue to a list if
a table grow is in progress.

The pending list is processed by both rhashtable_batch_insert_end()
and table grow completion.

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2015-02-25 17:38 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-02-23 18:49 Ottawa and slow hash-table resize Paul E. McKenney
2015-02-23 19:12 ` josh
2015-02-23 21:03   ` Thomas Graf
2015-02-23 21:52     ` Paul E. McKenney
2015-02-23 22:32       ` David Miller
2015-02-23 23:06         ` Paul E. McKenney
2015-02-24  8:37           ` Thomas Graf
2015-02-24 10:39             ` Patrick McHardy
2015-02-24 10:46               ` David Laight
2015-02-24 10:48                 ` Patrick McHardy
2015-02-24 17:09               ` David Miller
2015-02-24 17:50                 ` Thomas Graf
2015-02-24 18:26                   ` David Miller
2015-02-24 18:45                     ` josh
2015-02-24 22:34                       ` Thomas Graf
2015-02-25  8:56                         ` Herbert Xu
2015-02-25 17:38                           ` Thomas Graf
2015-02-24 18:33                   ` josh
2015-02-25  8:55                 ` Herbert Xu
2015-02-25 17:38                   ` Thomas Graf
2015-02-23 21:00 ` Thomas Graf
2015-02-23 22:35   ` Paul E. McKenney
2015-02-24  8:59 ` Thomas Graf
2015-02-24  9:38   ` Daniel Borkmann
2015-02-24 10:42     ` Patrick McHardy
2015-02-24 16:14       ` Josh Hunt
2015-02-24 16:25         ` Patrick McHardy
2015-02-24 16:57           ` David Miller
  -- strict thread matches above, loose matches on Subject: below --
2015-02-23 22:17 Alexei Starovoitov
2015-02-23 22:34 ` David Miller
2015-02-23 22:37 ` Paul E. McKenney
2015-02-23 23:07 Alexei Starovoitov
2015-02-23 23:15 ` David Miller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).