Re: RCU latency regression in 2.6.16-rc1

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Paul E. McKenney" <paulmck@us.ibm.com>
To: Eric Dumazet <dada1@cosmosbay.com>
Cc: dipankar@in.ibm.com, Lee Revell <rlrevell@joe-job.com>,
	Ingo Molnar <mingo@elte.hu>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	Linus Torvalds <torvalds@osdl.org>
Subject: Re: RCU latency regression in 2.6.16-rc1
Date: Sun, 29 Jan 2006 21:11:56 -0800	[thread overview]
Message-ID: <20060130051156.GK16585@us.ibm.com> (raw)
In-Reply-To: <43DD9C49.4000000@cosmosbay.com>

On Mon, Jan 30, 2006 at 05:55:37AM +0100, Eric Dumazet wrote:
> Paul E. McKenney a écrit :
> >On Sat, Jan 28, 2006 at 08:52:02PM +0100, Eric Dumazet wrote:

[ . . . ]

> >>Some machines have millions of entries in their route cache.
> >>
> >>I suspect we cannot queue all them (or only hash heads as your previous 
> >>patch) by RCU. Latencies and/or OOM can occur.
> >>
> >>What can be done is :
> >>
> >>in rt_run_flush(), allocate a new empty hash table, and exchange the hash 
> >>tables.
> >>
> >>Then wait a quiescent/grace RCU period (may be the exact term is not this 
> >>one, sorry, I'm not RCU expert)
> >>
> >>Then free all the entries from the old hash table (direclty of course, no 
> >>need for RCU grace period), and free the hash table.
> >>
> >>As the hash table can be huge, we might need allocate it at boot time, 
> >>just in case a flush is needed (it usually is :) ). If we choose dynamic 
> >>allocation and this allocation fails, then fallback to what is done today.
> >
> >Interesting approach!
> >
> >If I remember correctly, the point of all of this is to perturb the hash
> >function periodically in order to avoid DoS attacks.  It will likely
> >be necessary to avoid a big performance hit during the transition.
> >One way of doing this, given your two-table scheme, would be to:
> >
> >o	Allocate both tables at boot time, as you suggest above.
> >
> >o	Keep the following additional state:
> >
> >	o	Pointer to the table that is the current table.
> >
> >	o	First valid index (fvl) into the current table -- all
> >		indexes below the fvl correspond to hash buckets that
> >		have been transferred into the non-current table.
> >		In the normal case where the tables are not being
> >		switched, fvl==-1.
> >
> >		(To make the RCU searches work without requiring
> >		tons of explicit memory barriers, there needs to
> >		be a separate fvl for each of the tables.)
> >
> >	o	Parameters defining the hash functions for the current
> >		table and for the non-current table.
> >
> >o	When it is time to switch tables, start removing the entries
> >	in hash bucket #fvl of the current table.  Optionally put them
> >	into the non-current table (or just let them be added as they
> >	are needed.  Only remove a limited number of entries (or,
> >	alternatively, stop removing them after a limited amount of
> >	time).
> >
> >	When the current hash bucket has been completely emptied,
> >	increment fvl, and, if we have not already hit the limit,
> >	continue on the new hash bucket.
> >
> >	When fvl runs off the end of the table, you are done with
> >	the switch.  Update the pointer to reference the other
> >	table.  Important -- do -not- start another switch until
> >	a grace period has elapsed!!!  Otherwise, you will end
> >	up fatally confusing slow readers.
> >
> >o	When searching, if the hash function gives a value less
> >	than fvl, search the non-current table.
> >
> >	If the hash function gives a value equal to fvl, search
> >	the current table, and, if not found, search the non-current
> >	table.
> >
> >	If the hash function gives a value greater than fvl, search
> >	only the current table.  (It may also be necessary to search
> >	the non-current table to allow for races with fvl update.)
> >
> >Does this seem reasonable?
> >
> >						Thanx, Paul
> 
> Well, if as a bonus we are able to expand the size of the hash table, it 
> could be very very good : As of today, the boot time sizing of this hash 
> table is somewhat problematic.
> 
> If the size is expanded by a 2 factor (or a power of too), can your 
> proposal works ?

Yep!!!

Add the following:

o	Add a size variable for each of the tables.  It works best
	if the per-table state is stored with the table itself, for
	example:

	struct hashtbl {
		int size;
		int fvl;
		struct hash_param params;
		struct list_head buckets[0];
	};

o	When switching tables, allocate a new one of the desired size
	and free up the non-current one.  (But remember to wait at least
	one grace period after the last switch before starting this!!!)

o	Compute hash parameters suitable for the new table size.

o	Continue as before.

Note that you are not restricted to power-of-two expansion -- the
hash parameters should handle any desired difference, and in fact
handle contraction as well as expansion.

						Thanx, Paul

next prev parent reply	other threads:[~2006-01-30  5:12 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2006-01-24  7:52 RCU latency regression in 2.6.16-rc1 Lee Revell
2006-01-24  7:56 ` Ingo Molnar
2006-01-24  7:58   ` Lee Revell
2006-01-24  8:01     ` Ingo Molnar
2006-01-24  8:03       ` Lee Revell
2006-01-24  8:11         ` Ingo Molnar
2006-01-24  8:07       ` Lee Revell
2006-01-24  8:13         ` Ingo Molnar
2006-01-24  8:15           ` Lee Revell
2006-01-24  9:17             ` Paul E. McKenney
2006-01-24  9:23               ` Ingo Molnar
2006-01-24  9:44                 ` Lee Revell
2006-01-24 16:28                   ` Dipankar Sarma
2006-01-24 21:38                     ` Dipankar Sarma
2006-01-25 21:28                       ` Lee Revell
2006-01-25 22:56                         ` Ingo Molnar
2006-01-25 23:13                           ` Lee Revell
2006-01-26 19:18                         ` Paul E. McKenney
2006-01-27 18:55                           ` Lee Revell
2006-01-28 17:03                             ` Dipankar Sarma
2006-01-28 18:00                               ` Lee Revell
2006-01-28 18:51                                 ` Lee Revell
2006-01-28 19:34                                   ` Dipankar Sarma
2006-01-28 19:46                                     ` Lee Revell
2006-01-28 19:52                                     ` Eric Dumazet
2006-01-29  7:38                                       ` Lee Revell
2006-01-29  7:51                                         ` Ingo Molnar
2006-01-29  8:21                                           ` Lee Revell
2006-01-30  4:36                                       ` Paul E. McKenney
2006-01-30  4:55                                         ` Eric Dumazet
2006-01-30  5:11                                           ` Paul E. McKenney [this message]
2006-01-30  5:52                                             ` David S. Miller
2006-01-30 10:00                                               ` Paul E. McKenney
2006-02-12  0:45                                             ` Lee Revell
2006-01-24 16:57 ` Dipankar Sarma

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20060130051156.GK16585@us.ibm.com \
    --to=paulmck@us.ibm.com \
    --cc=dada1@cosmosbay.com \
    --cc=dipankar@in.ibm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@elte.hu \
    --cc=rlrevell@joe-job.com \
    --cc=torvalds@osdl.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.