netdev.vger.kernel.org archive mirror
From: Jarek Poplawski <jarkao2@gmail.com>
To: Eric Dumazet <dada1@cosmosbay.com>
Cc: David Miller <davem@davemloft.net>,
	nhorman@tuxdriver.com, lav@yar.ru,
	shemminger@linux-foundation.org, netdev@vger.kernel.org
Subject: Re: [PATCH] net: fix rtable leak in net/ipv4/route.c
Date: Wed, 20 May 2009 10:03:18 +0000
Message-ID: <20090520100318.GA5789@ff.dom.local>
In-Reply-To: <4A139FC4.7030309@cosmosbay.com>

On Wed, May 20, 2009 at 08:14:28AM +0200, Eric Dumazet wrote:
> Eric Dumazet wrote:
> > David Miller wrote:
> >> From: Neil Horman <nhorman@tuxdriver.com>
> >> Date: Tue, 19 May 2009 15:24:50 -0400
> >>
> >>>> Moving whole group in front would defeat the purpose of move, actually,
> >>>> since rank in chain is used to decay the timeout in garbage collector.
> >>>> (search for tmo >>= 1; )
> >>>>
> >>> Argh, so the list is implicitly ordered by expiration time.  That
> >>> really defeats the entire purpose of doing grouping in the list at
> >>> all.  If that's the case, then I agree, it's probably better to
> >>> take the additional visitation hit in check_expire above than to
> >>> try and preserve ordering.
> >> Yes, this seems best.
> >>
> >> I was worried that somehow the ordering also influences lookups,
> >> because the TOS bits don't go into the hash so I worried that it would
> >> be important that explicit TOS values appear before wildcard ones.
> >> But it doesn't appear that this is an issue, we don't have wildcard
> >> TOSs in the rtable entries, they are always explicit.
> >>
> >> So I would like to see an explicit final patch from Eric so we can get
> >> this fixed now.
> >>
> > 
> > I would like to split the patches because we do have two bugs, and
> > I would prefer both problems to get attention; I don't remember Neil
> > acknowledging the length computation problem.
> > 
> > First and small patch, candidate for net-2.6 and stable (for 2.6.29) :
> > 
> 
> Then here is the patch on top of the previous one.
> 
> Thanks to all
> 
> [PATCH] net: fix rtable leak in net/ipv4/route.c
> 
> Alexander V. Lukyanov found a regression in 2.6.29 and made a complete
> analysis found in http://bugzilla.kernel.org/show_bug.cgi?id=13339
> Quoted here because it's a perfect one:
> 
> begin_of_quotation 
>  2.6.29 patch has introduced flexible route cache rebuilding. Unfortunately the
>  patch has at least one critical flaw, and another problem.
> 
>  rt_intern_hash calculates rthi pointer, which is later used for new entry
>  insertion. The same loop calculates cand pointer which is used to clean the
>  list. If the pointers are the same, rtable leak occurs, as first the cand is
>  removed then the new entry is appended to it.
> 
>  This leak leads to unregister_netdevice problem (usage count > 0).
> 
>  Another problem of the patch is that it tries to insert the entries in certain
>  order, to facilitate counting of entries distinct by all but QoS parameters.
>  Unfortunately, referencing an existing rtable entry moves it to list beginning,
>  to speed up further lookups, so the carefully built order is destroyed.
> 
>  For the first problem the simplest patch is to set rthi=0 when rthi==cand, but
>  it will also destroy the ordering.
> end_of_quotation
> 
> 
> Problematic commit is 1080d709fb9d8cd4392f93476ee46a9d6ea05a5b
> (net: implement emergency route cache rebulds when gc_elasticity is exceeded)
> 
> Trying to keep dst_entries ordered is too complex and conflicts with the
> requirement that order depend on frequency of use for garbage collection.
> 
> A possible fix is to make rt_intern_hash() simpler, and only make
> rt_check_expire() a little bit smarter, able to cope with an arbitrary
> entry order. The added loop runs on cache-hot data while the CPU
> is prefetching the next object, so it should go unnoticed.
> 
> Reported-and-analyzed-by: Alexander V. Lukyanov <lav@yar.ru>
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> ---
>  net/ipv4/route.c |   55 +++++++++++++--------------------------------
>  1 files changed, 17 insertions(+), 38 deletions(-)
> 
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index 869cf1c..28205e5 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -784,7 +784,7 @@ static void rt_check_expire(void)
>  {
>  	static unsigned int rover;
>  	unsigned int i = rover, goal;
> -	struct rtable *rth, **rthp;
> +	struct rtable *rth, *aux, **rthp;
>  	unsigned long samples = 0;
>  	unsigned long sum = 0, sum2 = 0;
>  	u64 mult;
> @@ -812,6 +812,7 @@ static void rt_check_expire(void)
>  		length = 0;
>  		spin_lock_bh(rt_hash_lock_addr(i));
>  		while ((rth = *rthp) != NULL) {
> +			prefetch(rth->u.dst.rt_next);
>  			if (rt_is_expired(rth)) {
>  				*rthp = rth->u.dst.rt_next;
>  				rt_free(rth);
> @@ -820,33 +821,30 @@ static void rt_check_expire(void)
>  			if (rth->u.dst.expires) {
>  				/* Entry is expired even if it is in use */
>  				if (time_before_eq(jiffies, rth->u.dst.expires)) {
> +nofree:
>  					tmo >>= 1;
>  					rthp = &rth->u.dst.rt_next;
>  					/*
> -					 * Only bump our length if the hash
> -					 * inputs on entries n and n+1 are not
> -					 * the same, we only count entries on
> +					 * We only count entries on
>  					 * a chain with equal hash inputs once
>  					 * so that entries for different QOS
>  					 * levels, and other non-hash input
>  					 * attributes don't unfairly skew
>  					 * the length computation
>  					 */
> -					if ((*rthp == NULL) ||
> -					    !compare_hash_inputs(&(*rthp)->fl,
> -								 &rth->fl))
> -						length += ONE;
> +					for (aux = rt_hash_table[i].chain;;) {
> +						if (aux == rth) {
> +							length += ONE;
> +							break;
> +						}
> +						if (compare_hash_inputs(&aux->fl, &rth->fl))
> +							break;
> +						aux = aux->u.dst.rt_next;
> +					}

Very "interesting" for() usage, but isn't it more readable like this?:

					aux = rt_hash_table[i].chain;
					while (aux != rth) {
						if (compare_hash_inputs(&aux->fl, &rth->fl))
							break;
						aux = aux->u.dst.rt_next;
					}

					if (aux == rth)
						length += ONE;

Jarek P.

>  					continue;
>  				}
> -			} else if (!rt_may_expire(rth, tmo, ip_rt_gc_timeout)) {
> -				tmo >>= 1;
> -				rthp = &rth->u.dst.rt_next;
> -				if ((*rthp == NULL) ||
> -				    !compare_hash_inputs(&(*rthp)->fl,
> -							 &rth->fl))
> -					length += ONE;
> -				continue;
> -			}
> +			} else if (!rt_may_expire(rth, tmo, ip_rt_gc_timeout))
> +				goto nofree;
>  
>  			/* Cleanup aged off entries. */
>  			*rthp = rth->u.dst.rt_next;
> @@ -1069,7 +1067,6 @@ out:	return 0;
>  static int rt_intern_hash(unsigned hash, struct rtable *rt, struct rtable **rp)
>  {
>  	struct rtable	*rth, **rthp;
> -	struct rtable	*rthi;
>  	unsigned long	now;
>  	struct rtable *cand, **candp;
>  	u32 		min_score;
> @@ -1089,7 +1086,6 @@ restart:
>  	}
>  
>  	rthp = &rt_hash_table[hash].chain;
> -	rthi = NULL;
>  
>  	spin_lock_bh(rt_hash_lock_addr(hash));
>  	while ((rth = *rthp) != NULL) {
> @@ -1135,17 +1131,6 @@ restart:
>  		chain_length++;
>  
>  		rthp = &rth->u.dst.rt_next;
> -
> -		/*
> -		 * check to see if the next entry in the chain
> -		 * contains the same hash input values as rt.  If it does
> -		 * This is where we will insert into the list, instead of
> -		 * at the head.  This groups entries that differ by aspects not
> -		 * relvant to the hash function together, which we use to adjust
> -		 * our chain length
> -		 */
> -		if (*rthp && compare_hash_inputs(&(*rthp)->fl, &rt->fl))
> -			rthi = rth;
>  	}
>  
>  	if (cand) {
> @@ -1206,10 +1191,7 @@ restart:
>  		}
>  	}
>  
> -	if (rthi)
> -		rt->u.dst.rt_next = rthi->u.dst.rt_next;
> -	else
> -		rt->u.dst.rt_next = rt_hash_table[hash].chain;
> +	rt->u.dst.rt_next = rt_hash_table[hash].chain;
>  
>  #if RT_CACHE_DEBUG >= 2
>  	if (rt->u.dst.rt_next) {
> @@ -1225,10 +1207,7 @@ restart:
>  	 * previous writes to rt are comitted to memory
>  	 * before making rt visible to other CPUS.
>  	 */
> -	if (rthi)
> -		rcu_assign_pointer(rthi->u.dst.rt_next, rt);
> -	else
> -		rcu_assign_pointer(rt_hash_table[hash].chain, rt);
> +	rcu_assign_pointer(rt_hash_table[hash].chain, rt);
>  
>  	spin_unlock_bh(rt_hash_lock_addr(hash));
>  	*rp = rt;
> 
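
For readers following the loop discussion above, here is a minimal user-space
sketch (not kernel code) of the counting rule both variants implement: an
entry adds to the chain length only if no earlier entry on the chain has the
same hash inputs, so the result does not depend on chain order. The names
struct entry, key, tos, counts_once_for(), counts_once_while() and
distinct_length() are hypothetical stand-ins; key plays the role of the
fields compared by compare_hash_inputs(), and a plain 1 replaces the ONE
increment used in route.c.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct rtable: only what the sketch needs. */
struct entry {
	int key;		/* stands in for the hash inputs */
	int tos;		/* stands in for a non-hash attribute (e.g. QoS) */
	struct entry *next;
};

/*
 * for(;;) form, shaped like the loop in the patch.
 * Assumes rth is on the chain starting at head, so aux never runs off
 * the end of the list before reaching rth.
 */
static int counts_once_for(const struct entry *head, const struct entry *rth)
{
	const struct entry *aux;

	for (aux = head;;) {
		if (aux == rth)
			return 1;	/* no earlier entry shares the key */
		if (aux->key == rth->key)
			return 0;	/* an earlier entry was already counted */
		aux = aux->next;
	}
}

/* while form, shaped like the alternative suggested in the reply. */
static int counts_once_while(const struct entry *head, const struct entry *rth)
{
	const struct entry *aux = head;

	while (aux != rth) {
		if (aux->key == rth->key)
			break;
		aux = aux->next;
	}
	return aux == rth;
}

/* Chain length with each distinct key counted once, whatever the order. */
static int distinct_length(const struct entry *head, int use_while)
{
	const struct entry *rth;
	int length = 0;

	for (rth = head; rth != NULL; rth = rth->next)
		length += use_while ? counts_once_while(head, rth)
				    : counts_once_for(head, rth);
	return length;
}
```

Both helpers walk from the chain head for every entry, which makes the count
quadratic in chain length; the sketch only shows that the two loop shapes
agree, not how expensive this is on real chains.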

