All of lore.kernel.org
 help / color / mirror / Atom feed
From: Eric Dumazet <dada1@cosmosbay.com>
To: Neil Horman <nhorman@tuxdriver.com>
Cc: Jarek Poplawski <jarkao2@gmail.com>,
	lav@yar.ru, Stephen Hemminger <shemminger@linux-foundation.org>,
	netdev@vger.kernel.org
Subject: Re: Fw: [Bug 13339] New: rtable leak in ipv4/route.c
Date: Tue, 19 May 2009 20:47:54 +0200	[thread overview]
Message-ID: <4A12FEDA.7040806@cosmosbay.com> (raw)
In-Reply-To: <20090519162048.GB28034@hmsreliant.think-freely.org>

Neil Horman a écrit :
> On Tue, May 19, 2009 at 05:32:29PM +0200, Eric Dumazet wrote:
>> Jarek Poplawski a écrit :
>>> On 19-05-2009 04:35, Stephen Hemminger wrote:
>>>> Begin forwarded message:
>>>>
>>>> Date: Mon, 18 May 2009 14:10:20 GMT
>>>> From: bugzilla-daemon@bugzilla.kernel.org
>>>> To: shemminger@linux-foundation.org
>>>> Subject: [Bug 13339] New: rtable leak in ipv4/route.c
>>>>
>>>>
>>>> http://bugzilla.kernel.org/show_bug.cgi?id=13339
>>> ...
>>>> 2.6.29 patch has introduced flexible route cache rebuilding. Unfortunately the
>>>> patch has at least one critical flaw, and another problem.
>>>>
>>>> rt_intern_hash calculates rthi pointer, which is later used for new entry
>>>> insertion. The same loop calculates cand pointer which is used to clean the
>>>> list. If the pointers are the same, rtable leak occurs, as first the cand is
>>>> removed then the new entry is appended to it.
>>>>
>>>> This leak leads to unregister_netdevice problem (usage count > 0).
>>>>
>>>> Another problem of the patch is that it tries to insert the entries in certain
>>>> order, to facilitate counting of entries distinct by all but QoS parameters.
>>>> Unfortunately, referencing an existing rtable entry moves it to list beginning,
>>>> to speed up further lookups, so the carefully built order is destroyed.
>> We could change rt_check_expire() to be smarter and handle any order in chains.
>>
>> This would let rt_intern_hash() be simpler.
>>
>> As its a more performance critical path, all would be good :)
>>
>>>> For the first problem the simplest patch it to set rthi=0 when rthi==cand, but
>>>> it will also destroy the ordering.
>>> I think fixing this bug fast is more important than this
>>> ordering or counting. Could you send your patch proposal?
>>>
> 
> I was thinking something along these lines (also completely untested).  The
> extra check in rt_intern hash prevents us from identifying a 'group leader' that
> is about to be deleted, and sets rthi to point at the group leader rather than
> the last entry in the group as it previously did, which we then mark with the
> new flag in the dst_entry (updating it on rt_free of course).  This lets us
> identify the boundaries of a group, so when we do a move to front heruistic, we
> can move the entire group rather than just the single entry.    I prefer this to
> Erics approach since, even though rt_check_expire isn't in a performance
> critical path, the addition of the extra list search when the list is unordered
> makes rt_check_expire run in O(n^2) rather than O(n), which I think will have a
> significant impact on high traffic systems.  Since the ordered list creation was
> done at almost zero performance cost in rt_intern_hash, I'd like to keep it if
> at all possible.
> 
> There is of course a shortcomming in my patch in that it doesn't actually do
> anything currently with DST_GRPLDR other than set/clear it appropriately, I've
> yet to identify where the route cache move to front heruistic is implemented.
> If someone could show me where that is, I'd be happy to finish this patch.
> 

O(n^2) ? Not exactly ...

N * (Y + x)  instead of N * (Y),  where x <= Y/100, and we can make x being 0
in fact with proper prefetching.

Real cost is bringing into CPU cache the information needed, by two
orders of magnitude. (one cache line miss : more than 200 cycles,
and we need at least two cache lines per rtable.)

Adding 'group' notion to dst entries is overengineering an already
too complex ipv4/route.c file, that is almost unreadable these days,
since even you can not spot the point where we move the entry at chain
front.

Hint : search for /* Put it first */ in rt_intern_hash()

Moving whole group in front would defeat the purpose of move, actually,
since rank in chain is used to decay the timeout in garbage collector.
(search for tmo >>= 1; )


Here is is a revised patch, doing the length computation inside
the loop, while cpu is prefeching next entry into its cache.

BTW, we actually have another bug, since length is set to 0 at the wrong place
(before the for (; goal > 0; goal--) loop)

We must of course reset length to 0 for each chain.

All credits for Alexander V. Lukyanov for doing the analysis of course :)


 net/ipv4/route.c |   57 ++++++++++++++-------------------------------
 1 files changed, 18 insertions(+), 39 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index c4c60e9..769f72e 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -784,7 +784,7 @@ static void rt_check_expire(void)
 {
 	static unsigned int rover;
 	unsigned int i = rover, goal;
-	struct rtable *rth, **rthp;
+	struct rtable *rth, *aux, **rthp;
 	unsigned long length = 0, samples = 0;
 	unsigned long sum = 0, sum2 = 0;
 	u64 mult;
@@ -795,7 +795,6 @@ static void rt_check_expire(void)
 	goal = (unsigned int)mult;
 	if (goal > rt_hash_mask)
 		goal = rt_hash_mask + 1;
-	length = 0;
 	for (; goal > 0; goal--) {
 		unsigned long tmo = ip_rt_gc_timeout;
 
@@ -809,8 +808,10 @@ static void rt_check_expire(void)
 
 		if (*rthp == NULL)
 			continue;
+		length = 0;
 		spin_lock_bh(rt_hash_lock_addr(i));
 		while ((rth = *rthp) != NULL) {
+			prefetch(rth->u.dst.rt_next);
 			if (rt_is_expired(rth)) {
 				*rthp = rth->u.dst.rt_next;
 				rt_free(rth);
@@ -819,33 +820,30 @@ static void rt_check_expire(void)
 			if (rth->u.dst.expires) {
 				/* Entry is expired even if it is in use */
 				if (time_before_eq(jiffies, rth->u.dst.expires)) {
+nofree:
 					tmo >>= 1;
 					rthp = &rth->u.dst.rt_next;
 					/*
-					 * Only bump our length if the hash
-					 * inputs on entries n and n+1 are not
-					 * the same, we only count entries on
+					 * We only count entries on
 					 * a chain with equal hash inputs once
 					 * so that entries for different QOS
 					 * levels, and other non-hash input
 					 * attributes don't unfairly skew
 					 * the length computation
 					 */
-					if ((*rthp == NULL) ||
-					    !compare_hash_inputs(&(*rthp)->fl,
-								 &rth->fl))
-						length += ONE;
+					for (aux = rt_hash_table[i].chain;;) {
+						if (aux == rth) {
+							length += ONE;
+							break;
+						}
+						if (compare_hash_inputs(&aux->fl, &rth->fl))
+							break;
+						aux = aux->u.dst.rt_next;
+					}
 					continue;
 				}
-			} else if (!rt_may_expire(rth, tmo, ip_rt_gc_timeout)) {
-				tmo >>= 1;
-				rthp = &rth->u.dst.rt_next;
-				if ((*rthp == NULL) ||
-				    !compare_hash_inputs(&(*rthp)->fl,
-							 &rth->fl))
-					length += ONE;
-				continue;
-			}
+			} else if (!rt_may_expire(rth, tmo, ip_rt_gc_timeout))
+				goto nofree;
 
 			/* Cleanup aged off entries. */
 			*rthp = rth->u.dst.rt_next;
@@ -1068,7 +1066,6 @@ out:	return 0;
 static int rt_intern_hash(unsigned hash, struct rtable *rt, struct rtable **rp)
 {
 	struct rtable	*rth, **rthp;
-	struct rtable	*rthi;
 	unsigned long	now;
 	struct rtable *cand, **candp;
 	u32 		min_score;
@@ -1088,7 +1085,6 @@ restart:
 	}
 
 	rthp = &rt_hash_table[hash].chain;
-	rthi = NULL;
 
 	spin_lock_bh(rt_hash_lock_addr(hash));
 	while ((rth = *rthp) != NULL) {
@@ -1134,17 +1130,6 @@ restart:
 		chain_length++;
 
 		rthp = &rth->u.dst.rt_next;
-
-		/*
-		 * check to see if the next entry in the chain
-		 * contains the same hash input values as rt.  If it does
-		 * This is where we will insert into the list, instead of
-		 * at the head.  This groups entries that differ by aspects not
-		 * relvant to the hash function together, which we use to adjust
-		 * our chain length
-		 */
-		if (*rthp && compare_hash_inputs(&(*rthp)->fl, &rt->fl))
-			rthi = rth;
 	}
 
 	if (cand) {
@@ -1205,10 +1190,7 @@ restart:
 		}
 	}
 
-	if (rthi)
-		rt->u.dst.rt_next = rthi->u.dst.rt_next;
-	else
-		rt->u.dst.rt_next = rt_hash_table[hash].chain;
+	rt->u.dst.rt_next = rt_hash_table[hash].chain;
 
 #if RT_CACHE_DEBUG >= 2
 	if (rt->u.dst.rt_next) {
@@ -1224,10 +1206,7 @@ restart:
 	 * previous writes to rt are comitted to memory
 	 * before making rt visible to other CPUS.
 	 */
-	if (rthi)
-		rcu_assign_pointer(rthi->u.dst.rt_next, rt);
-	else
-		rcu_assign_pointer(rt_hash_table[hash].chain, rt);
+	rcu_assign_pointer(rt_hash_table[hash].chain, rt);
 
 	spin_unlock_bh(rt_hash_lock_addr(hash));
 	*rp = rt;


  reply	other threads:[~2009-05-19 18:48 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-05-19  2:35 Fw: [Bug 13339] New: rtable leak in ipv4/route.c Stephen Hemminger
2009-05-19 12:34 ` Jarek Poplawski
2009-05-19 15:12   ` Neil Horman
2009-05-19 15:32   ` Eric Dumazet
2009-05-19 16:20     ` Neil Horman
2009-05-19 18:47       ` Eric Dumazet [this message]
2009-05-19 19:24         ` Neil Horman
2009-05-19 22:05           ` David Miller
2009-05-19 23:05             ` Neil Horman
2009-05-20  4:54             ` [PATCH] net: fix length computation in rt_check_expire() Eric Dumazet
2009-05-20  6:13               ` David Miller
2009-05-20  6:14               ` [PATCH] net: fix rtable leak in net/ipv4/route.c Eric Dumazet
2009-05-20 10:03                 ` Jarek Poplawski
2009-05-20 11:13                   ` Eric Dumazet
2009-05-20 11:37                     ` Jarek Poplawski
2009-05-20 10:48                 ` Neil Horman
2009-05-21  0:19                   ` David Miller
2009-05-20 10:27               ` [PATCH] net: fix length computation in rt_check_expire() Neil Horman
2009-05-21  0:19                 ` David Miller
2009-05-19 16:23     ` Fw: [Bug 13339] New: rtable leak in ipv4/route.c Neil Horman
2009-05-19 17:17       ` Jarek Poplawski
2009-05-19 17:45         ` Neil Horman
2009-05-19 17:53           ` Jarek Poplawski
2009-05-19 18:05           ` Jarek Poplawski
2009-05-19 18:16             ` Neil Horman
2009-05-20  6:36               ` Alexander V. Lukyanov
2009-05-19 17:47         ` Jarek Poplawski
2009-05-19 17:22     ` Jarek Poplawski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4A12FEDA.7040806@cosmosbay.com \
    --to=dada1@cosmosbay.com \
    --cc=jarkao2@gmail.com \
    --cc=lav@yar.ru \
    --cc=netdev@vger.kernel.org \
    --cc=nhorman@tuxdriver.com \
    --cc=shemminger@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.