netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Eric Dumazet <dada1@cosmosbay.com>
To: Jarek Poplawski <jarkao2@gmail.com>
Cc: lav@yar.ru, Stephen Hemminger <shemminger@linux-foundation.org>,
	netdev@vger.kernel.org, Neil Horman <nhorman@tuxdriver.com>
Subject: Re: Fw: [Bug 13339] New: rtable leak in ipv4/route.c
Date: Tue, 19 May 2009 17:32:29 +0200	[thread overview]
Message-ID: <4A12D10D.3000504@cosmosbay.com> (raw)
In-Reply-To: <20090519123417.GA7376@ff.dom.local>

Jarek Poplawski a écrit :
> On 19-05-2009 04:35, Stephen Hemminger wrote:
>> Begin forwarded message:
>>
>> Date: Mon, 18 May 2009 14:10:20 GMT
>> From: bugzilla-daemon@bugzilla.kernel.org
>> To: shemminger@linux-foundation.org
>> Subject: [Bug 13339] New: rtable leak in ipv4/route.c
>>
>>
>> http://bugzilla.kernel.org/show_bug.cgi?id=13339
> ...
>> 2.6.29 patch has introduced flexible route cache rebuilding. Unfortunately the
>> patch has at least one critical flaw, and another problem.
>>
>> rt_intern_hash calculates rthi pointer, which is later used for new entry
>> insertion. The same loop calculates cand pointer which is used to clean the
>> list. If the pointers are the same, rtable leak occurs, as first the cand is
>> removed then the new entry is appended to it.
>>
>> This leak leads to unregister_netdevice problem (usage count > 0).
>>
>> Another problem of the patch is that it tries to insert the entries in certain
>> order, to facilitate counting of entries distinct by all but QoS parameters.
>> Unfortunately, referencing an existing rtable entry moves it to list beginning,
>> to speed up further lookups, so the carefully built order is destroyed.

We could change rt_check_expire() to be smarter and handle any order in chains.

This would let rt_intern_hash() be simpler.

As its a more performance critical path, all would be good :)

>>
>> For the first problem the simplest patch it to set rthi=0 when rthi==cand, but
>> it will also destroy the ordering.
> 
> I think fixing this bug fast is more important than this
> ordering or counting. Could you send your patch proposal?
> 

Here is mine, only compiled, not tested yet.

All credits for Stephen for doing the analysis of course :)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index c4c60e9..fbe77ad 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -780,12 +780,37 @@ static void rt_do_flush(int process_context)
 #define FRACT_BITS 3
 #define ONE (1UL << FRACT_BITS)
 
+static unsigned long compute_length(struct rtable *head)
+{
+	struct rtable *rth, *aux;
+	unsigned long length = 0;
+
+	for (rth = head; rth != NULL; rth = rth->u.dst.rt_next) {
+		/*
+		 * We ignore items having same hash inputs
+		 * so that entries for different QOS
+		 * levels, and other non-hash input
+		 * attributes don't unfairly skew
+		 * the length computation
+		 */
+		for (aux = head; ;aux = aux->u.dst.rt_next) {
+			if (aux == rth) {
+				length += ONE;
+				break;
+			}
+			if (compare_hash_inputs(&aux->fl, &rth->fl))
+				break;
+		}
+	}
+	return length;
+}
+
 static void rt_check_expire(void)
 {
 	static unsigned int rover;
 	unsigned int i = rover, goal;
 	struct rtable *rth, **rthp;
-	unsigned long length = 0, samples = 0;
+	unsigned long length, samples = 0;
 	unsigned long sum = 0, sum2 = 0;
 	u64 mult;
 
@@ -795,7 +820,6 @@ static void rt_check_expire(void)
 	goal = (unsigned int)mult;
 	if (goal > rt_hash_mask)
 		goal = rt_hash_mask + 1;
-	length = 0;
 	for (; goal > 0; goal--) {
 		unsigned long tmo = ip_rt_gc_timeout;
 
@@ -821,29 +845,11 @@ static void rt_check_expire(void)
 				if (time_before_eq(jiffies, rth->u.dst.expires)) {
 					tmo >>= 1;
 					rthp = &rth->u.dst.rt_next;
-					/*
-					 * Only bump our length if the hash
-					 * inputs on entries n and n+1 are not
-					 * the same, we only count entries on
-					 * a chain with equal hash inputs once
-					 * so that entries for different QOS
-					 * levels, and other non-hash input
-					 * attributes don't unfairly skew
-					 * the length computation
-					 */
-					if ((*rthp == NULL) ||
-					    !compare_hash_inputs(&(*rthp)->fl,
-								 &rth->fl))
-						length += ONE;
 					continue;
 				}
 			} else if (!rt_may_expire(rth, tmo, ip_rt_gc_timeout)) {
 				tmo >>= 1;
 				rthp = &rth->u.dst.rt_next;
-				if ((*rthp == NULL) ||
-				    !compare_hash_inputs(&(*rthp)->fl,
-							 &rth->fl))
-					length += ONE;
 				continue;
 			}
 
@@ -851,6 +857,7 @@ static void rt_check_expire(void)
 			*rthp = rth->u.dst.rt_next;
 			rt_free(rth);
 		}
+		length = compute_length(rt_hash_table[i].chain);
 		spin_unlock_bh(rt_hash_lock_addr(i));
 		sum += length;
 		sum2 += length*length;
@@ -1068,7 +1075,6 @@ out:	return 0;
 static int rt_intern_hash(unsigned hash, struct rtable *rt, struct rtable **rp)
 {
 	struct rtable	*rth, **rthp;
-	struct rtable	*rthi;
 	unsigned long	now;
 	struct rtable *cand, **candp;
 	u32 		min_score;
@@ -1088,7 +1094,6 @@ restart:
 	}
 
 	rthp = &rt_hash_table[hash].chain;
-	rthi = NULL;
 
 	spin_lock_bh(rt_hash_lock_addr(hash));
 	while ((rth = *rthp) != NULL) {
@@ -1134,17 +1139,6 @@ restart:
 		chain_length++;
 
 		rthp = &rth->u.dst.rt_next;
-
-		/*
-		 * check to see if the next entry in the chain
-		 * contains the same hash input values as rt.  If it does
-		 * This is where we will insert into the list, instead of
-		 * at the head.  This groups entries that differ by aspects not
-		 * relvant to the hash function together, which we use to adjust
-		 * our chain length
-		 */
-		if (*rthp && compare_hash_inputs(&(*rthp)->fl, &rt->fl))
-			rthi = rth;
 	}
 
 	if (cand) {
@@ -1205,10 +1199,7 @@ restart:
 		}
 	}
 
-	if (rthi)
-		rt->u.dst.rt_next = rthi->u.dst.rt_next;
-	else
-		rt->u.dst.rt_next = rt_hash_table[hash].chain;
+	rt->u.dst.rt_next = rt_hash_table[hash].chain;
 
 #if RT_CACHE_DEBUG >= 2
 	if (rt->u.dst.rt_next) {
@@ -1224,10 +1215,7 @@ restart:
 	 * previous writes to rt are comitted to memory
 	 * before making rt visible to other CPUS.
 	 */
-	if (rthi)
-		rcu_assign_pointer(rthi->u.dst.rt_next, rt);
-	else
-		rcu_assign_pointer(rt_hash_table[hash].chain, rt);
+	rcu_assign_pointer(rt_hash_table[hash].chain, rt);
 
 	spin_unlock_bh(rt_hash_lock_addr(hash));
 	*rp = rt;


  parent reply	other threads:[~2009-05-19 15:32 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-05-19  2:35 Fw: [Bug 13339] New: rtable leak in ipv4/route.c Stephen Hemminger
2009-05-19 12:34 ` Jarek Poplawski
2009-05-19 15:12   ` Neil Horman
2009-05-19 15:32   ` Eric Dumazet [this message]
2009-05-19 16:20     ` Neil Horman
2009-05-19 18:47       ` Eric Dumazet
2009-05-19 19:24         ` Neil Horman
2009-05-19 22:05           ` David Miller
2009-05-19 23:05             ` Neil Horman
2009-05-20  4:54             ` [PATCH] net: fix length computation in rt_check_expire() Eric Dumazet
2009-05-20  6:13               ` David Miller
2009-05-20  6:14               ` [PATCH] net: fix rtable leak in net/ipv4/route.c Eric Dumazet
2009-05-20 10:03                 ` Jarek Poplawski
2009-05-20 11:13                   ` Eric Dumazet
2009-05-20 11:37                     ` Jarek Poplawski
2009-05-20 10:48                 ` Neil Horman
2009-05-21  0:19                   ` David Miller
2009-05-20 10:27               ` [PATCH] net: fix length computation in rt_check_expire() Neil Horman
2009-05-21  0:19                 ` David Miller
2009-05-19 16:23     ` Fw: [Bug 13339] New: rtable leak in ipv4/route.c Neil Horman
2009-05-19 17:17       ` Jarek Poplawski
2009-05-19 17:45         ` Neil Horman
2009-05-19 17:53           ` Jarek Poplawski
2009-05-19 18:05           ` Jarek Poplawski
2009-05-19 18:16             ` Neil Horman
2009-05-20  6:36               ` Alexander V. Lukyanov
2009-05-19 17:47         ` Jarek Poplawski
2009-05-19 17:22     ` Jarek Poplawski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4A12D10D.3000504@cosmosbay.com \
    --to=dada1@cosmosbay.com \
    --cc=jarkao2@gmail.com \
    --cc=lav@yar.ru \
    --cc=netdev@vger.kernel.org \
    --cc=nhorman@tuxdriver.com \
    --cc=shemminger@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).