From: Eric Dumazet <dada1@cosmosbay.com>
To: Neil Horman <nhorman@tuxdriver.com>
Cc: Jarek Poplawski <jarkao2@gmail.com>,
lav@yar.ru, Stephen Hemminger <shemminger@linux-foundation.org>,
netdev@vger.kernel.org
Subject: Re: Fw: [Bug 13339] New: rtable leak in ipv4/route.c
Date: Tue, 19 May 2009 20:47:54 +0200 [thread overview]
Message-ID: <4A12FEDA.7040806@cosmosbay.com> (raw)
In-Reply-To: <20090519162048.GB28034@hmsreliant.think-freely.org>
Neil Horman a écrit :
> On Tue, May 19, 2009 at 05:32:29PM +0200, Eric Dumazet wrote:
>> Jarek Poplawski a écrit :
>>> On 19-05-2009 04:35, Stephen Hemminger wrote:
>>>> Begin forwarded message:
>>>>
>>>> Date: Mon, 18 May 2009 14:10:20 GMT
>>>> From: bugzilla-daemon@bugzilla.kernel.org
>>>> To: shemminger@linux-foundation.org
>>>> Subject: [Bug 13339] New: rtable leak in ipv4/route.c
>>>>
>>>>
>>>> http://bugzilla.kernel.org/show_bug.cgi?id=13339
>>> ...
>>>> 2.6.29 patch has introduced flexible route cache rebuilding. Unfortunately the
>>>> patch has at least one critical flaw, and another problem.
>>>>
>>>> rt_intern_hash calculates rthi pointer, which is later used for new entry
>>>> insertion. The same loop calculates cand pointer which is used to clean the
>>>> list. If the pointers are the same, rtable leak occurs, as first the cand is
>>>> removed then the new entry is appended to it.
>>>>
>>>> This leak leads to unregister_netdevice problem (usage count > 0).
>>>>
>>>> Another problem of the patch is that it tries to insert the entries in certain
>>>> order, to facilitate counting of entries distinct by all but QoS parameters.
>>>> Unfortunately, referencing an existing rtable entry moves it to list beginning,
>>>> to speed up further lookups, so the carefully built order is destroyed.
>> We could change rt_check_expire() to be smarter and handle any order in chains.
>>
>> This would let rt_intern_hash() be simpler.
>>
>> As its a more performance critical path, all would be good :)
>>
>>>> For the first problem the simplest patch it to set rthi=0 when rthi==cand, but
>>>> it will also destroy the ordering.
>>> I think fixing this bug fast is more important than this
>>> ordering or counting. Could you send your patch proposal?
>>>
>
> I was thinking something along these lines (also completely untested). The
> extra check in rt_intern hash prevents us from identifying a 'group leader' that
> is about to be deleted, and sets rthi to point at the group leader rather than
> the last entry in the group as it previously did, which we then mark with the
> new flag in the dst_entry (updating it on rt_free of course). This lets us
> identify the boundaries of a group, so when we do a move to front heruistic, we
> can move the entire group rather than just the single entry. I prefer this to
> Erics approach since, even though rt_check_expire isn't in a performance
> critical path, the addition of the extra list search when the list is unordered
> makes rt_check_expire run in O(n^2) rather than O(n), which I think will have a
> significant impact on high traffic systems. Since the ordered list creation was
> done at almost zero performance cost in rt_intern_hash, I'd like to keep it if
> at all possible.
>
> There is of course a shortcomming in my patch in that it doesn't actually do
> anything currently with DST_GRPLDR other than set/clear it appropriately, I've
> yet to identify where the route cache move to front heruistic is implemented.
> If someone could show me where that is, I'd be happy to finish this patch.
>
O(n^2) ? Not exactly ...
N * (Y + x) instead of N * (Y), where x <= Y/100, and we can make x being 0
in fact with proper prefetching.
Real cost is bringing into CPU cache the information needed, by two
orders of magnitude. (one cache line miss : more than 200 cycles,
and we need at least two cache lines per rtable.)
Adding 'group' notion to dst entries is overengineering an already
too complex ipv4/route.c file, that is almost unreadable these days,
since even you can not spot the point where we move the entry at chain
front.
Hint : search for /* Put it first */ in rt_intern_hash()
Moving whole group in front would defeat the purpose of move, actually,
since rank in chain is used to decay the timeout in garbage collector.
(search for tmo >>= 1; )
Here is is a revised patch, doing the length computation inside
the loop, while cpu is prefeching next entry into its cache.
BTW, we actually have another bug, since length is set to 0 at the wrong place
(before the for (; goal > 0; goal--) loop)
We must of course reset length to 0 for each chain.
All credits for Alexander V. Lukyanov for doing the analysis of course :)
net/ipv4/route.c | 57 ++++++++++++++-------------------------------
1 files changed, 18 insertions(+), 39 deletions(-)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index c4c60e9..769f72e 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -784,7 +784,7 @@ static void rt_check_expire(void)
{
static unsigned int rover;
unsigned int i = rover, goal;
- struct rtable *rth, **rthp;
+ struct rtable *rth, *aux, **rthp;
unsigned long length = 0, samples = 0;
unsigned long sum = 0, sum2 = 0;
u64 mult;
@@ -795,7 +795,6 @@ static void rt_check_expire(void)
goal = (unsigned int)mult;
if (goal > rt_hash_mask)
goal = rt_hash_mask + 1;
- length = 0;
for (; goal > 0; goal--) {
unsigned long tmo = ip_rt_gc_timeout;
@@ -809,8 +808,10 @@ static void rt_check_expire(void)
if (*rthp == NULL)
continue;
+ length = 0;
spin_lock_bh(rt_hash_lock_addr(i));
while ((rth = *rthp) != NULL) {
+ prefetch(rth->u.dst.rt_next);
if (rt_is_expired(rth)) {
*rthp = rth->u.dst.rt_next;
rt_free(rth);
@@ -819,33 +820,30 @@ static void rt_check_expire(void)
if (rth->u.dst.expires) {
/* Entry is expired even if it is in use */
if (time_before_eq(jiffies, rth->u.dst.expires)) {
+nofree:
tmo >>= 1;
rthp = &rth->u.dst.rt_next;
/*
- * Only bump our length if the hash
- * inputs on entries n and n+1 are not
- * the same, we only count entries on
+ * We only count entries on
* a chain with equal hash inputs once
* so that entries for different QOS
* levels, and other non-hash input
* attributes don't unfairly skew
* the length computation
*/
- if ((*rthp == NULL) ||
- !compare_hash_inputs(&(*rthp)->fl,
- &rth->fl))
- length += ONE;
+ for (aux = rt_hash_table[i].chain;;) {
+ if (aux == rth) {
+ length += ONE;
+ break;
+ }
+ if (compare_hash_inputs(&aux->fl, &rth->fl))
+ break;
+ aux = aux->u.dst.rt_next;
+ }
continue;
}
- } else if (!rt_may_expire(rth, tmo, ip_rt_gc_timeout)) {
- tmo >>= 1;
- rthp = &rth->u.dst.rt_next;
- if ((*rthp == NULL) ||
- !compare_hash_inputs(&(*rthp)->fl,
- &rth->fl))
- length += ONE;
- continue;
- }
+ } else if (!rt_may_expire(rth, tmo, ip_rt_gc_timeout))
+ goto nofree;
/* Cleanup aged off entries. */
*rthp = rth->u.dst.rt_next;
@@ -1068,7 +1066,6 @@ out: return 0;
static int rt_intern_hash(unsigned hash, struct rtable *rt, struct rtable **rp)
{
struct rtable *rth, **rthp;
- struct rtable *rthi;
unsigned long now;
struct rtable *cand, **candp;
u32 min_score;
@@ -1088,7 +1085,6 @@ restart:
}
rthp = &rt_hash_table[hash].chain;
- rthi = NULL;
spin_lock_bh(rt_hash_lock_addr(hash));
while ((rth = *rthp) != NULL) {
@@ -1134,17 +1130,6 @@ restart:
chain_length++;
rthp = &rth->u.dst.rt_next;
-
- /*
- * check to see if the next entry in the chain
- * contains the same hash input values as rt. If it does
- * This is where we will insert into the list, instead of
- * at the head. This groups entries that differ by aspects not
- * relvant to the hash function together, which we use to adjust
- * our chain length
- */
- if (*rthp && compare_hash_inputs(&(*rthp)->fl, &rt->fl))
- rthi = rth;
}
if (cand) {
@@ -1205,10 +1190,7 @@ restart:
}
}
- if (rthi)
- rt->u.dst.rt_next = rthi->u.dst.rt_next;
- else
- rt->u.dst.rt_next = rt_hash_table[hash].chain;
+ rt->u.dst.rt_next = rt_hash_table[hash].chain;
#if RT_CACHE_DEBUG >= 2
if (rt->u.dst.rt_next) {
@@ -1224,10 +1206,7 @@ restart:
* previous writes to rt are comitted to memory
* before making rt visible to other CPUS.
*/
- if (rthi)
- rcu_assign_pointer(rthi->u.dst.rt_next, rt);
- else
- rcu_assign_pointer(rt_hash_table[hash].chain, rt);
+ rcu_assign_pointer(rt_hash_table[hash].chain, rt);
spin_unlock_bh(rt_hash_lock_addr(hash));
*rp = rt;
next prev parent reply other threads:[~2009-05-19 18:48 UTC|newest]
Thread overview: 28+ messages / expand[flat|nested] mbox.gz Atom feed top
2009-05-19 2:35 Fw: [Bug 13339] New: rtable leak in ipv4/route.c Stephen Hemminger
2009-05-19 12:34 ` Jarek Poplawski
2009-05-19 15:12 ` Neil Horman
2009-05-19 15:32 ` Eric Dumazet
2009-05-19 16:20 ` Neil Horman
2009-05-19 18:47 ` Eric Dumazet [this message]
2009-05-19 19:24 ` Neil Horman
2009-05-19 22:05 ` David Miller
2009-05-19 23:05 ` Neil Horman
2009-05-20 4:54 ` [PATCH] net: fix length computation in rt_check_expire() Eric Dumazet
2009-05-20 6:13 ` David Miller
2009-05-20 6:14 ` [PATCH] net: fix rtable leak in net/ipv4/route.c Eric Dumazet
2009-05-20 10:03 ` Jarek Poplawski
2009-05-20 11:13 ` Eric Dumazet
2009-05-20 11:37 ` Jarek Poplawski
2009-05-20 10:48 ` Neil Horman
2009-05-21 0:19 ` David Miller
2009-05-20 10:27 ` [PATCH] net: fix length computation in rt_check_expire() Neil Horman
2009-05-21 0:19 ` David Miller
2009-05-19 16:23 ` Fw: [Bug 13339] New: rtable leak in ipv4/route.c Neil Horman
2009-05-19 17:17 ` Jarek Poplawski
2009-05-19 17:45 ` Neil Horman
2009-05-19 17:53 ` Jarek Poplawski
2009-05-19 18:05 ` Jarek Poplawski
2009-05-19 18:16 ` Neil Horman
2009-05-20 6:36 ` Alexander V. Lukyanov
2009-05-19 17:47 ` Jarek Poplawski
2009-05-19 17:22 ` Jarek Poplawski
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4A12FEDA.7040806@cosmosbay.com \
--to=dada1@cosmosbay.com \
--cc=jarkao2@gmail.com \
--cc=lav@yar.ru \
--cc=netdev@vger.kernel.org \
--cc=nhorman@tuxdriver.com \
--cc=shemminger@linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).