From mboxrd@z Thu Jan  1 00:00:00 1970
From: Eric Dumazet
Subject: [PATCH] net: fix rtable leak in net/ipv4/route.c
Date: Wed, 20 May 2009 08:14:28 +0200
Message-ID: <4A139FC4.7030309@cosmosbay.com>
References: <20090519162048.GB28034@hmsreliant.think-freely.org> <4A12FEDA.7040806@cosmosbay.com> <20090519192450.GF28034@hmsreliant.think-freely.org> <20090519.150517.62361946.davem@davemloft.net> <4A138CFE.5070901@cosmosbay.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: nhorman@tuxdriver.com, jarkao2@gmail.com, lav@yar.ru, shemminger@linux-foundation.org, netdev@vger.kernel.org
To: David Miller
Return-path:
Received: from gw1.cosmosbay.com ([212.99.114.194]:37102 "EHLO gw1.cosmosbay.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751680AbZETGOn convert rfc822-to-8bit (ORCPT ); Wed, 20 May 2009 02:14:43 -0400
In-Reply-To: <4A138CFE.5070901@cosmosbay.com>
Sender: netdev-owner@vger.kernel.org
List-ID:

Eric Dumazet wrote:
> David Miller wrote:
>> From: Neil Horman
>> Date: Tue, 19 May 2009 15:24:50 -0400
>>
>>>> Moving the whole group to the front would defeat the purpose of the move,
>>>> actually, since rank in the chain is used to decay the timeout in the
>>>> garbage collector. (search for tmo >>= 1; )
>>>>
>>> Argh, so the list is implicitly ordered by expiration time. That
>>> really defeats the entire purpose of doing grouping in the list at
>>> all. If that's the case, then I agree, it's probably better to
>>> take the additional visitation hit in check_expire above than to
>>> try and preserve ordering.
>> Yes, this seems best.
>>
>> I was worried that somehow the ordering also influences lookups,
>> because the TOS bits don't go into the hash, so I worried that it would
>> be important that explicit TOS values appear before wildcard ones.
>> But it doesn't appear that this is an issue; we don't have wildcard
>> TOSs in the rtable entries, they are always explicit.
>>
>> So I would like to see an explicit final patch from Eric so we can get
>> this fixed now.
>>
>
> I would like to split the patches because we have two bugs indeed, and
> I prefer to get attention for both problems; I don't remember Neil
> acknowledging the length computation problem.
>
> First and small patch, candidate for net-2.6 and stable (for 2.6.29):
>

Then here is the patch, on top of the previous one. Thanks to all.

[PATCH] net: fix rtable leak in net/ipv4/route.c

Alexander V. Lukyanov found a regression in 2.6.29 and made a complete
analysis, found at http://bugzilla.kernel.org/show_bug.cgi?id=13339
It is quoted here because it is a perfect one:

begin_of_quotation
 The 2.6.29 patch has introduced flexible route cache rebuilding.
 Unfortunately the patch has at least one critical flaw, and another problem.

 rt_intern_hash calculates the rthi pointer, which is later used for new
 entry insertion. The same loop calculates the cand pointer, which is used
 to clean the list. If the pointers are the same, an rtable leak occurs, as
 first cand is removed and then the new entry is appended to it. This leak
 leads to an unregister_netdevice problem (usage count > 0).

 Another problem of the patch is that it tries to insert the entries in a
 certain order, to facilitate counting of entries distinct by all but QoS
 parameters. Unfortunately, referencing an existing rtable entry moves it
 to the list beginning, to speed up further lookups, so the carefully built
 order is destroyed.

 For the first problem the simplest fix is to set rthi=0 when rthi==cand,
 but it will also destroy the ordering.
end_of_quotation

Problematic commit is 1080d709fb9d8cd4392f93476ee46a9d6ea05a5b
(net: implement emergency route cache rebulds when gc_elasticity is exceeded)

Trying to keep dst_entries ordered is too complex and breaks the fact that
order should depend on the frequency of use for garbage collection.
A possible fix is to make rt_intern_hash() simpler, and only make
rt_check_expire() a little bit smarter, able to cope with an arbitrary
entries order.

The added loop runs on cache-hot data while the cpu is prefetching the
next object, so it should go unnoticed.

Reported-and-analyzed-by: Alexander V. Lukyanov
Signed-off-by: Eric Dumazet
---
 net/ipv4/route.c |   55 +++++++++++++--------------------------------
 1 files changed, 17 insertions(+), 38 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 869cf1c..28205e5 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -784,7 +784,7 @@ static void rt_check_expire(void)
 {
 	static unsigned int rover;
 	unsigned int i = rover, goal;
-	struct rtable *rth, **rthp;
+	struct rtable *rth, *aux, **rthp;
 	unsigned long samples = 0;
 	unsigned long sum = 0, sum2 = 0;
 	u64 mult;
@@ -812,6 +812,7 @@ static void rt_check_expire(void)
 		length = 0;
 		spin_lock_bh(rt_hash_lock_addr(i));
 		while ((rth = *rthp) != NULL) {
+			prefetch(rth->u.dst.rt_next);
 			if (rt_is_expired(rth)) {
 				*rthp = rth->u.dst.rt_next;
 				rt_free(rth);
@@ -820,33 +821,30 @@ static void rt_check_expire(void)
 			if (rth->u.dst.expires) {
 				/* Entry is expired even if it is in use */
 				if (time_before_eq(jiffies, rth->u.dst.expires)) {
+nofree:
 					tmo >>= 1;
 					rthp = &rth->u.dst.rt_next;
 					/*
-					 * Only bump our length if the hash
-					 * inputs on entries n and n+1 are not
-					 * the same, we only count entries on
+					 * We only count entries on
 					 * a chain with equal hash inputs once
 					 * so that entries for different QOS
 					 * levels, and other non-hash input
 					 * attributes don't unfairly skew
 					 * the length computation
 					 */
-					if ((*rthp == NULL) ||
-					    !compare_hash_inputs(&(*rthp)->fl,
-								 &rth->fl))
-						length += ONE;
+					for (aux = rt_hash_table[i].chain;;) {
+						if (aux == rth) {
+							length += ONE;
+							break;
+						}
+						if (compare_hash_inputs(&aux->fl, &rth->fl))
+							break;
+						aux = aux->u.dst.rt_next;
+					}
 					continue;
 				}
-			} else if (!rt_may_expire(rth, tmo, ip_rt_gc_timeout)) {
-
-				tmo >>= 1;
-				rthp = &rth->u.dst.rt_next;
-				if ((*rthp == NULL) ||
-				    !compare_hash_inputs(&(*rthp)->fl,
-							 &rth->fl))
-					length += ONE;
-				continue;
-			}
+			} else if (!rt_may_expire(rth, tmo, ip_rt_gc_timeout))
+				goto nofree;
 
 			/* Cleanup aged off entries. */
 			*rthp = rth->u.dst.rt_next;
@@ -1069,7 +1067,6 @@ out:	return 0;
 static int rt_intern_hash(unsigned hash, struct rtable *rt, struct rtable **rp)
 {
 	struct rtable	*rth, **rthp;
-	struct rtable	*rthi;
 	unsigned long	now;
 	struct rtable *cand, **candp;
 	u32 		min_score;
@@ -1089,7 +1086,6 @@ restart:
 	}
 
 	rthp = &rt_hash_table[hash].chain;
-	rthi = NULL;
 
 	spin_lock_bh(rt_hash_lock_addr(hash));
 	while ((rth = *rthp) != NULL) {
@@ -1135,17 +1131,6 @@ restart:
 		chain_length++;
 
 		rthp = &rth->u.dst.rt_next;
-
-		/*
-		 * check to see if the next entry in the chain
-		 * contains the same hash input values as rt.  If it does
-		 * This is where we will insert into the list, instead of
-		 * at the head.  This groups entries that differ by aspects not
-		 * relvant to the hash function together, which we use to adjust
-		 * our chain length
-		 */
-		if (*rthp && compare_hash_inputs(&(*rthp)->fl, &rt->fl))
-			rthi = rth;
 	}
 
 	if (cand) {
@@ -1206,10 +1191,7 @@ restart:
 		}
 	}
 
-	if (rthi)
-		rt->u.dst.rt_next = rthi->u.dst.rt_next;
-	else
-		rt->u.dst.rt_next = rt_hash_table[hash].chain;
+	rt->u.dst.rt_next = rt_hash_table[hash].chain;
 
 #if RT_CACHE_DEBUG >= 2
 	if (rt->u.dst.rt_next) {
@@ -1225,10 +1207,7 @@ restart:
 	 * previous writes to rt are comitted to memory
 	 * before making rt visible to other CPUS.
 	 */
-	if (rthi)
-		rcu_assign_pointer(rthi->u.dst.rt_next, rt);
-	else
-		rcu_assign_pointer(rt_hash_table[hash].chain, rt);
+	rcu_assign_pointer(rt_hash_table[hash].chain, rt);
 
 	spin_unlock_bh(rt_hash_lock_addr(hash));
 	*rp = rt;