From mboxrd@z Thu Jan 1 00:00:00 1970 From: Eric Dumazet Subject: Re: [PATCH] net: implement emergency route cache rebulds when gc_elasticity is exceeded Date: Tue, 30 Sep 2008 08:02:44 +0200 Message-ID: <48E1C104.2080801@cosmosbay.com> References: <20080929191254.GA20074@hmsreliant.think-freely.org> <48E138EB.1080001@cosmosbay.com> <20080929202731.GC20074@hmsreliant.think-freely.org> <48E141F3.9000903@cosmosbay.com> <20080929223801.GA3157@hmsreliant.think-freely.org> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: QUOTED-PRINTABLE Cc: netdev@vger.kernel.org, kuznet@ms2.inr.ac.ru, davem@davemloft.net, pekkas@netcore.fi, jmorris@namei.org, yoshfuji@linux-ipv6.org, kaber@trash.net To: Neil Horman Return-path: Received: from smtp2f.orange.fr ([80.12.242.152]:42263 "EHLO smtp2f.orange.fr" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751158AbYI3GC4 convert rfc822-to-8bit (ORCPT ); Tue, 30 Sep 2008 02:02:56 -0400 In-Reply-To: <20080929223801.GA3157@hmsreliant.think-freely.org> Sender: netdev-owner@vger.kernel.org List-ID: Neil Horman a =E9crit : > On Mon, Sep 29, 2008 at 11:00:35PM +0200, Eric Dumazet wrote: >> Neil Horman a =E9crit : >>> On Mon, Sep 29, 2008 at 10:22:03PM +0200, Eric Dumazet wrote: >>>> Neil Horman a =E9crit : >>>>> Hey all- >>>>> We currently have the ability to disable our route cache secret = interval >>>>> rebuild timer (by setting it to zero), but if we do that its poss= ible for an >>>>> attacker (if they guess our route cache hash secret, to fill our = system with >>>>> routes that all hash to the same bucket, destroying our performan= ce. This patch >>>>> provides a backstop for that issues. In the event that our rebui= ld interval is >>>>> disabled (or very large), if any hash chain exceeds ip_rt_gc_elas= ticity, we do >>>>> an emergency hash rebuild. During the hash rebuild we: >>>>> 1) warn the user of the emergency >>>>> 2) disable the rebuild timer >>>>> 3) invalidate the route caches >>>>> 4) re-enable the rebuild timer with its old value >>>>> >>>>> Regards >>>>> Neil >>>> This sounds not good at all to me. >>>> >>>> 1) Dont set ip_rt_secret_interval to zero, this is plain silly, si= nce >>>> you give attackers infinite time to break your machine. >>>> >>>> To quote Herbert (who allowed to set this interval to 0) >>>> >>>> "Let me first state that disabling the route cache hash rebuild >>>> should not be done without extensive analysis on the risk prof= ile >>>> and careful deliberation. >>>> >>>> However, there are times when this can be done safely or for >>>> testing. For example, when you have mechanisms for ensuring >>>> that offending parties do not exist in your network." >>>> >>> Thats really rather the motivation behind this. The patch that Her= bert >>> submitted with that commit explicitly lets one disable their rebuil= d timer. I >>> agree its stupid to do that, but we added code to allow it. This p= rovides a >>> patch to help people who are victimized because they've done exactl= y this >>> (additionaly providing them a warning to stop doing it). >> Strange... many kernel parameters can be set to hazardous values tha= t make machine unusable... >> >> ip_rt_gc_interval can also be set to a very large value : No more ro= ute cache gc >> > If you want to talk philosophy, then I accept your premise: we can tu= ne the > kernel in ways that make it unusable. Does that mean we should avoid= doing > things to prevent that and maintain usibility. Ithink this patch acc= omplishes > that goal in its small way. > =20 >>> >>>> 2) Many machines have ip_rt_gc_elasticity set to 2, >>>> because they have a huge hash table, but low chain depths. >>> Ok, that seem reasonable, and this isn't going to disallow that. B= y the same >>> resoning, people who have huge hash tables, and low chain depths wo= n't >>> want their low chain length being violated, would they? This patch= will warn >>> them if their assumptions are being violated. >> Warn only ? If I read your patch, you not only warn in this case. > No, not warn only, Warn and correct is the clear intent here >=20 >> (you invalidate cache for each struct net, potentially wraping rt_ge= nid) >> > If you overflow your elasticity 2^16 times, yes (I think rt_genid is = a 16 bit > value, but I don't have the kernel in front of me). I would hope tha= t would be > enough warnings to make a sysadmin address the problem >=20 >> When you have 2^20 slots in route cache hash table, you dont care if= few slots have 3 or 4 elements. >> And chance is very high that more than one slot has 3 or even 4 elem= ents, no need for an attacker. > Ok, then I would assert if your ok with a few chains getting to be X = entries in > length, then you should set your elasticity to be X. Nope. We set elasticity to 2 when we want to trigger code starting at l= ine 1048 if (cand) { if (chain_length > ip_rt_gc_elasticity) { *candp =3D cand->u.dst.rt_next; rt_free(cand); } } That is, freeing one entry if chain length is > 2 AND one entry is free= able. This means that *some* chains eventually can have live entries and leng= th > 2 without any problem. This is not an emergency case. I have seen on real= =20 machines some chains hitting length of 13 (elasticity=3D2), that was no= rmal traffic, and rt cache was on equilibrium. Your patch adds to the "if (chain_length > elasticity)" case : If no entry is freeable in this slot, invalidate *all* cache and put a = warning. Invalidating live entries is puting a high pressure on dst_garbage.list= =2E Having 1.000.000 entries in this list is not very cool, and should be d= one only if really necessary. When you know you have about 1.000.000 live entries in rt cache, you can safely make your hash table sized to 2^20 slots, and elasticity set= to 2 so that average chain length is <=3D 2, reducing cache misses at lookup= time. That is quite important to be able to handle 100.000 packets per second >=20 >> Now if you change your code to something like >> >> if (unlikely(chain_length > some_quite_big_number && >> ip_rt_secret_interval =3D=3D 0)) { >> do_something(); >> } >> >> some_quite_big_number could be >=3D 30 or something... >> > I'd be ok with that, if some others chimed in with consensus on the m= atter. I > felt that we had a value definining what chain length should be > (ip_rt_gc_elasticity is already used in comparison on chain length in > rt_intern_hash, so I was keeping with that). But if some others spea= k up and > support a new sysctl, I can get behind that I said 30, but could have said 100. No need for a sysctl. If only one chain is really that long (and attacked), all its entries a= re hot. If many chains are hit, then other rt...sysctl_params will control and = limit cache grow. >=20 >> then it might be OK (at least it wont break common setups) >> > I don't think it will break many setups, the default value is 8 for t= his sysctl. > If someone has set it lower, and sees a warning, they won't loose the= ir system > altogether, they'll just see a momentary reduction in throughput, and= a > warning to increase the value they have set. Its going to far to say= anything > will 'break'. I spent many days so that route cache doesnt crash some big machines I = have around, I have feelings your patch will make them crawl again, unless sysadmin = is smart enough to change once again rt sysctls. So my replies are not just random things from me. When a machine is targeted by a DDOS attack, about all slots of the has= h table are fully loaded (ie chain length >=3D elasticity). We dont need to inv= alidate=20 the cache, but find an equilibrium, with small adjustements. Thank you