From: Eric Dumazet
Subject: Re: [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free()
Date: Tue, 24 Mar 2009 14:29:29 +0100
Message-ID: <49C8E039.4040505@cosmosbay.com>
References: <49C77D71.8090709@trash.net> <49C780AD.70704@trash.net> <49C7CB9B.1040409@trash.net> <49C8A415.1090606@cosmosbay.com> <49C8CCF4.5050104@cosmosbay.com>
To: Joakim Tjernlund
Cc: avorontsov@ru.mvista.com, Patrick McHardy, netdev@vger.kernel.org, "Paul E. McKenney"

Joakim Tjernlund a écrit :
> Eric Dumazet wrote on 24/03/2009 13:07:16:
>> Joakim Tjernlund a écrit :
>>> Eric Dumazet wrote on 24/03/2009 10:12:53:
>>>> Joakim Tjernlund a écrit :
>>>>> Patrick McHardy wrote on 23/03/2009 18:49:15:
>>>>>> Joakim Tjernlund wrote:
>>>>>>> Patrick McHardy wrote on 23/03/2009 13:29:33:
>>>>>>>
>>>>>>>>> There is no /proc/net/netfilter/nf_conntrack. There is a
>>>>>>>>> /proc/net/nf_conntrack though and it is empty. If I telnet
>>>>>>>>> to the board I see:
>>>>>>>>>
>>>>>>>> That means that something is leaking conntrack references, most likely
>>>>>>>> by leaking skbs. Since I haven't seen any other reports, my guess would
>>>>>>>> be the ucc_geth driver.
>>>>>>>>
>>>>>>> Mucking around with the ucc_geth driver I found that if I:
>>>>>>> - Move TX from IRQ to NAPI context
>>>>>>> - Double the weight
>>>>>>> - After booting up, wait a few minutes until the JFFS2 GC kernel
>>>>>>>   thread has stopped scanning the FS
>>>>>>>
>>>>>>> then the "nf_conntrack: table full, dropping packet." messages stop.
>>>>>>> Does this seem right to you guys?
>>>>>> No. As I said, something seems to be leaking packets. You should be
>>>>>> able to confirm that by checking the sk_buff slabs in /proc/slabinfo.
>>>>>> If that *doesn't* show any signs of a leak, please run "conntrack -E"
>>>>>> to capture the conntrack events before the "table full" message
>>>>>> appears and post the output.
>>>>> skbuff does not differ much, but others do.
>>>>>
>>>>> Before ping:
>>>>> skbuff_fclone_cache     0     0  352  11  1 : tunables  54  27  0 : slabdata    0    0  0
>>>>> skbuff_head_cache      20    20  192  20  1 : tunables 120  60  0 : slabdata    1    1  0
>>>>> size-64               731   767   64  59  1 : tunables 120  60  0 : slabdata   13   13  0
>>>>> nf_conntrack           10    19  208  19  1 : tunables 120  60  0 : slabdata    1    1  0
>>>>>
>>>>> During ping:
>>>>> skbuff_fclone_cache     0     0  352  11  1 : tunables  54  27  0 : slabdata    0    0  0
>>>>> skbuff_head_cache      40    40  192  20  1 : tunables 120  60  0 : slabdata    2    2  0
>>>>> size-64              8909  8909   64  59  1 : tunables 120  60  0 : slabdata  151  151  0
>>>>> nf_conntrack         5111  5111  208  19  1 : tunables 120  60  0 : slabdata  269  269  0
>>>>>
>>>>> This feels more like the freeing of conntrack objects is delayed and
>>>>> builds up when ping flooding.
>>>>>
>>>>> Don't have "conntrack -E" for my embedded board so that will have to
>>>>> wait a bit longer.
>>>> I don't understand how your ping can use so many conntrack entries...
>>>>
>>>> Then, as I said yesterday, I believe you have an RCU delay, because of
>>>> a misbehaving driver or something...
>>>>
>>>> grep RCU .config
>>> grep RCU .config
>>> # RCU Subsystem
>>> CONFIG_CLASSIC_RCU=y
>>> # CONFIG_TREE_RCU is not set
>>> # CONFIG_PREEMPT_RCU is not set
>>> # CONFIG_TREE_RCU_TRACE is not set
>>> # CONFIG_PREEMPT_RCU_TRACE is not set
>>> # CONFIG_RCU_TORTURE_TEST is not set
>>> # CONFIG_RCU_CPU_STALL_DETECTOR is not set
>>>
>>>> grep CONFIG_SMP .config
>>> grep CONFIG_SMP .config
>>> # CONFIG_SMP is not set
>>>
>>>> You could change qhimark from 10000 to 1000 in kernel/rcuclassic.c (line 80)
>>>> as a workaround. It should force a quiescent state after 1000 freed conntracks.
>>>
>>> Right, doing this killed almost all conntrack messages; I had to stress it
>>> pretty hard before I saw a handful of "nf_conntrack: table full, dropping packet".
>>>
>>> RCU is not my cup of tea, do you have any ideas where to look?
>>
>> In a stress situation, you feed more deleted conntracks to call_rcu() than
>> the blimit (10 real freeings per RCU softirq invocation).
>>
>> So with the default qhimark being 10000, this means about 10000 conntracks
>> can sit in RCU (per CPU) before being really freed.
>>
>> Only when hitting 10000 does RCU enter a special mode to free all queued
>> items, instead of a small batch of 10.
>>
>> To solve your problem we can:
>>
>> 1) reduce qhimark from 10000 to 1000 (for example)
>>    Probably should be done anyway, to reduce spikes in RCU code when
>>    freeing a whole batch of 10000 elements...
>> OR
>> 2) change the conntrack tunable (max conntrack entries on your machine)
>> OR
>> 3) change net/netfilter/nf_conntrack_core.c to decrement net->ct.count
>>    in nf_conntrack_free() instead of in the RCU callback.
>>
>> [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free()
>
> The patch fixes the problem and the system feels a bit more responsive
> too, thanks.
> I guess I should probably do both 1) and 3) as my board is pretty slow too.
>
> Been trying to figure out a good value for NAPI weight too. Currently my
> HW RX and TX queues are 16 packets deep and the weight is 16 too. If I move
> TX processing to NAPI context AND increase the weight to 32, the system is
> a lot more responsive during ping flooding. Does a weight of 32 make sense
> when the HW TX and RX queues are 16?

If you only have one NIC, I don't understand why changing weight should
make a difference. Are you referring to dev_weight or netdev_budget?

# cat /proc/sys/net/core/dev_weight
64
# cat /proc/sys/net/core/netdev_budget
300
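
BTW, for reference, option 3) above boils down to something like the sketch
below: the atomic_dec() of net->ct.count moves out of the RCU callback and
into nf_conntrack_free() itself, so the count drops as soon as the entry is
scheduled for freeing instead of after the grace period. This is only a rough
sketch against a 2.6.29-era net/netfilter/nf_conntrack_core.c; the exact
context lines and the actual committed patch may differ.

/*
 * Sketch only -- assumes 2.6.29-era nf_conntrack_core.c; the real
 * patch may differ.  The point: net->ct.count is decremented
 * immediately in nf_conntrack_free(), not in the RCU callback, so a
 * burst of deletions no longer keeps the count inflated for a full
 * RCU grace period (up to qhimark queued objects per CPU).
 */
static void nf_conntrack_free_rcu(struct rcu_head *head)
{
	struct nf_conn *ct = container_of(head, struct nf_conn, rcu);

	nf_ct_ext_free(ct);
	kmem_cache_free(nf_conntrack_cachep, ct);
	/* atomic_dec(&net->ct.count) used to live here */
}

void nf_conntrack_free(struct nf_conn *ct)
{
	struct net *net = nf_ct_net(ct);

	nf_ct_ext_destroy(ct);
	atomic_dec(&net->ct.count);	/* moved out of the RCU callback */
	call_rcu(&ct->rcu, nf_conntrack_free_rcu);
}

With the count decremented earlier, the "table full" check against
nf_conntrack_max no longer counts entries that are merely waiting for the
RCU grace period to expire.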