From mboxrd@z Thu Jan 1 00:00:00 1970
From: Eric Dumazet
Subject: [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free()
Date: Tue, 24 Mar 2009 13:07:16 +0100
Message-ID: <49C8CCF4.5050104@cosmosbay.com>
References: <49C77D71.8090709@trash.net> <49C780AD.70704@trash.net> <49C7CB9B.1040409@trash.net> <49C8A415.1090606@cosmosbay.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: avorontsov@ru.mvista.com, Patrick McHardy, netdev@vger.kernel.org,
	"Paul E. McKenney"
To: Joakim Tjernlund
Return-path:
Received: from gw1.cosmosbay.com ([212.99.114.194]:46511 "EHLO gw1.cosmosbay.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752179AbZCXMIO
	convert rfc822-to-8bit (ORCPT); Tue, 24 Mar 2009 08:08:14 -0400
In-Reply-To:
Sender: netdev-owner@vger.kernel.org
List-ID:

Joakim Tjernlund wrote:
> Eric Dumazet wrote on 24/03/2009 10:12:53:
>> Joakim Tjernlund wrote:
>>> Patrick McHardy wrote on 23/03/2009 18:49:15:
>>>> Joakim Tjernlund wrote:
>>>>> Patrick McHardy wrote on 23/03/2009 13:29:33:
>>>>>
>>>>>>> There is no /proc/net/netfilter/nf_conntrack. There is a
>>>>>>> /proc/net/nf_conntrack though and it is empty. If I telnet
>>>>>>> to the board I see:
>>>>>>>
>>>>>> That means that something is leaking conntrack references, most likely
>>>>>> by leaking skbs. Since I haven't seen any other reports, my guess would
>>>>>> be the ucc_geth driver.
>>>>>>
>>>>> Mucking around with the ucc_geth driver I found that if I:
>>>>> - move TX from IRQ to NAPI context
>>>>> - double the weight
>>>>> - after booting up, wait a few minutes until the JFFS2 GC kernel thread
>>>>>   has stopped scanning the FS
>>>>> then the "nf_conntrack: table full, dropping packet." messages stop.
>>>>> Does this seem right to you guys?
>>>> No. As I said, something seems to be leaking packets. You should be
>>>> able to confirm that by checking the sk_buff slabs in /proc/slabinfo.
>>>> If that *doesn't* show any signs of a leak, please run "conntrack -E"
>>>> to capture the conntrack events before the "table full" message
>>>> appears and post the output.
>>> skbuff does not differ much, but others do.
>>>
>>> Before ping:
>>> skbuff_fclone_cache    0    0  352 11 1 : tunables  54 27 0 : slabdata   0   0 0
>>> skbuff_head_cache     20   20  192 20 1 : tunables 120 60 0 : slabdata   1   1 0
>>> size-64              731  767   64 59 1 : tunables 120 60 0 : slabdata  13  13 0
>>> nf_conntrack          10   19  208 19 1 : tunables 120 60 0 : slabdata   1   1 0
>>>
>>> During ping:
>>> skbuff_fclone_cache    0    0  352 11 1 : tunables  54 27 0 : slabdata   0   0 0
>>> skbuff_head_cache     40   40  192 20 1 : tunables 120 60 0 : slabdata   2   2 0
>>> size-64             8909 8909   64 59 1 : tunables 120 60 0 : slabdata 151 151 0
>>> nf_conntrack        5111 5111  208 19 1 : tunables 120 60 0 : slabdata 269 269 0
>>>
>>> This feels more like the freeing of conntrack objects being delayed and
>>> building up when ping flooding.
>>>
>>> I don't have "conntrack -E" for my embedded board, so that will have to
>>> wait a bit longer.
>> I don't understand how your ping can use so many conntrack entries...
>>
>> Then, as I said yesterday, I believe you have an RCU delay, because of
>> a misbehaving driver or something...
>>
>> grep RCU .config
> grep RCU .config
> # RCU Subsystem
> CONFIG_CLASSIC_RCU=y
> # CONFIG_TREE_RCU is not set
> # CONFIG_PREEMPT_RCU is not set
> # CONFIG_TREE_RCU_TRACE is not set
> # CONFIG_PREEMPT_RCU_TRACE is not set
> # CONFIG_RCU_TORTURE_TEST is not set
> # CONFIG_RCU_CPU_STALL_DETECTOR is not set
>
>> grep CONFIG_SMP .config
> grep CONFIG_SMP .config
> # CONFIG_SMP is not set
>
>> You could change qhimark from 10000 to 1000 in kernel/rcuclassic.c (line 80)
>> as a workaround. It should force a quiescent state after 1000 freed conntracks.
>
> Right, doing this killed almost all the conntrack messages; I had to stress it
> pretty hard before I saw a handful of "nf_conntrack: table full, dropping packet".
>
> RCU is not my cup of tea, do you have any ideas where to look?

In a stress situation, you feed more deleted conntracks to call_rcu() than
blimit allows to be freed (10 real frees per RCU softirq invocation).

So with the default qhimark of 10000, about 10000 conntracks can sit in RCU
queues (per CPU) before being really freed.

Only when the queue hits 10000 does RCU enter a special mode and free all
queued items at once, instead of a small batch of 10. (A standalone sketch of
this batching follows the patch below.)

To solve your problem we can:

1) Reduce qhimark from 10000 to 1000 (for example).
   This should probably be done anyway, to reduce spikes in RCU code when
   freeing a whole backlog of 10000 elements...

OR

2) Raise the conntrack tunable (the maximum number of conntrack entries on
   your machine).

OR

3) Change net/netfilter/nf_conntrack_core.c to decrement net->ct.count in
   nf_conntrack_free() instead of in the RCU callback.

[PATCH] conntrack: Reduce conntrack count in nf_conntrack_free()

We use RCU to defer freeing of conntrack structures. In a DoS situation, RCU
might accumulate about 10,000 elements per CPU in its internal queues. To get
an accurate conntrack count (at the expense of slightly more RAM used), the
counter should not include elements that are about to be freed and are merely
waiting in RCU queues. We thus decrement it in nf_conntrack_free(), not in the
RCU callback.

Signed-off-by: Eric Dumazet

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index f4935e3..6478dc7 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -516,16 +516,17 @@ EXPORT_SYMBOL_GPL(nf_conntrack_alloc);
 static void nf_conntrack_free_rcu(struct rcu_head *head)
 {
 	struct nf_conn *ct = container_of(head, struct nf_conn, rcu);
-	struct net *net = nf_ct_net(ct);
 
 	nf_ct_ext_free(ct);
 	kmem_cache_free(nf_conntrack_cachep, ct);
-	atomic_dec(&net->ct.count);
 }
 
 void nf_conntrack_free(struct nf_conn *ct)
 {
+	struct net *net = nf_ct_net(ct);
+
 	nf_ct_ext_destroy(ct);
+	atomic_dec(&net->ct.count);
 	call_rcu(&ct->rcu, nf_conntrack_free_rcu);
 }
 EXPORT_SYMBOL_GPL(nf_conntrack_free);
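
For readers not familiar with the classic-RCU batching discussed above, here is
a small standalone toy model (userspace C, not kernel code): each softirq pass
really frees only blimit callbacks, unless the per-CPU backlog crosses qhimark,
at which point the whole queue is flushed. The values 10 and 10000 follow the
defaults quoted above for kernel/rcuclassic.c; everything else (names, the
3000-deletions-per-tick burst) is made up purely for illustration.

/*
 * Toy model of the classic-RCU callback batching described in the mail.
 * NOT kernel code: it only shows why ~10000 deferred conntracks can pile
 * up per CPU before anything substantial is freed.
 */
#include <stdio.h>

#define BLIMIT   10      /* callbacks really freed per softirq pass      */
#define QHIMARK  10000   /* backlog that triggers a flush of everything  */

static long qlen;        /* conntracks queued via call_rcu(), per CPU    */

/* a burst of deleted conntracks handed to call_rcu() */
static void queue_deletions(long n)
{
	qlen += n;
}

/* one RCU softirq invocation, following the description above */
static void rcu_softirq_pass(void)
{
	long batch = (qlen > QHIMARK) ? qlen : BLIMIT;

	if (batch > qlen)
		batch = qlen;
	qlen -= batch;
	printf("freed %5ld, still pending %5ld\n", batch, qlen);
}

int main(void)
{
	int i;

	/* DoS-like load: 3000 deletions per tick, one softirq pass per tick */
	for (i = 0; i < 6; i++) {
		queue_deletions(3000);
		rcu_softirq_pass();
	}
	return 0;
}

Run once, this shows the pending count climbing by roughly 2990 per tick until
it crosses 10000 and is flushed in one go, which is exactly the window in which
the nf_conntrack slab keeps growing while the table reports itself full.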