From: Eric Dumazet
Subject: Re: [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free()
Date: Tue, 24 Mar 2009 13:25:33 +0100
To: Joakim Tjernlund
Cc: avorontsov@ru.mvista.com, Patrick McHardy, netdev@vger.kernel.org, "Paul E. McKenney"
Message-ID: <49C8D13D.10307@cosmosbay.com>
In-Reply-To: <49C8CCF4.5050104@cosmosbay.com>
References: <49C77D71.8090709@trash.net> <49C780AD.70704@trash.net> <49C7CB9B.1040409@trash.net> <49C8A415.1090606@cosmosbay.com> <49C8CCF4.5050104@cosmosbay.com>

Eric Dumazet wrote:
> Joakim Tjernlund wrote:
>> Eric Dumazet wrote on 24/03/2009 10:12:53:
>>> Joakim Tjernlund wrote:
>>>> Patrick McHardy wrote on 23/03/2009 18:49:15:
>>>>> Joakim Tjernlund wrote:
>>>>>> Patrick McHardy wrote on 23/03/2009 13:29:33:
>>>>>>
>>>>>>>> There is no /proc/net/netfilter/nf_conntrack. There is a
>>>>>>>> /proc/net/nf_conntrack though and it is empty. If I telnet
>>>>>>>> to the board I see:
>>>>>>>>
>>>>>>> That means that something is leaking conntrack references, most likely
>>>>>>> by leaking skbs. Since I haven't seen any other reports, my guess would
>>>>>>> be the ucc_geth driver.
>>>>>>>
>>>>>> Mucking around with the ucc_geth driver I found that if I:
>>>>>> - Move TX from IRQ to NAPI context
>>>>>> - double the weight.
>>>>>> - after booting up, wait a few mins until the JFFS2 GC kernel thread
>>>>>>   has stopped scanning the FS
>>>>>>
>>>>>> Then the "nf_conntrack: table full, dropping packet." msgs stop.
>>>>>> Does this seem right to you guys?
>>>>> No. As I said, something seems to be leaking packets. You should be
>>>>> able to confirm that by checking the sk_buff slabs in /proc/slabinfo.
>>>>> If that *doesn't* show any signs of a leak, please run "conntrack -E"
>>>>> to capture the conntrack events before the "table full" message
>>>>> appears and post the output.
>>>> skbuff does not differ much, but others do.
>>>>
>>>> Before ping:
>>>> skbuff_fclone_cache    0    0 352 11 1 : tunables  54 27 0 : slabdata   0   0 0
>>>> skbuff_head_cache     20   20 192 20 1 : tunables 120 60 0 : slabdata   1   1 0
>>>> size-64              731  767  64 59 1 : tunables 120 60 0 : slabdata  13  13 0
>>>> nf_conntrack          10   19 208 19 1 : tunables 120 60 0 : slabdata   1   1 0
>>>>
>>>> During ping:
>>>> skbuff_fclone_cache    0    0 352 11 1 : tunables  54 27 0 : slabdata   0   0 0
>>>> skbuff_head_cache     40   40 192 20 1 : tunables 120 60 0 : slabdata   2   2 0
>>>> size-64             8909 8909  64 59 1 : tunables 120 60 0 : slabdata 151 151 0
>>>> nf_conntrack        5111 5111 208 19 1 : tunables 120 60 0 : slabdata 269 269 0
>>>>
>>>> This feels more like the freeing of conntrack objects being delayed, so
>>>> they build up when ping flooding.
>>>>
>>>> Don't have "conntrack -E" for my embedded board so that will have to
>>>> wait a bit longer.
>>> I don't understand how your ping can use so many conntrack entries...
>>>
>>> Then, as I said yesterday, I believe you have an RCU delay, because of
>>> a misbehaving driver or something...
>>>
>>> grep RCU .config
>> grep RCU .config
>> # RCU Subsystem
>> CONFIG_CLASSIC_RCU=y
>> # CONFIG_TREE_RCU is not set
>> # CONFIG_PREEMPT_RCU is not set
>> # CONFIG_TREE_RCU_TRACE is not set
>> # CONFIG_PREEMPT_RCU_TRACE is not set
>> # CONFIG_RCU_TORTURE_TEST is not set
>> # CONFIG_RCU_CPU_STALL_DETECTOR is not set
>>
>>> grep CONFIG_SMP .config
>> grep CONFIG_SMP .config
>> # CONFIG_SMP is not set
>>
>>> You could change qhimark from 10000 to 1000 in kernel/rcuclassic.c (line 80)
>>> as a workaround. It should force a quiescent state after 1000 freed
>>> conntracks.
>>
>> Right, doing this killed almost all the conntrack messages; I had to stress
>> it pretty hard before I saw a handful of "nf_conntrack: table full,
>> dropping packet".
>>
>> RCU is not my cup of tea, do you have any idea where to look?
>
> In a stress situation, you feed more deleted conntracks to call_rcu() than
> blimit allows to be freed (10 real frees per RCU softirq invocation).
>
> So with the default qhimark of 10000, this means about 10000 conntracks
> can sit in RCU queues (per CPU) before being really freed.
>
> Only when hitting 10000 does RCU enter a special mode and free all queued
> items instead of a small batch of 10.
>
> To solve your problem we can:
>
> 1) reduce qhimark from 10000 to 1000 (for example).
>    This should probably be done anyway, to reduce the spikes in RCU code
>    when a whole batch of 10000 elements is freed at once...
> OR
> 2) change the conntrack tunable (max conntrack entries on your machine)
> OR
> 3) change net/netfilter/nf_conntrack_core.c to decrement net->ct.count
>    in nf_conntrack_free() instead of in the RCU callback.
>
> [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free()
>
> We use RCU to defer freeing of conntrack structures. In a DoS situation,
> RCU might accumulate about 10,000 elements per CPU in its internal queues.
> To get accurate conntrack counts (at the expense of slightly more RAM in
> use), we can make the conntrack counter not account for elements that are
> about to be freed and are waiting in RCU queues. We thus decrement it in
> nf_conntrack_free(), not in the RCU callback.
>
> Signed-off-by: Eric Dumazet
>
>
> diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> index f4935e3..6478dc7 100644
> --- a/net/netfilter/nf_conntrack_core.c
> +++ b/net/netfilter/nf_conntrack_core.c
> @@ -516,16 +516,17 @@ EXPORT_SYMBOL_GPL(nf_conntrack_alloc);
>  static void nf_conntrack_free_rcu(struct rcu_head *head)
>  {
>  	struct nf_conn *ct = container_of(head, struct nf_conn, rcu);
> -	struct net *net = nf_ct_net(ct);
>  
>  	nf_ct_ext_free(ct);
>  	kmem_cache_free(nf_conntrack_cachep, ct);
> -	atomic_dec(&net->ct.count);
>  }
>  
>  void nf_conntrack_free(struct nf_conn *ct)
>  {
> +	struct net *net = nf_ct_net(ct);
> +
>  	nf_ct_ext_destroy(ct);
> +	atomic_dec(&net->ct.count);
>  	call_rcu(&ct->rcu, nf_conntrack_free_rcu);
>  }
>  EXPORT_SYMBOL_GPL(nf_conntrack_free);

I forgot to say this is what we do for 'struct file' freeing as well.
We decrement nr_files in file_free(), not in file_free_rcu():

static inline void file_free_rcu(struct rcu_head *head)
{
	struct file *f = container_of(head, struct file, f_u.fu_rcuhead);

	put_cred(f->f_cred);
	kmem_cache_free(filp_cachep, f);
}

static inline void file_free(struct file *f)
{
	percpu_counter_dec(&nr_files);	<<<< HERE >>>>
	file_check_state(f);
	call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
}
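If it helps to see why the place of the decrement matters, here is a small
userspace sketch of the same accounting idea. It is not kernel code:
fake_call_rcu(), fake_rcu_flush() and the two counters are made-up names,
there is no real grace-period logic (a simple deferred-free list stands in
for the per-CPU RCU queue), and QUEUE_DEPTH plays the role of qhimark.

/*
 * Userspace sketch only: count_late is decremented in the deferred
 * "callback" (old conntrack behaviour), count_early at submission
 * time (patched behaviour).
 */
#include <stdio.h>
#include <stdlib.h>

#define QUEUE_DEPTH 10000		/* plays the role of qhimark */

struct obj {
	struct obj *next;
};

static struct obj *deferred;		/* queued, not yet reclaimed */
static int queued;
static int count_late;			/* decremented in the "callback" */
static int count_early;			/* decremented when freeing */

/* Queue an object for deferred reclaim, like call_rcu() would. */
static void fake_call_rcu(struct obj *o)
{
	o->next = deferred;
	deferred = o;
	queued++;
}

/* Reclaim everything queued; the "RCU callback" work happens here. */
static void fake_rcu_flush(void)
{
	while (deferred) {
		struct obj *o = deferred;

		deferred = o->next;
		free(o);
		count_late--;		/* old scheme: decrement here */
		queued--;
	}
}

static void obj_free(struct obj *o)
{
	count_early--;			/* new scheme: decrement right away */
	fake_call_rcu(o);
}

int main(void)
{
	int i;

	/* Allocate and immediately free QUEUE_DEPTH objects, as a flood
	 * of short-lived conntracks would. */
	for (i = 0; i < QUEUE_DEPTH; i++) {
		struct obj *o = malloc(sizeof(*o));

		if (!o)
			return 1;
		count_late++;
		count_early++;
		obj_free(o);
	}

	printf("queued=%d count_late=%d count_early=%d\n",
	       queued, count_late, count_early);

	fake_rcu_flush();

	printf("after flush: count_late=%d count_early=%d\n",
	       count_late, count_early);
	return 0;
}

Before the flush runs, count_late still reports all 10000 objects as in use
even though every one of them has already been "freed", which is exactly the
state where conntrack reports "table full"; count_early already reads 0. The
only cost of the early decrement is that, for a short while, the memory
actually in flight can exceed what the counter suggests by up to one queue
worth of objects.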