From mboxrd@z Thu Jan 1 00:00:00 1970 From: Eric Dumazet Subject: Re: [bug] __nf_ct_refresh_acct(): WARNING: at lib/list_debug.c:30 __list_add+0x7d/0xad() Date: Thu, 18 Jun 2009 17:46:11 +0200 Message-ID: <4A3A6143.3040607@gmail.com> References: <20090615.050449.144947903.davem@davemloft.net> <20090616091538.GA4184@elte.hu> <20090616.034752.226811527.davem@davemloft.net> <20090616105304.GA3579@elte.hu> <20090616122415.GA16630@elte.hu> <20090617092152.GA17449@elte.hu> <4A38C2F3.3000009@gmail.com> <20090617110803.GA10175@elte.hu> <20090618052356.GA18722@elte.hu> <4A39D778.9020607@cosmosbay.com> <4A3A0D45.8090806@trash.net> <4A3A5599.4080906@trash.net> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-15 Content-Transfer-Encoding: QUOTED-PRINTABLE Cc: Ingo Molnar , David Miller , Thomas Gleixner , torvalds@linux-foundation.org, akpm@linux-foundation.org, netdev@vger.kernel.org, linux-kernel@vger.kernel.org, Pablo Neira Ayuso To: Patrick McHardy Return-path: Received: from gw1.cosmosbay.com ([212.99.114.194]:42475 "EHLO gw1.cosmosbay.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752638AbZFRPqf (ORCPT ); Thu, 18 Jun 2009 11:46:35 -0400 In-Reply-To: <4A3A5599.4080906@trash.net> Sender: netdev-owner@vger.kernel.org List-ID: Patrick McHardy a =E9crit : > Patrick McHardy wrote: >> Eric Dumazet wrote: >>>> a test-box still triggered this crash overnight: >>>> >>>> [ 252.433471] ------------[ cut here ]------------ >>>> [ 252.436031] WARNING: at lib/list_debug.c:30 __list_add+0x95/0xa= 0() >>>> ... >>>> With your patch (repeated below) applied. Is Patrick's alternative >>>> patch supposed to fix something that yours does not? >>> >>> Hmm, not really, Patrick patch should fix same problem, but without >>> extra locking >>> as mine. >>> >>> This new stack trace is somewhat different, as corruption is detect= ed >>> in the add_timer() >>> call in __nf_conntrack_confirm() >>> >>> So there is another problem. CCed Pablo Neira Ayuso who added some = stuff >>> in netfilter and timeout logic recently. >> >> That timeout logic shouldn't be relevant in this case, its only >> activated when netlink event delivery is used, a userspace process >> is actually listening and it has the socket marked for reliable >> delivery. >> >> I think its still the same problem, the detection is just noticed >> at a different point. >=20 > I can't find anything wrong with the conntrack timer use itself. > Since its the timer base that is corrupted, its possible that > something else is using timers incorrectly and conntrack just > happens to hit the problem afterwards, but it seems rather > unlikely since the first backtrace originated in a network driver, > the second one in the socket layer. >=20 > But I've noticed a different problem, the lockless lookup might > incorrectly return an conntrack entry that is supposed to be > invisible. This can happen because we only check for equal tuples > and !atomic_inc_not_zero(refcnt), but the conntrack can be removed > from the lists and still have references to it. It should never > have an active timer at that point however, so all following > mod_timer_pending() calls *should* do nothing unless I'm missing > something. >=20 > Ingo, could you please try whether this patch (combined with the > last one) makes any difference? Enabling CONFIG_NETFILTER_DEBUG > could also help. >=20 >=20 Interesting ! In my own analysis, I found death_by_timeout() might be problematic, with RCU and lockless lookups. static void death_by_timeout(unsigned long ul_conntrack) { struct nf_conn *ct =3D (void *)ul_conntrack; if (!test_bit(IPS_DYING_BIT, &ct->status) && unlikely(nf_conntrack_event(IPCT_DESTROY, ct) < 0)) { /* destroy event was not delivered */ nf_ct_delete_from_lists(ct); << HERE >> nf_ct_insert_dying_list(ct); return; } set_bit(IPS_DYING_BIT, &ct->status); nf_ct_delete_from_lists(ct); nf_ct_put(ct); } We delete ct from a list and insert it in a new list. I believe a reader could "*catch*" ct while doing a lookup and miss the= end of its chain. (nulls algo check the null value at the end of lookup and= can decide to restart the lookup if the null value is not the expected one) We need to change nf_conntrack_init_net() and use a different "null" va= lue, guaranteed not being used in regular lists Patch follows : diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_connt= rack_core.c index 5f72b94..5276a2d 100644 --- a/net/netfilter/nf_conntrack_core.c +++ b/net/netfilter/nf_conntrack_core.c @@ -1267,13 +1267,19 @@ err_cache: return ret; } =20 +/* + * We need to use special "null" values, not used in hash table + */ +#define UNCONFIRMED_NULLS_VAL ((1<<30)+0) +#define DYING_NULLS_VAL ((1<<30)+1) + static int nf_conntrack_init_net(struct net *net) { int ret; =20 atomic_set(&net->ct.count, 0); - INIT_HLIST_NULLS_HEAD(&net->ct.unconfirmed, 0); - INIT_HLIST_NULLS_HEAD(&net->ct.dying, 0); + INIT_HLIST_NULLS_HEAD(&net->ct.unconfirmed, UNCONFIRMED_NULLS_VAL); + INIT_HLIST_NULLS_HEAD(&net->ct.dying, DYING_NULLS_VAL); net->ct.stat =3D alloc_percpu(struct ip_conntrack_stat); if (!net->ct.stat) { ret =3D -ENOMEM;