* Kernel panic nf_nat_setup_info+0x5b3/0x6e0 @ 2011-02-23 17:07 "Oleg A. Arkhangelsky" 2011-03-02 11:37 ` Patrick McHardy 0 siblings, 1 reply; 12+ messages in thread From: "Oleg A. Arkhangelsky" @ 2011-02-23 17:07 UTC (permalink / raw) To: netfilter-devel; +Cc: netdev Hello, Got this panic yesterday: http://www.progtech.ru/~oleg/crash.txt The offending instruction is: cmpb 54(%edx), %cl # <variable>.tuple.dst.protonum, and here is the assembler code of net/ipv4/netfilter/nf_nat_core.c: http://www.progtech.ru/~oleg/nf_nat_core.s Quick investigation lead me to conclusion that the problem is in return of same_src function: return (t->dst.protonum == tuple->dst.protonum && t->src.u3.ip == tuple->src.u3.ip && t->src.u.all == tuple->src.u.all); So either t or tuple pointer is bad, but I don't understand how this can be. Looks like the similar situation described here: https://bugzilla.kernel.org/show_bug.cgi?id=21512 Any thoughts on this? Thank you! -- wbr, Oleg. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Kernel panic nf_nat_setup_info+0x5b3/0x6e0 2011-02-23 17:07 Kernel panic nf_nat_setup_info+0x5b3/0x6e0 "Oleg A. Arkhangelsky" @ 2011-03-02 11:37 ` Patrick McHardy 2011-03-02 14:37 ` Changli Gao 0 siblings, 1 reply; 12+ messages in thread From: Patrick McHardy @ 2011-03-02 11:37 UTC (permalink / raw) To: Oleg A. Arkhangelsky; +Cc: netfilter-devel, netdev Am 23.02.2011 18:07, schrieb "Oleg A. Arkhangelsky": > Hello, > > Got this panic yesterday: > http://www.progtech.ru/~oleg/crash.txt > > The offending instruction is: > cmpb 54(%edx), %cl # <variable>.tuple.dst.protonum, > > and here is the assembler code of net/ipv4/netfilter/nf_nat_core.c: > http://www.progtech.ru/~oleg/nf_nat_core.s > > Quick investigation lead me to conclusion that the problem is in > return of same_src function: > > return (t->dst.protonum == tuple->dst.protonum && > t->src.u3.ip == tuple->src.u3.ip && > t->src.u.all == tuple->src.u.all); > > So either t or tuple pointer is bad, but I don't understand how > this can be. I'm not sure myself, I'm guessing it has something to do with reallocation of the NAT extension area. Please post your full ruleset and any helpers in use. > [2971152.752502] Pid: 0, comm: swapper Not tainted (2.6.32.25-pt #1) Also please try whether the problem still happens with the current kernel version. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Kernel panic nf_nat_setup_info+0x5b3/0x6e0 2011-03-02 11:37 ` Patrick McHardy @ 2011-03-02 14:37 ` Changli Gao 2011-03-02 19:50 ` "Oleg A. Arkhangelsky" 0 siblings, 1 reply; 12+ messages in thread From: Changli Gao @ 2011-03-02 14:37 UTC (permalink / raw) To: Patrick McHardy Cc: Oleg A. Arkhangelsky, netfilter-devel, netdev, Paul E McKenney On Wed, Mar 2, 2011 at 7:37 PM, Patrick McHardy <kaber@trash.net> wrote: > Am 23.02.2011 18:07, schrieb "Oleg A. Arkhangelsky": >> Hello, >> >> Got this panic yesterday: >> http://www.progtech.ru/~oleg/crash.txt >> >> The offending instruction is: >> cmpb 54(%edx), %cl # <variable>.tuple.dst.protonum, >> >> and here is the assembler code of net/ipv4/netfilter/nf_nat_core.c: >> http://www.progtech.ru/~oleg/nf_nat_core.s >> >> Quick investigation lead me to conclusion that the problem is in >> return of same_src function: >> >> return (t->dst.protonum == tuple->dst.protonum && >> t->src.u3.ip == tuple->src.u3.ip && >> t->src.u.all == tuple->src.u.all); >> >> So either t or tuple pointer is bad, but I don't understand how >> this can be. > t should be NULL here, as offsetof(struct nf_conn, dst.protonum) == 0x36. We should free the nf_ct_extend with call_rcu(), since nat ext is referenced in the rcu read context. 717 void nf_conntrack_free(struct nf_conn *ct) 718 { 719 struct net *net = nf_ct_net(ct); 720 721 nf_ct_ext_destroy(ct); 722 atomic_dec(&net->ct.count); 723 nf_ct_ext_free(ct); if (ct->ext) call_rcu(&ct->ext->rcu, __nf_ct_ext_free_rcu); 724 kmem_cache_free(net->ct.nf_conntrack_cachep, ct); 725 } 726 EXPORT_SYMBOL_GPL(nf_conntrack_free); -- Regards, Changli Gao(xiaosuo@gmail.com) -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Kernel panic nf_nat_setup_info+0x5b3/0x6e0 2011-03-02 14:37 ` Changli Gao @ 2011-03-02 19:50 ` "Oleg A. Arkhangelsky" 2011-03-03 7:33 ` Changli Gao 0 siblings, 1 reply; 12+ messages in thread From: "Oleg A. Arkhangelsky" @ 2011-03-02 19:50 UTC (permalink / raw) To: Changli Gao; +Cc: Patrick McHardy, netfilter-devel, netdev, Paul E McKenney 02.03.2011, 17:37, "Changli Gao" <xiaosuo@gmail.com>: > t should be NULL here, as offsetof(struct nf_conn, dst.protonum) == 0x36. > We should free the nf_ct_extend with call_rcu(), since nat ext is > referenced in the rcu read context. Yes, I think the problem is triggered when nf_conntrack_free() is called by different CPU during net->ipv4.nat_bysource hash traversal. Extensions framework doesn't have any SLAB_DESTROY_BY_RCU magic. I'm not sure, but couldn't this problem be introduced by: ea781f197d6a835cbb93a0bf88ee1696296ed8aa netfilter: nf_conntrack: use SLAB_DESTROY_BY_RCU and get rid of call_rcu() ? -- wbr, Oleg. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Kernel panic nf_nat_setup_info+0x5b3/0x6e0 2011-03-02 19:50 ` "Oleg A. Arkhangelsky" @ 2011-03-03 7:33 ` Changli Gao 2011-03-26 15:44 ` Changli Gao 0 siblings, 1 reply; 12+ messages in thread From: Changli Gao @ 2011-03-03 7:33 UTC (permalink / raw) To: Oleg A. Arkhangelsky Cc: Patrick McHardy, netfilter-devel, netdev, Paul E McKenney [-- Attachment #1: Type: text/plain, Size: 1461 bytes --] On Thu, Mar 3, 2011 at 3:50 AM, "Oleg A. Arkhangelsky" <sysoleg@yandex.ru> wrote: > 02.03.2011, 17:37, "Changli Gao" <xiaosuo@gmail.com>: > >> t should be NULL here, as offsetof(struct nf_conn, dst.protonum) == 0x36. >> We should free the nf_ct_extend with call_rcu(), since nat ext is >> referenced in the rcu read context. > > Yes, I think the problem is triggered when nf_conntrack_free() is called by > different CPU during net->ipv4.nat_bysource hash traversal. Extensions > framework doesn't have any SLAB_DESTROY_BY_RCU magic. > > I'm not sure, but couldn't this problem be introduced by: > > ea781f197d6a835cbb93a0bf88ee1696296ed8aa > netfilter: nf_conntrack: use SLAB_DESTROY_BY_RCU and get rid of call_rcu() > > ? > There is nothing to do with SLAB_DESTROY_BY_RCU. Here is the comment for this flag: /* * SLAB_DESTROY_BY_RCU - **WARNING** READ THIS! * * This delays freeing the SLAB page by a grace period, it does _NOT_ * delay object freeing. This means that if you do kmem_cache_free() * that memory location is free to be reused at any time. Thus it may * be possible to see another object there in the same RCU grace period. * * This feature only ensures the memory location backing the object * stays valid, the trick to using this is relying on an independent * object validation pass. Something like: ... Please try the patch attached and test if the problem is solved or not. Thanks. -- Regards, Changli Gao(xiaosuo@gmail.com) [-- Attachment #2: nf_ext_rcu.diff --] [-- Type: text/plain, Size: 1384 bytes --] diff --git a/include/net/netfilter/nf_conntrack_extend.h b/include/net/netfilter/nf_conntrack_extend.h index 2dcf317..354cccb9 100644 --- a/include/net/netfilter/nf_conntrack_extend.h +++ b/include/net/netfilter/nf_conntrack_extend.h @@ -66,13 +66,15 @@ static inline void nf_ct_ext_destroy(struct nf_conn *ct) __nf_ct_ext_destroy(ct); } +void __nf_ct_ext_free_rcu(struct rcu_head *head); + /* Free operation. If you want to free a object referred from private area, * please implement __nf_ct_ext_free() and call it. */ static inline void nf_ct_ext_free(struct nf_conn *ct) { if (ct->ext) - kfree(ct->ext); + call_rcu(&ct->ext->rcu, __nf_ct_ext_free_rcu); } /* Add this type, returns pointer to data or NULL. */ diff --git a/net/netfilter/nf_conntrack_extend.c b/net/netfilter/nf_conntrack_extend.c index 80a23ed..3a47b76 100644 --- a/net/netfilter/nf_conntrack_extend.c +++ b/net/netfilter/nf_conntrack_extend.c @@ -68,11 +68,12 @@ nf_ct_ext_create(struct nf_ct_ext **ext, enum nf_ct_ext_id id, gfp_t gfp) return (void *)(*ext) + off; } -static void __nf_ct_ext_free_rcu(struct rcu_head *head) +void __nf_ct_ext_free_rcu(struct rcu_head *head) { struct nf_ct_ext *ext = container_of(head, struct nf_ct_ext, rcu); kfree(ext); } +EXPORT_SYMBOL_GPL(__nf_ct_ext_free_rcu); void *__nf_ct_ext_add(struct nf_conn *ct, enum nf_ct_ext_id id, gfp_t gfp) { ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: Kernel panic nf_nat_setup_info+0x5b3/0x6e0 2011-03-03 7:33 ` Changli Gao @ 2011-03-26 15:44 ` Changli Gao 2011-03-31 14:03 ` "Oleg A. Arkhangelsky" 0 siblings, 1 reply; 12+ messages in thread From: Changli Gao @ 2011-03-26 15:44 UTC (permalink / raw) To: Oleg A. Arkhangelsky Cc: Patrick McHardy, netfilter-devel, netdev, Paul E McKenney On Thu, Mar 3, 2011 at 3:33 PM, Changli Gao <xiaosuo@gmail.com> wrote: > Please try the patch attached and test if the problem is solved or not. Thanks. > Any feedback? Thanks. -- Regards, Changli Gao(xiaosuo@gmail.com) ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Kernel panic nf_nat_setup_info+0x5b3/0x6e0 2011-03-26 15:44 ` Changli Gao @ 2011-03-31 14:03 ` "Oleg A. Arkhangelsky" 2011-03-31 14:47 ` Eric Dumazet 0 siblings, 1 reply; 12+ messages in thread From: "Oleg A. Arkhangelsky" @ 2011-03-31 14:03 UTC (permalink / raw) To: Changli Gao; +Cc: Patrick McHardy, netfilter-devel, netdev, Paul E McKenney 26.03.2011, 16:44, "Changli Gao" <xiaosuo@gmail.com>: > On Thu, Mar 3, 2011 at 3:33 PM, Changli Gao <xiaosuo@gmail.com>; wrote: > >> Please try the patch attached and test if the problem is solved or not. Thanks. > > Any feedback? Thanks. > Seems that patch is fine. https://bugzilla.kernel.org/show_bug.cgi?id=21512 -- wbr, Oleg. -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Kernel panic nf_nat_setup_info+0x5b3/0x6e0 2011-03-31 14:03 ` "Oleg A. Arkhangelsky" @ 2011-03-31 14:47 ` Eric Dumazet 2011-04-01 2:02 ` Changli Gao 2011-04-05 11:49 ` Patrick McHardy 0 siblings, 2 replies; 12+ messages in thread From: Eric Dumazet @ 2011-03-31 14:47 UTC (permalink / raw) To: "Oleg A. Arkhangelsky" Cc: Changli Gao, Patrick McHardy, netfilter-devel, netdev, Paul E McKenney Le jeudi 31 mars 2011 à 18:03 +0400, "Oleg A. Arkhangelsky" a écrit : > > 26.03.2011, 16:44, "Changli Gao" <xiaosuo@gmail.com>: > > On Thu, Mar 3, 2011 at 3:33 PM, Changli Gao <xiaosuo@gmail.com>; wrote: > > > >> Please try the patch attached and test if the problem is solved or not. Thanks. > > > > Any feedback? Thanks. > > > > Seems that patch is fine. > > https://bugzilla.kernel.org/show_bug.cgi?id=21512 > I wonder if this is not hiding another bug. Adding an RCU grace period might reduce the probability window. By the time nf_conntrack_free(ct) is called, no other cpu/thread could/should use ct, or ct->ext ? Sure, another thread can find/pass_on ct in a lookup but should not use it, since its refcount (ct_general.use) should be 0. Patrick ? ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Kernel panic nf_nat_setup_info+0x5b3/0x6e0 2011-03-31 14:47 ` Eric Dumazet @ 2011-04-01 2:02 ` Changli Gao 2011-04-01 5:30 ` Eric Dumazet 2011-04-05 11:49 ` Patrick McHardy 1 sibling, 1 reply; 12+ messages in thread From: Changli Gao @ 2011-04-01 2:02 UTC (permalink / raw) To: Eric Dumazet Cc: "Oleg A. Arkhangelsky", Patrick McHardy, netfilter-devel, netdev, Paul E McKenney On Thu, Mar 31, 2011 at 10:47 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > > I wonder if this is not hiding another bug. > > Adding an RCU grace period might reduce the probability window. What bug? This one? > > By the time nf_conntrack_free(ct) is called, no other cpu/thread > could/should use ct, or ct->ext ? nat->ct may refer it. > > Sure, another thread can find/pass_on ct in a lookup but should not use > it, since its refcount (ct_general.use) should be 0. > > As SLAB_DESTROY_BY_RCU is used, we should validate ct->orig_tuple too. There is another minor problem. nf_nat_core.c: 133 rcu_read_lock(); 134 hlist_for_each_entry_rcu(nat, n, &net->ipv4.nat_bysource[h], bysourc e) { 135 ct = nat->ct; 136 if (same_src(ct, tuple) && nf_ct_zone(ct) == zone) { 137 /* Copy source part from reply tuple. */ 138 nf_ct_invert_tuplepr(result, 139 &ct->tuplehash[IP_CT_DIR_REPLY].tuple ); 140 result->dst = tuple->dst; 141 142 if (in_range(result, range)) { 143 rcu_read_unlock(); 144 return 1; 145 } 146 } 147 } 148 rcu_read_unlock(); If the ct is reused, NAT mapping will be wrong. It isn't a serious problem, and can't be fixed, even though we check the reference counter before using it, but we can't validate it with the original tuple. Maybe we can do it in this way ct = nat->ct; if (!nf_ct_is_dying(ct) && atomic_inc_not_zero(&ct->ct_general.use))) { if (nf_ct_ext_find(ct, NF_CT_EXT_NAT) == nat) { /* ct is valid, do sth... */ } nf_ct_put(ct); } I think two additional atomic operations are expensive. It isn't a good idea. -- Regards, Changli Gao(xiaosuo@gmail.com) ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Kernel panic nf_nat_setup_info+0x5b3/0x6e0 2011-04-01 2:02 ` Changli Gao @ 2011-04-01 5:30 ` Eric Dumazet 0 siblings, 0 replies; 12+ messages in thread From: Eric Dumazet @ 2011-04-01 5:30 UTC (permalink / raw) To: Changli Gao Cc: "Oleg A. Arkhangelsky", Patrick McHardy, netfilter-devel, netdev, Paul E McKenney Le vendredi 01 avril 2011 à 10:02 +0800, Changli Gao a écrit : > On Thu, Mar 31, 2011 at 10:47 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > > > > I wonder if this is not hiding another bug. > > > > Adding an RCU grace period might reduce the probability window. > > What bug? This one? > Yes. I am saying your patch is a brown paper bag. Real bug is elsewhere, and we must find it, not make it happen less frequently. Maybe its a missing barrier, and adding a full RCU grace period is a waste (of cpu caches space, since call_rcu() fill a potential big list) > > > > By the time nf_conntrack_free(ct) is called, no other cpu/thread > > could/should use ct, or ct->ext ? > > nat->ct may refer it. > conntrack must have a real problem of refcounting then. Each time an object A has a reference to object B, we should increase B refcount, so B cannot disappear. nat->ct _cannot_ refer ct if we are freeing ct. Thats quite simple rule. > > > > Sure, another thread can find/pass_on ct in a lookup but should not use > > it, since its refcount (ct_general.use) should be 0. > > > > > > As SLAB_DESTROY_BY_RCU is used, we should validate ct->orig_tuple too. > > There is another minor problem. > minor or serious ? > nf_nat_core.c: > > 133 rcu_read_lock(); > 134 hlist_for_each_entry_rcu(nat, n, > &net->ipv4.nat_bysource[h], bysourc e) { > 135 ct = nat->ct; > 136 if (same_src(ct, tuple) && nf_ct_zone(ct) == zone) { > 137 /* Copy source part from reply tuple. */ > 138 nf_ct_invert_tuplepr(result, > 139 > &ct->tuplehash[IP_CT_DIR_REPLY].tuple ); > 140 result->dst = tuple->dst; > 141 > 142 if (in_range(result, range)) { > 143 rcu_read_unlock(); > 144 return 1; > 145 } > 146 } > 147 } > 148 rcu_read_unlock(); > > If the ct is reused, NAT mapping will be wrong. It isn't a serious > problem, and can't be fixed, even though we check the reference > counter before using it, but we can't validate it with the original > tuple. > I call this a serious problem. netfilter is not a fuzzy logic. > Maybe we can do it in this way > > ct = nat->ct; > if (!nf_ct_is_dying(ct) && > atomic_inc_not_zero(&ct->ct_general.use))) { > if (nf_ct_ext_find(ct, NF_CT_EXT_NAT) == nat) { > /* ct is valid, do sth... */ > } > nf_ct_put(ct); > } > > I think two additional atomic operations are expensive. It isn't a good idea. > It depends. This is better than taking the conntrack lock. SLAB_DESTROY_BY_RCU is not allowing for fuzzy logic. Rules are we _must_ take a reference on object after finding it, and _recheck_ validity of the object before using it. Another way to avoid atomic operations in find_appropriate_src() is to use a seqcount (or seqlock), and change the seqcount every time something is changed in ct. -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Kernel panic nf_nat_setup_info+0x5b3/0x6e0 2011-03-31 14:47 ` Eric Dumazet 2011-04-01 2:02 ` Changli Gao @ 2011-04-05 11:49 ` Patrick McHardy 2011-05-21 15:42 ` Changli Gao 1 sibling, 1 reply; 12+ messages in thread From: Patrick McHardy @ 2011-04-05 11:49 UTC (permalink / raw) To: Eric Dumazet Cc: Oleg A. Arkhangelsky, Changli Gao, netfilter-devel, netdev, Paul E McKenney On 31.03.2011 16:47, Eric Dumazet wrote: > Le jeudi 31 mars 2011 à 18:03 +0400, "Oleg A. Arkhangelsky" a écrit : >> >> 26.03.2011, 16:44, "Changli Gao" <xiaosuo@gmail.com>: >>> On Thu, Mar 3, 2011 at 3:33 PM, Changli Gao <xiaosuo@gmail.com>; wrote: >>> >>>> Please try the patch attached and test if the problem is solved or not. Thanks. >>> >>> Any feedback? Thanks. >>> >> >> Seems that patch is fine. >> >> https://bugzilla.kernel.org/show_bug.cgi?id=21512 >> > > I wonder if this is not hiding another bug. > > Adding an RCU grace period might reduce the probability window. > > By the time nf_conntrack_free(ct) is called, no other cpu/thread > could/should use ct, or ct->ext ? > > Sure, another thread can find/pass_on ct in a lookup but should not use > it, since its refcount (ct_general.use) should be 0. > > Patrick ? I think what's happening is that the conntrack entry is destroyed and the NAT ct_extend destructor invoked, which removes the nat extension from the RCU protected bysource hash, after which the entire extension area is freed. Another CPU might still find the old NAT entry with undefined contents in the hash though, so I think using RCU to free the extension area is correct. -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Kernel panic nf_nat_setup_info+0x5b3/0x6e0 2011-04-05 11:49 ` Patrick McHardy @ 2011-05-21 15:42 ` Changli Gao 0 siblings, 0 replies; 12+ messages in thread From: Changli Gao @ 2011-05-21 15:42 UTC (permalink / raw) To: Patrick McHardy Cc: Eric Dumazet, Oleg A. Arkhangelsky, netfilter-devel, netdev, Paul E McKenney On Tue, Apr 5, 2011 at 7:49 PM, Patrick McHardy <kaber@trash.net> wrote: > > I think what's happening is that the conntrack entry is destroyed > and the NAT ct_extend destructor invoked, which removes the nat > extension from the RCU protected bysource hash, after which the > entire extension area is freed. Another CPU might still find the > old NAT entry with undefined contents in the hash though, so I > think using RCU to free the extension area is correct. > What is the conclusion? Is my patch acceptable? Thanks. -- Regards, Changli Gao(xiaosuo@gmail.com) ^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2011-05-21 15:58 UTC | newest] Thread overview: 12+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2011-02-23 17:07 Kernel panic nf_nat_setup_info+0x5b3/0x6e0 "Oleg A. Arkhangelsky" 2011-03-02 11:37 ` Patrick McHardy 2011-03-02 14:37 ` Changli Gao 2011-03-02 19:50 ` "Oleg A. Arkhangelsky" 2011-03-03 7:33 ` Changli Gao 2011-03-26 15:44 ` Changli Gao 2011-03-31 14:03 ` "Oleg A. Arkhangelsky" 2011-03-31 14:47 ` Eric Dumazet 2011-04-01 2:02 ` Changli Gao 2011-04-01 5:30 ` Eric Dumazet 2011-04-05 11:49 ` Patrick McHardy 2011-05-21 15:42 ` Changli Gao
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).