* Fwd: Re: [BUG] Fatal exception in interrupt - nf_nat_cleanup_conntrack during IPv6 tests
From: Florian Westphal @ 2013-04-10 9:04 UTC
To: netfilter-devel

----- Forwarded message -----

Subject: Re: [BUG] Fatal exception in interrupt - nf_nat_cleanup_conntrack during IPv6 tests

CAI Qian <caiqian@redhat.com> wrote:

[ CC'd nf-devel ]

> Just hit this very often during IPv6 tests in both the latest stable
> and mainline kernel.
>
> [ 3597.206166] Modules linked in:
[..]
> nf_nat_ipv4(F-)
[..]

> [ 3597.804861] RIP: 0010:[<ffffffffa03227f2>] [<ffffffffa03227f2>] nf_nat_cleanup_conntrack+0x42/0x70 [nf_nat]
> [ 3597.855207] RSP: 0018:ffff880202c63d40 EFLAGS: 00010246
> [ 3597.881350] RAX: 0000000000000000 RBX: ffff8801ac7bec28 RCX: ffff8801d0eedbe0
> [ 3597.917226] RDX: dead000000200200 RSI: 0000000000000011 RDI: ffffffffa03265b8
[..]

> [ 3598.421036] <IRQ>
> [ 3598.430467] [<ffffffffa0305bb4>] __nf_ct_ext_destroy+0x44/0x60 [nf_conntrack]
> [ 3598.499191] [<ffffffffa02fd3fe>] nf_conntrack_free+0x2e/0x70 [nf_conntrack]
> [ 3598.534121] [<ffffffffa02febed>] destroy_conntrack+0xbd/0x110 [nf_conntrack]
> [ 3598.569981] [<ffffffff81532187>] nf_conntrack_destroy+0x17/0x20
> [ 3598.599579] [<ffffffffa02fe77c>] death_by_timeout+0xdc/0x1b0 [nf_conntrack]
[..]
> [ 3599.241868] Code: 83 ec 08 0f b6 58 11 84 db 74 43 48 01 c3 48 83 7b 20 00 74 39 48 c7 c7 b8 65 32 a0 e8 98 fc 2e e1 48 8b 03 48 8b 53 08 48 85 c0 <48> 89 02 74 04 48 89 50 08 48 ba 00 02 20 00 00 00 ad de 48 c7
> [ 3599.337037] RIP [<ffffffffa03227f2>] nf_nat_cleanup_conntrack+0x42/0x70 [nf_nat]

Looks like we tried to remove bysource hash twice (rdx is
LIST_POISON_2).

I wonder if this would explain it:

static void nf_nat_l4proto_clean(u8 l3proto, u8 l4proto)
{
[..]
	/* Step 1 - remove from bysource hash */
	clean.hash = true;
	for_each_net(net)
		nf_ct_iterate_cleanup(net, nf_nat_proto_clean, &clean);

A nfct->timer fires and a conntrack is free'd before step 2 memsets the
nat extension. In that case, we would try to delete nat->bysource
again?

----- End forwarded message -----

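The double-removal theory in the message above can be illustrated with a small userspace model. This is only a sketch, not kernel code: the struct and function names (hnode, hhead, model_add_head, model_del) are hypothetical, and the poison constant is simply the RDX value from the oops, on the assumption that it is what the kernel stores in pprev when a node is unlinked. The model returns false where the kernel would write through the poisoned pointer and fault.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace model of a hash-list node; NOT the kernel's list.h. */
struct hnode {
	struct hnode *next;
	struct hnode **pprev;	/* address of the pointer that points at us */
};

struct hhead {
	struct hnode *first;
};

/* Assumption: same value as RDX in the oops (pprev poison on x86_64). */
#define MODEL_POISON2 ((struct hnode **)(uintptr_t)0xdead000000200200ULL)

/* Link n at the front of the bucket h. */
static void model_add_head(struct hnode *n, struct hhead *h)
{
	n->next = h->first;
	if (h->first)
		h->first->pprev = &n->next;
	h->first = n;
	n->pprev = &h->first;
}

/*
 * Unlink n and poison pprev. A second call on the same node finds the
 * poison; the real code has no such check and would execute
 * "*pprev = next" through the poison address, which is the faulting
 * store marked <48> 89 02 in the Code: line of the oops.
 */
static bool model_del(struct hnode *n)
{
	if (n->pprev == MODEL_POISON2)
		return false;	/* already unlinked: the kernel would oops here */

	*n->pprev = n->next;
	if (n->next)
		n->next->pprev = n->pprev;
	n->pprev = MODEL_POISON2;
	return true;
}
```

In the scenario discussed in this thread, the first unlink would correspond to step 1 of nf_nat_l4proto_clean() and the second to nf_nat_cleanup_conntrack() running from the conntrack timer before the extension has been cleared.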
* Re: Fwd: Re: [BUG] Fatal exception in interrupt - nf_nat_cleanup_conntrack during IPv6 tests
From: Patrick McHardy @ 2013-04-10 9:23 UTC
To: Florian Westphal; +Cc: netfilter-devel

On Wed, Apr 10, 2013 at 11:04:36AM +0200, Florian Westphal wrote:
> ----- Forwarded message -----
>
> Subject: Re: [BUG] Fatal exception in interrupt - nf_nat_cleanup_conntrack during IPv6 tests
>
> CAI Qian <caiqian@redhat.com> wrote:
>
> [ CC'd nf-devel ]
>
> > Just hit this very often during IPv6 tests in both the latest stable
> > and mainline kernel.
> >
> > [ 3597.206166] Modules linked in:
> [..]
> > nf_nat_ipv4(F-)
> [..]
>
> > [ 3597.804861] RIP: 0010:[<ffffffffa03227f2>] [<ffffffffa03227f2>] nf_nat_cleanup_conntrack+0x42/0x70 [nf_nat]
> > [ 3597.855207] RSP: 0018:ffff880202c63d40 EFLAGS: 00010246
> > [ 3597.881350] RAX: 0000000000000000 RBX: ffff8801ac7bec28 RCX: ffff8801d0eedbe0
> > [ 3597.917226] RDX: dead000000200200 RSI: 0000000000000011 RDI: ffffffffa03265b8
> [..]
>
> > [ 3598.421036] <IRQ>
> > [ 3598.430467] [<ffffffffa0305bb4>] __nf_ct_ext_destroy+0x44/0x60 [nf_conntrack]
> > [ 3598.499191] [<ffffffffa02fd3fe>] nf_conntrack_free+0x2e/0x70 [nf_conntrack]
> > [ 3598.534121] [<ffffffffa02febed>] destroy_conntrack+0xbd/0x110 [nf_conntrack]
> > [ 3598.569981] [<ffffffff81532187>] nf_conntrack_destroy+0x17/0x20
> > [ 3598.599579] [<ffffffffa02fe77c>] death_by_timeout+0xdc/0x1b0 [nf_conntrack]
> [..]
> > [ 3599.241868] Code: 83 ec 08 0f b6 58 11 84 db 74 43 48 01 c3 48 83 7b 20 00 74 39 48 c7 c7 b8 65 32 a0 e8 98 fc 2e e1 48 8b 03 48 8b 53 08 48 85 c0 <48> 89 02 74 04 48 89 50 08 48 ba 00 02 20 00 00 00 ad de 48 c7
> > [ 3599.337037] RIP [<ffffffffa03227f2>] nf_nat_cleanup_conntrack+0x42/0x70 [nf_nat]
>
> Looks like we tried to remove bysource hash twice (rdx is
> LIST_POISON_2).
>
> I wonder if this would explain it:
>
> static void nf_nat_l4proto_clean(u8 l3proto, u8 l4proto)
> {
> [..]
> 	/* Step 1 - remove from bysource hash */
> 	clean.hash = true;
> 	for_each_net(net)
> 		nf_ct_iterate_cleanup(net, nf_nat_proto_clean, &clean);
>
> A nfct->timer fires and a conntrack is free'd before step 2 memsets the
> nat extension. In that case, we would try to delete nat->bysource
> again?

Not sure I follow, we only invoke nf_nat_l4proto_clean() through
nf_nat_l4proto_unregister(), right?

Did this happen during module unload?

* Re: Fwd: Re: [BUG] Fatal exception in interrupt - nf_nat_cleanup_conntrack during IPv6 tests
From: Florian Westphal @ 2013-04-10 9:32 UTC
To: Patrick McHardy; +Cc: Florian Westphal, netfilter-devel, caiqian

Patrick McHardy <kaber@trash.net> wrote:
> On Wed, Apr 10, 2013 at 11:04:36AM +0200, Florian Westphal wrote:
> > Subject: Re: [BUG] Fatal exception in interrupt - nf_nat_cleanup_conntrack during IPv6 tests
> > CAI Qian <caiqian@redhat.com> wrote:
> > > Just hit this very often during IPv6 tests in both the latest stable
> > > and mainline kernel.
> > >
> > > [ 3597.206166] Modules linked in:
> > [..]
> > > nf_nat_ipv4(F-)
> > [..]
> >
> > > [ 3597.804861] RIP: 0010:[<ffffffffa03227f2>] [<ffffffffa03227f2>] nf_nat_cleanup_conntrack+0x42/0x70 [nf_nat]
> > > [ 3597.855207] RSP: 0018:ffff880202c63d40 EFLAGS: 00010246
> > > [ 3597.881350] RAX: 0000000000000000 RBX: ffff8801ac7bec28 RCX: ffff8801d0eedbe0
> > > [ 3597.917226] RDX: dead000000200200 RSI: 0000000000000011 RDI: ffffffffa03265b8
> > [..]
> >
> > > [ 3598.421036] <IRQ>
> > > [ 3598.430467] [<ffffffffa0305bb4>] __nf_ct_ext_destroy+0x44/0x60 [nf_conntrack]
> > > [ 3598.499191] [<ffffffffa02fd3fe>] nf_conntrack_free+0x2e/0x70 [nf_conntrack]
> > > [ 3598.534121] [<ffffffffa02febed>] destroy_conntrack+0xbd/0x110 [nf_conntrack]
> > > [ 3598.569981] [<ffffffff81532187>] nf_conntrack_destroy+0x17/0x20
> > > [ 3598.599579] [<ffffffffa02fe77c>] death_by_timeout+0xdc/0x1b0 [nf_conntrack]
> > [..]
> > > [ 3599.241868] Code: 83 ec 08 0f b6 58 11 84 db 74 43 48 01 c3 48 83 7b 20 00 74 39 48 c7 c7 b8 65 32 a0 e8 98 fc 2e e1 48 8b 03 48 8b 53 08 48 85 c0 <48> 89 02 74 04 48 89 50 08 48 ba 00 02 20 00 00 00 ad de 48 c7
> > > [ 3599.337037] RIP [<ffffffffa03227f2>] nf_nat_cleanup_conntrack+0x42/0x70 [nf_nat]
> >
> > Looks like we tried to remove bysource hash twice (rdx is
> > LIST_POISON_2).
> >
> > I wonder if this would explain it:
> >
> > static void nf_nat_l4proto_clean(u8 l3proto, u8 l4proto)
> > {
> > [..]
> > 	/* Step 1 - remove from bysource hash */
> > 	clean.hash = true;
> > 	for_each_net(net)
> > 		nf_ct_iterate_cleanup(net, nf_nat_proto_clean, &clean);
> >
> > A nfct->timer fires and a conntrack is free'd before step 2 memsets the
> > nat extension. In that case, we would try to delete nat->bysource
> > again?
>
> Not sure I follow, we only invoke nf_nat_l4proto_clean() through
> nf_nat_l4proto_unregister(), right?
>
> Did this happen during module unload?

Looks like it, nf_nat_ipv4 is listed as F- in the oops trace. (afaics,
"-" means "module going away").

Qian, do your tests involve module load/unload cycles?

* Re: Fwd: Re: [BUG] Fatal exception in interrupt - nf_nat_cleanup_conntrack during IPv6 tests
From: Patrick McHardy @ 2013-04-10 9:41 UTC
To: Florian Westphal; +Cc: netfilter-devel, caiqian

On Wed, Apr 10, 2013 at 11:32:04AM +0200, Florian Westphal wrote:
> Patrick McHardy <kaber@trash.net> wrote:
> > On Wed, Apr 10, 2013 at 11:04:36AM +0200, Florian Westphal wrote:
> > > Subject: Re: [BUG] Fatal exception in interrupt - nf_nat_cleanup_conntrack during IPv6 tests
> > > CAI Qian <caiqian@redhat.com> wrote:
> > > > Just hit this very often during IPv6 tests in both the latest stable
> > > > and mainline kernel.
> > > >
> > > > [ 3597.206166] Modules linked in:
> > > [..]
> > > > nf_nat_ipv4(F-)
> > > [..]
> > >
> > > > [ 3597.804861] RIP: 0010:[<ffffffffa03227f2>] [<ffffffffa03227f2>] nf_nat_cleanup_conntrack+0x42/0x70 [nf_nat]
> > > > [ 3597.855207] RSP: 0018:ffff880202c63d40 EFLAGS: 00010246
> > > > [ 3597.881350] RAX: 0000000000000000 RBX: ffff8801ac7bec28 RCX: ffff8801d0eedbe0
> > > > [ 3597.917226] RDX: dead000000200200 RSI: 0000000000000011 RDI: ffffffffa03265b8
> > > [..]
> > >
> > > > [ 3598.421036] <IRQ>
> > > > [ 3598.430467] [<ffffffffa0305bb4>] __nf_ct_ext_destroy+0x44/0x60 [nf_conntrack]
> > > > [ 3598.499191] [<ffffffffa02fd3fe>] nf_conntrack_free+0x2e/0x70 [nf_conntrack]
> > > > [ 3598.534121] [<ffffffffa02febed>] destroy_conntrack+0xbd/0x110 [nf_conntrack]
> > > > [ 3598.569981] [<ffffffff81532187>] nf_conntrack_destroy+0x17/0x20
> > > > [ 3598.599579] [<ffffffffa02fe77c>] death_by_timeout+0xdc/0x1b0 [nf_conntrack]
> > > [..]
> > > > [ 3599.241868] Code: 83 ec 08 0f b6 58 11 84 db 74 43 48 01 c3 48 83 7b 20 00 74 39 48 c7 c7 b8 65 32 a0 e8 98 fc 2e e1 48 8b 03 48 8b 53 08 48 85 c0 <48> 89 02 74 04 48 89 50 08 48 ba 00 02 20 00 00 00 ad de 48 c7
> > > > [ 3599.337037] RIP [<ffffffffa03227f2>] nf_nat_cleanup_conntrack+0x42/0x70 [nf_nat]
> > >
> > > Looks like we tried to remove bysource hash twice (rdx is
> > > LIST_POISON_2).
> > >
> > > I wonder if this would explain it:
> > >
> > > static void nf_nat_l4proto_clean(u8 l3proto, u8 l4proto)
> > > {
> > > [..]
> > > 	/* Step 1 - remove from bysource hash */
> > > 	clean.hash = true;
> > > 	for_each_net(net)
> > > 		nf_ct_iterate_cleanup(net, nf_nat_proto_clean, &clean);
> > >
> > > A nfct->timer fires and a conntrack is free'd before step 2 memsets the
> > > nat extension. In that case, we would try to delete nat->bysource
> > > again?
> >
> > Not sure I follow, we only invoke nf_nat_l4proto_clean() through
> > nf_nat_l4proto_unregister(), right?
> >
> > Did this happen during module unload?
>
> Looks like it, nf_nat_ipv4 is listed as F- in the oops trace. (afaics,
> "-" means "module going away").

Yes, that seems like a real race condition. We probably could extend the
nf_nat_lock sections to avoid this, but I wonder whether we should just kill
those conntracks, the connections are not going to work after being
"de-nated" anymore anyway.

* Re: Fwd: Re: [BUG] Fatal exception in interrupt - nf_nat_cleanup_conntrack during IPv6 tests
From: Florian Westphal @ 2013-04-10 14:56 UTC
To: Patrick McHardy; +Cc: Florian Westphal, netfilter-devel, caiqian

Patrick McHardy <kaber@trash.net> wrote:
> On Wed, Apr 10, 2013 at 11:32:04AM +0200, Florian Westphal wrote:
> > Patrick McHardy <kaber@trash.net> wrote:
> > > On Wed, Apr 10, 2013 at 11:04:36AM +0200, Florian Westphal wrote:
> > > > > [ 3599.241868] Code: 83 ec 08 0f b6 58 11 84 db 74 43 48 01 c3 48 83 7b 20 00 74 39 48 c7 c7 b8 65 32 a0 e8 98 fc 2e e1 48 8b 03 48 8b 53 08 48 85 c0 <48> 89 02 74 04 48 89 50 08 48 ba 00 02 20 00 00 00 ad de 48 c7
> > > > > [ 3599.337037] RIP [<ffffffffa03227f2>] nf_nat_cleanup_conntrack+0x42/0x70 [nf_nat]
> > > >
> > > > Looks like we tried to remove bysource hash twice (rdx is
> > > > LIST_POISON_2).
> > > >
> > > > I wonder if this would explain it:
> > > >
> > > > static void nf_nat_l4proto_clean(u8 l3proto, u8 l4proto)
> > > > {
> > > > [..]
> > > > 	/* Step 1 - remove from bysource hash */
> > > > 	clean.hash = true;
> > > > 	for_each_net(net)
> > > > 		nf_ct_iterate_cleanup(net, nf_nat_proto_clean, &clean);
> > > >
> > > > A nfct->timer fires and a conntrack is free'd before step 2 memsets the
> > > > nat extension. In that case, we would try to delete nat->bysource
> > > > again?
> > >
> > > Not sure I follow, we only invoke nf_nat_l4proto_clean() through
> > > nf_nat_l4proto_unregister(), right?
> > >
> > > Did this happen during module unload?
> >
> > Looks like it, nf_nat_ipv4 is listed as F- in the oops trace. (afaics,
> > "-" means "module going away").
>
> Yes, that seems like a real race condition. We probably could extend the
> nf_nat_lock sections to avoid this, but I wonder whether we should just kill
> those conntracks, the connections are not going to work after being
> "de-nated" anymore anyway.

I like it, just killing them would make it a lot more simple.

The clear-nat-extension-on-module-unload dance is getting out of hand,
and, as you point out, the connections are not going to work anyway...

* Re: Fwd: Re: [BUG] Fatal exception in interrupt - nf_nat_cleanup_conntrack during IPv6 tests
From: Patrick McHardy @ 2013-04-10 14:57 UTC
To: Florian Westphal; +Cc: netfilter-devel, caiqian

On Wed, Apr 10, 2013 at 04:56:21PM +0200, Florian Westphal wrote:
> Patrick McHardy <kaber@trash.net> wrote:
> > On Wed, Apr 10, 2013 at 11:32:04AM +0200, Florian Westphal wrote:
> > > Patrick McHardy <kaber@trash.net> wrote:
> > > > On Wed, Apr 10, 2013 at 11:04:36AM +0200, Florian Westphal wrote:
> > > > > > [ 3599.241868] Code: 83 ec 08 0f b6 58 11 84 db 74 43 48 01 c3 48 83 7b 20 00 74 39 48 c7 c7 b8 65 32 a0 e8 98 fc 2e e1 48 8b 03 48 8b 53 08 48 85 c0 <48> 89 02 74 04 48 89 50 08 48 ba 00 02 20 00 00 00 ad de 48 c7
> > > > > > [ 3599.337037] RIP [<ffffffffa03227f2>] nf_nat_cleanup_conntrack+0x42/0x70 [nf_nat]
> > > > >
> > > > > Looks like we tried to remove bysource hash twice (rdx is
> > > > > LIST_POISON_2).
> > > > >
> > > > > I wonder if this would explain it:
> > > > >
> > > > > static void nf_nat_l4proto_clean(u8 l3proto, u8 l4proto)
> > > > > {
> > > > > [..]
> > > > > 	/* Step 1 - remove from bysource hash */
> > > > > 	clean.hash = true;
> > > > > 	for_each_net(net)
> > > > > 		nf_ct_iterate_cleanup(net, nf_nat_proto_clean, &clean);
> > > > >
> > > > > A nfct->timer fires and a conntrack is free'd before step 2 memsets the
> > > > > nat extension. In that case, we would try to delete nat->bysource
> > > > > again?
> > > >
> > > > Not sure I follow, we only invoke nf_nat_l4proto_clean() through
> > > > nf_nat_l4proto_unregister(), right?
> > > >
> > > > Did this happen during module unload?
> > >
> > > Looks like it, nf_nat_ipv4 is listed as F- in the oops trace. (afaics,
> > > "-" means "module going away").
> >
> > Yes, that seems like a real race condition. We probably could extend the
> > nf_nat_lock sections to avoid this, but I wonder whether we should just kill
> > those conntracks, the connections are not going to work after being
> > "de-nated" anymore anyway.
>
> I like it, just killing them would make it a lot more simple.
>
> The clear-nat-extension-on-module-unload dance is getting out of hand,
> and, as you point out, the connections are not going to work anyway...

Yeah, let's just do that. Do you want to take care of this?

* Re: Fwd: Re: [BUG] Fatal exception in interrupt - nf_nat_cleanup_conntrack during IPv6 tests
From: Florian Westphal @ 2013-04-11 9:34 UTC
To: Patrick McHardy; +Cc: Florian Westphal, netfilter-devel, caiqian

Patrick McHardy <kaber@trash.net> wrote:
> > > > Looks like it, nf_nat_ipv4 is listed as F- in the oops trace. (afaics,
> > > > "-" means "module going away").
> > >
> > > Yes, that seems like a real race condition. We probably could extend the
> > > nf_nat_lock sections to avoid this, but I wonder whether we should just kill
> > > those conntracks, the connections are not going to work after being
> > > "de-nated" anymore anyway.
> >
> > I like it, just killing them would make it a lot more simple.
> >
> > The clear-nat-extension-on-module-unload dance is getting out of hand,
> > and, as you point out, the connections are not going to work anyway...
>
> Yeah, let's just do that. Do you want to take care of this?

I can look into it, sure.
However, I missed one important point: non-NAT'd connections
have a null binding, and there doesn't seem to be a way to differentiate
between real vs. null binding.

Simply returning 1 for conntracks-with-nat-extension will zap every
connection.

* Re: Fwd: Re: [BUG] Fatal exception in interrupt - nf_nat_cleanup_conntrack during IPv6 tests
From: Patrick McHardy @ 2013-04-11 10:40 UTC
To: Florian Westphal; +Cc: netfilter-devel, caiqian

On Thu, Apr 11, 2013 at 11:34:52AM +0200, Florian Westphal wrote:
> Patrick McHardy <kaber@trash.net> wrote:
> > > > > Looks like it, nf_nat_ipv4 is listed as F- in the oops trace. (afaics,
> > > > > "-" means "module going away").
> > > >
> > > > Yes, that seems like a real race condition. We probably could extend the
> > > > nf_nat_lock sections to avoid this, but I wonder whether we should just kill
> > > > those conntracks, the connections are not going to work after being
> > > > "de-nated" anymore anyway.
> > >
> > > I like it, just killing them would make it a lot more simple.
> > >
> > > The clear-nat-extension-on-module-unload dance is getting out of hand,
> > > and, as you point out, the connections are not going to work anyway...
> >
> > Yeah, let's just do that. Do you want to take care of this?
>
> I can look into it, sure.
> However, I missed one important point: non-NAT'd connections
> have a null binding, and there doesn't seem to be a way to differentiate
> between real vs. null binding.
>
> Simply returning 1 for conntracks-with-nat-extension will zap every
> connection.

We only set the IPS_SRC_NAT/IPS_DST_NAT flags for non-null bindings.
Checking for these should work.

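The flag check suggested in the last message could look roughly like the sketch below. It is a userspace model under stated assumptions, not a patch from this thread: struct model_conn, struct model_clean and model_nat_proto_remove() are hypothetical stand-ins for the conntrack entry, the cleanup arguments and the iterator callback, and the MODEL_IPS_* bit positions are assumed to mirror the kernel's IPS_SRC_NAT and IPS_DST_NAT status bits. The return convention follows the discussion above: returning 1 tells the iterator to kill the conntrack.

```c
#include <stdbool.h>

/* Assumed to mirror the kernel's IPS_SRC_NAT / IPS_DST_NAT status bits. */
enum {
	MODEL_IPS_SRC_NAT  = 1 << 4,
	MODEL_IPS_DST_NAT  = 1 << 5,
	MODEL_IPS_NAT_MASK = MODEL_IPS_SRC_NAT | MODEL_IPS_DST_NAT,
};

/* Hypothetical stand-in for the fields of a conntrack entry used here. */
struct model_conn {
	unsigned long status;	/* status flags, including the NAT bits */
	bool has_nat_ext;	/* true if a NAT extension is attached */
	unsigned char l3proto;
	unsigned char l4proto;
};

/* Hypothetical stand-in for the protocol pair being unregistered. */
struct model_clean {
	unsigned char l3proto;	/* 0 means "any" */
	unsigned char l4proto;	/* 0 means "any" */
};

/*
 * Iterator callback: return 1 to have the conntrack killed, 0 to keep it.
 * Entries with only a null binding carry a NAT extension but neither NAT
 * flag, so they survive the module unload.
 */
static int model_nat_proto_remove(const struct model_conn *ct,
				  const struct model_clean *clean)
{
	if (!ct->has_nat_ext)
		return 0;

	if ((clean->l3proto && ct->l3proto != clean->l3proto) ||
	    (clean->l4proto && ct->l4proto != clean->l4proto))
		return 0;

	return (ct->status & MODEL_IPS_NAT_MASK) ? 1 : 0;
}
```

With this shape, the case Florian raises (every connection having a NAT extension because of null bindings) no longer zaps unrelated connections; only entries that were actually translated are removed.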