* Kernel crash on helper module unload
@ 2016-05-04 22:46 Joe Stringer
2016-05-04 23:28 ` Florian Westphal
0 siblings, 1 reply; 2+ messages in thread
From: Joe Stringer @ 2016-05-04 22:46 UTC (permalink / raw)
To: netfilter-devel; +Cc: Florian Westphal, Pablo Neira Ayuso, Jarno Rajahalme
Hi all,
I've noticed that you can crash the kernel by running FTP traffic
through to a netns, then removing the FTP helper module from the host.
Repro involves setting automatic helpers (default up until nf-next),
running an FTP client in one netns through to a server in another
netns with linux bridge providing L2 connectivity in between. If you
remove the namespaces after running traffic, then the netns cleanup +
hook unregistration is deferred to a workqueue. If you can unload the
FTP helper module before this code triggers, then the work item will
attempt to destroy helpers that were provided by the (now unloaded)
module. This piece fails, causing the BUG.
I've boiled it down to a repro script here:
https://gist.github.com/joestringer/465328172ee8960242142572b0ffc6e1
The FTP server used within is a simple Python application, which
requires pyftpdlib:
https://github.com/openvswitch/ovs/blob/v2.5.0/tests/test-l7.py
Other dependencies are standard things like conntrack, ip, bridge-utils, wget.
With regard to affected kernels, I looked back as far as 3.13 and I can
still reproduce the issue with the above script.
Here's the kernel backtrace:
[ 136.808116] BUG: spinlock lockup suspected on CPU#0, kworker/u256:30/160
[ 136.808294] lock: 0xffff880069fd6400, .magic: dead4ead, .owner: kworker/u256:30/160, .owner_cpu: 0
[ 136.808533] CPU: 0 PID: 160 Comm: kworker/u256:30 Tainted: G D W 4.6.0-rc4-nn-fw-sct1+ #32
[ 136.808765] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/30/2014
[ 136.809026] 0000000000000000 ffff880064f5f588 ffffffff813b62be ffff880064f5a340
[ 136.809372] ffff880069fd6400 ffff880064f5f5a8 ffffffff8117f836 ffff880069fd6400
[ 136.809720] 000000008ea72658 ffff880064f5f5d8 ffffffff810c16da ffff880069fd6400
[ 136.810057] Call Trace:
[ 136.810174] [<ffffffff813b62be>] dump_stack+0x67/0x99
[ 136.810314] [<ffffffff8117f836>] spin_dump+0x90/0x95
[ 136.810452] [<ffffffff810c16da>] do_raw_spin_lock+0x9a/0x130
[ 136.810597] [<ffffffff817a5d7d>] _raw_spin_lock+0x5d/0x80
[ 136.810745] [<ffffffff817a02c7>] ? __schedule+0xc7/0xd00
[ 136.810885] [<ffffffff817a02c7>] __schedule+0xc7/0xd00
[ 136.811023] [<ffffffff8117f9b1>] ? printk+0x4d/0x4f
[ 136.811159] [<ffffffff817a0f3c>] schedule+0x3c/0x90
[ 136.811296] [<ffffffff8106b22d>] do_exit+0xb3d/0xc50
[ 136.811433] [<ffffffff810d0449>] ? kmsg_dump+0x109/0x180
[ 136.811574] [<ffffffff8101fea9>] oops_end+0x89/0xc0
[ 136.811711] [<ffffffff8105323e>] no_context+0x10e/0x380
[ 136.811850] [<ffffffff810535c3>] __bad_area_nosemaphore+0x113/0x210
[ 136.811999] [<ffffffff810536d4>] bad_area_nosemaphore+0x14/0x20
[ 136.812144] [<ffffffff8105377e>] __do_page_fault+0x9e/0x500
[ 136.812286] [<ffffffff81002038>] ? trace_hardirqs_off_thunk+0x1b/0x1d
[ 136.812437] [<ffffffff81053bec>] do_page_fault+0xc/0x10
[ 136.812580] [<ffffffff817a86b2>] page_fault+0x22/0x30
[ 136.812719] [<ffffffff8108d340>] ? kthread_data+0x10/0x20
[ 136.812860] [<ffffffff81086e9e>] wq_worker_sleeping+0xe/0x90
[ 136.813004] [<ffffffff817a0a51>] __schedule+0x851/0xd00
[ 136.813144] [<ffffffff813895b3>] ? put_io_context_active+0xa3/0xc0
[ 136.813292] [<ffffffff817a0f3c>] schedule+0x3c/0x90
[ 136.813428] [<ffffffff8106adc8>] do_exit+0x6d8/0xc50
[ 136.813571] [<ffffffff8101fea9>] oops_end+0x89/0xc0
[ 136.813707] [<ffffffff8105323e>] no_context+0x10e/0x380
[ 136.813847] [<ffffffff810535c3>] __bad_area_nosemaphore+0x113/0x210
[ 136.813996] [<ffffffff810536d4>] bad_area_nosemaphore+0x14/0x20
[ 136.814141] [<ffffffff8105377e>] __do_page_fault+0x9e/0x500
[ 136.814282] [<ffffffff81002038>] ? trace_hardirqs_off_thunk+0x1b/0x1d
[ 136.814433] [<ffffffff81053bec>] do_page_fault+0xc/0x10
[ 136.814571] [<ffffffff817a86b2>] page_fault+0x22/0x30
[ 136.814715] [<ffffffffa00bc797>] ? nf_ct_helper_destroy+0x97/0x170 [nf_conntrack]
[ 136.814937] [<ffffffffa00bc83f>] ? nf_ct_helper_destroy+0x13f/0x170 [nf_conntrack]
[ 136.815163] [<ffffffffa00bc73c>] ? nf_ct_helper_destroy+0x3c/0x170 [nf_conntrack]
[ 136.815388] [<ffffffffa00b6c9c>] nf_ct_delete+0x3c/0x1e0 [nf_conntrack]
[ 136.815544] [<ffffffffa00bc9f0>] ? nf_conntrack_helper_fini+0x30/0x30 [nf_conntrack]
[ 136.815768] [<ffffffffa00b75c8>] nf_ct_iterate_cleanup+0x258/0x270 [nf_conntrack]
[ 136.815990] [<ffffffffa00bcf0f>] nf_ct_l3proto_pernet_unregister+0x2f/0x60 [nf_conntrack]
[ 136.816219] [<ffffffffa00370e9>] ipv4_net_exit+0x19/0x50 [nf_conntrack_ipv4]
[ 136.816377] [<ffffffff81668fa8>] ops_exit_list.isra.4+0x38/0x60
[ 136.816523] [<ffffffff8166a35e>] cleanup_net+0x1be/0x290
[ 136.816664] [<ffffffff81085b2c>] process_one_work+0x1dc/0x660
[ 136.816808] [<ffffffff81085ab1>] ? process_one_work+0x161/0x660
[ 136.816953] [<ffffffff810860db>] worker_thread+0x12b/0x4a0
[ 136.817095] [<ffffffff81085fb0>] ? process_one_work+0x660/0x660
[ 136.817240] [<ffffffff8108ca22>] kthread+0xf2/0x110
[ 136.817376] [<ffffffff817a6c02>] ret_from_fork+0x22/0x40
[ 136.817515] [<ffffffff8108c930>] ? kthread_create_on_node+0x220/0x220
It seems like there are a couple of mitigations in the nf-next
pipeline at the moment. Firstly, if automatic helpers are turned off
then the namespace will not automatically add the FTP helper to
connections within the namespace. This decreases the likelihood of
hitting this issue, but you can still hit it if you re-enable the
automatic helpers.
Secondly, Florian's work to merge the conntrack tables across
namespaces seems to fix the issue at least with the above script.
While the basic repro script is unable to trigger the issue with those
patches, I wonder if a similar issue may persist due to the lack of
refcounting on helpers from rules, i.e. could we reproduce the issue by
explicitly setting FTP helper targets even on the latest code?
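For anyone checking which of the two cases a test machine is in, here
is a minimal sketch that reads the automatic helper setting. It assumes
the kernel exposes the net.netfilter.nf_conntrack_helper sysctl under
/proc/sys (the setting is per-netns, so it reflects the namespace the
script runs in); the function name is made up for illustration.

```python
from pathlib import Path

# Assumed location of the automatic helper assignment sysctl.
SYSCTL = Path("/proc/sys/net/netfilter/nf_conntrack_helper")


def automatic_helpers_enabled(path=SYSCTL):
    """Return True/False for the sysctl value, or None if it is unreadable."""
    try:
        return path.read_text().strip() == "1"
    except OSError:
        # nf_conntrack not loaded, or the kernel does not expose this sysctl.
        return None


if __name__ == "__main__":
    print("automatic helpers:", automatic_helpers_enabled())
```

A None result only says the file could not be read; it does not say
whether helpers are attached by explicit rules.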
Cheers,
Joe
* Re: Kernel crash on helper module unload
2016-05-04 22:46 Kernel crash on helper module unload Joe Stringer
@ 2016-05-04 23:28 ` Florian Westphal
0 siblings, 0 replies; 2+ messages in thread
From: Florian Westphal @ 2016-05-04 23:28 UTC (permalink / raw)
To: Joe Stringer
Cc: netfilter-devel, Florian Westphal, Pablo Neira Ayuso,
Jarno Rajahalme
Joe Stringer <joe@ovn.org> wrote:
> Hi all,
>
> I've noticed that you can crash the kernel by running FTP traffic
> through to a netns, then removing the FTP helper module from the host.
> Repro involves setting automatic helpers (default up until nf-next),
> running an FTP client in one netns through to a server in another
> netns with linux bridge providing L2 connectivity in between. If you
> remove the namespaces after running traffic, then the netns cleanup +
> hook unregistration is deferred to a workqueue. If you can unload the
> FTP helper module before this code triggers, then the work item will
> attempt to destroy helpers that were provided by the (now unloaded)
> module. This piece fails, causing the BUG.
>
> I've boiled it down to a repro script here:
> https://gist.github.com/joestringer/465328172ee8960242142572b0ffc6e1
>
> The FTP server used within is a simple python application here,
> requires pyftpdlib:
> https://github.com/openvswitch/ovs/blob/v2.5.0/tests/test-l7.py
Thanks.
> Other dependencies are standard things like conntrack, ip, bridge-utils, wget.
>
> In regards to affected kernels, I looked back as far as 3.13 and I can
> still reproduce the issue with the above script.
>
> Here's the kernel backtrace:
>
> [ 136.808116] BUG: spinlock lockup suspected on CPU#0, kworker/u256:30/160
> [ 136.808294] lock: 0xffff880069fd6400, .magic: dead4ead, .owner:
> kworker/u256:30/160, .owner_cpu: 0
[..]
AFAIU the following happens:
1. ct is created with ftp helper in netns x
2. netns x gets destroyed
3. netns destruction is scheduled
4. netns destruction wq starts, removes netns from global list
5. ftp helper is unloaded, which resets all helpers of the conntracks
... but because netns is already gone from list the for_each_net() loop
doesn't include it, so we do not change any of the conntracks in net
namespaces that are already dead.
6. netns destruction invokes destructor for rmmod'ed helper
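The ordering in the steps above can be replayed with a toy userspace
model. This is only a sketch, not kernel code: every class and function
name below is made up, net_list stands in for the global list walked by
for_each_net(), and a RuntimeError stands in for the page fault.

```python
# Toy model of the race; all names are hypothetical, nothing here is kernel API.

class Helper:
    def __init__(self, name):
        self.name = name
        self.loaded = True

    def destroy(self, ct):
        # Stands in for calling into module code, which is gone after rmmod.
        if not self.loaded:
            raise RuntimeError(f"call into unloaded helper {self.name!r}")
        ct.helper = None


class Conntrack:
    def __init__(self, helper):
        self.helper = helper


class Netns:
    def __init__(self, name):
        self.name = name
        self.conntracks = []


net_list = []  # models the global list of live namespaces


def unload_helper(helper):
    # Only namespaces still on the global list are visited.
    for net in net_list:
        for ct in net.conntracks:
            if ct.helper is helper:
                ct.helper = None
    helper.loaded = False


def cleanup_net(net):
    # Deferred work item: destroys whatever helper is still attached.
    for ct in net.conntracks:
        if ct.helper is not None:
            ct.helper.destroy(ct)
    net.conntracks.clear()


def run_race():
    ftp = Helper("ftp")
    x = Netns("x")
    net_list.append(x)
    x.conntracks.append(Conntrack(ftp))  # step 1
    net_list.remove(x)                   # steps 2-4: netns leaves the list
    unload_helper(ftp)                   # step 5: loop never sees netns x
    try:
        cleanup_net(x)                   # step 6: stale helper gets used
    except RuntimeError as exc:
        return str(exc)
    return "ok"
```

Calling run_race() returns the error string rather than "ok", because
the conntrack in netns x still points at the helper when cleanup runs.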
The main problem is that the netns unification doesn't fully resolve this
problem, as the unconfirmed lists are still part of the net namespace,
i.e. a helper assigned to a conntrack entry that isn't in the table, but
sitting on the unconfirmed list, would also trigger this bug.
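A rough illustration of that remaining gap, again as a userspace sketch
with made-up names: one shared table covers confirmed entries of every
namespace, while unconfirmed entries hang off the namespace and are only
reachable for namespaces that are still on the global list.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Conntrack:
    helper: Optional[str]


@dataclass
class Netns:
    name: str
    unconfirmed: List[Conntrack] = field(default_factory=list)


def unload_helper(name, shared_table, live_nets):
    # Confirmed entries: one table shared by all namespaces, always reachable.
    for ct in shared_table:
        if ct.helper == name:
            ct.helper = None
    # Unconfirmed entries: only reachable through namespaces still listed.
    for net in live_nets:
        for ct in net.unconfirmed:
            if ct.helper == name:
                ct.helper = None


def demo():
    dying = Netns("x")
    confirmed = Conntrack("ftp")
    pending = Conntrack("ftp")
    shared_table = [confirmed]
    dying.unconfirmed.append(pending)
    live_nets = []  # netns x was already removed from the global list
    unload_helper("ftp", shared_table, live_nets)
    return confirmed.helper, pending.helper
```

In this model the confirmed entry is reset but the unconfirmed one keeps
its helper reference, which is the case the unification does not cover.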
I'm afraid this is a similar mess to the one fixed in
commit 200b916f3575bdf11609cb447661b8d5957b0bbf
Author: Cong Wang <cwang@twopensource.com>
Date: Mon May 12 15:11:20 2014 -0700
rtnetlink: wait for unregistering devices in rtnl_link_unregister()
And we probably need to play games w. net_mutex :-|
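One possible shape of such a fix, sketched in the same toy style: the
unload path waits for (here: drains) all deferred namespace cleanups
before the helper goes away. This is an assumption about the direction,
not a description of any actual patch; in the kernel the exclusion would
presumably come from net_mutex rather than from a queue, and all names
below are hypothetical.

```python
from collections import deque


class Helper:
    def __init__(self, name):
        self.name = name
        self.loaded = True


def cleanup_net(conntracks):
    """Destroy every conntrack of a dying namespace; needs the helper loaded."""
    for ct in conntracks:
        helper = ct.get("helper")
        if helper is not None and not helper.loaded:
            raise RuntimeError("helper used after unload")
        ct["helper"] = None


def unload_helper(helper, pending_cleanup, wait_for_cleanup):
    if wait_for_cleanup:
        # Barrier: finish all deferred namespace cleanups first.
        while pending_cleanup:
            cleanup_net(pending_cleanup.popleft())
    helper.loaded = False


def demo(wait_for_cleanup):
    ftp = Helper("ftp")
    pending = deque([[{"helper": ftp}]])  # one dying netns, one conntrack
    unload_helper(ftp, pending, wait_for_cleanup)
    try:
        while pending:
            cleanup_net(pending.popleft())
    except RuntimeError:
        return "crash"
    return "ok"
```

Here demo(False) ends in "crash" and demo(True) in "ok": once unload
cannot complete while a namespace cleanup is outstanding, the stale
helper is never reached.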