* net_ns cleanup / RCU overhead @ 2014-08-20 5:58 Simon Kirby 2014-08-28 19:24 ` Paul E. McKenney 0 siblings, 1 reply; 12+ messages in thread From: Simon Kirby @ 2014-08-20 5:58 UTC (permalink / raw) To: linux-kernel, netdev; +Cc: Eric W. Biederman

Hello!

In trying to figure out what happened to a box running lots of vsftpd since we deployed a CONFIG_NET_NS=y kernel to it, we found that the (wall) time needed for cleanup_net() to complete, even on an idle box, can be quite long:

#!/bin/bash

ip netns delete test >&/dev/null
while ip netns add test; do
	echo hi
	ip netns delete test
done

On my desktop and typical hosts, this prints only around 4 or 6 times per second. While this is happening, "vmstat 1" reports 100% idle, and there are D-state processes with stacks similar to:

30566 [kworker/u16:1] D wait_rcu_gp+0x48, synchronize_sched+0x2f, cleanup_net+0xdb, process_one_work+0x175, worker_thread+0x119, kthread+0xbb, ret_from_fork+0x7c, 0xffffffffffffffff

32220 ip D copy_net_ns+0x68, create_new_namespaces+0xfc, unshare_nsproxy_namespaces+0x66, SyS_unshare+0x159, system_call_fastpath+0x16, 0xffffffffffffffff

copy_net_ns() is waiting on net_mutex, which is held by cleanup_net().

vsftpd uses CLONE_NEWNET to set up privsep processes. There is a comment about it being really slow before 2.6.35 (it avoids CLONE_NEWNET in that case). I didn't find anything that makes 2.6.35 any faster, but on Debian 2.6.36-5-amd64, I notice it does seem to be a bit faster than 3.2, 3.10, and 3.16, though still not anything I'd ever want to rely on per connection.

C implementation of the above: http://0x.ca/sim/ref/tools/netnsloop.c

Kernel stack "top": http://0x.ca/sim/ref/tools/pstack

What's going on here?

Simon-

^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: net_ns cleanup / RCU overhead 2014-08-20 5:58 net_ns cleanup / RCU overhead Simon Kirby @ 2014-08-28 19:24 ` Paul E. McKenney 2014-08-28 19:44 ` Simon Kirby 0 siblings, 1 reply; 12+ messages in thread From: Paul E. McKenney @ 2014-08-28 19:24 UTC (permalink / raw) To: Simon Kirby; +Cc: linux-kernel, netdev, Eric W. Biederman On Tue, Aug 19, 2014 at 10:58:55PM -0700, Simon Kirby wrote: > Hello! > > In trying to figure out what happened to a box running lots of vsftpd > since we deployed a CONFIG_NET_NS=y kernel to it, we found that the > (wall) time needed for cleanup_net() to complete, even on an idle box, > can be quite long: > > #!/bin/bash > > ip netns delete test >&/dev/null > while ip netns add test; do > echo hi > ip netns delete test > done > > On my desktop and typical hosts, this prints at only around 4 or 6 per > second. While this is happening, "vmstat 1" reports 100% idle, and there > there are D-state processes with stacks similar to: > > 30566 [kworker/u16:1] D wait_rcu_gp+0x48, synchronize_sched+0x2f, cleanup_net+0xdb, process_one_work+0x175, worker_thread+0x119, kthread+0xbb, ret_from_fork+0x7c, 0xffffffffffffffff > > 32220 ip D copy_net_ns+0x68, create_new_namespaces+0xfc, unshare_nsproxy_namespaces+0x66, SyS_unshare+0x159, system_call_fastpath+0x16, 0xffffffffffffffff > > copy_net_ns() is waiting on net_mutex which is held by cleanup_net(). > > vsftpd uses CLONE_NEWNET to set up privsep processes. There is a comment > about it being really slow before 2.6.35 (it avoids CLONE_NEWNET in that > case). I didn't find anything that makes 2.6.35 any faster, but on Debian > 2.6.36-5-amd64, I notice it does seem to be a bit faster than 3.2, 3.10, > 3.16, though still not anything I'd ever want to rely on per connection. > > C implementation of the above: http://0x.ca/sim/ref/tools/netnsloop.c > > Kernel stack "top": http://0x.ca/sim/ref/tools/pstack > > What's going on here? That is a bit slow for many configurations, but there are some exceptions. 
So, what is your kernel's .config? Thanx, Paul ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: net_ns cleanup / RCU overhead 2014-08-28 19:24 ` Paul E. McKenney @ 2014-08-28 19:44 ` Simon Kirby 2014-08-28 20:33 ` Eric W. Biederman 0 siblings, 1 reply; 12+ messages in thread From: Simon Kirby @ 2014-08-28 19:44 UTC (permalink / raw) To: Paul E. McKenney; +Cc: linux-kernel, netdev, Eric W. Biederman On Thu, Aug 28, 2014 at 12:24:31PM -0700, Paul E. McKenney wrote: > On Tue, Aug 19, 2014 at 10:58:55PM -0700, Simon Kirby wrote: > > Hello! > > > > In trying to figure out what happened to a box running lots of vsftpd > > since we deployed a CONFIG_NET_NS=y kernel to it, we found that the > > (wall) time needed for cleanup_net() to complete, even on an idle box, > > can be quite long: > > > > #!/bin/bash > > > > ip netns delete test >&/dev/null > > while ip netns add test; do > > echo hi > > ip netns delete test > > done > > > > On my desktop and typical hosts, this prints at only around 4 or 6 per > > second. While this is happening, "vmstat 1" reports 100% idle, and there > > there are D-state processes with stacks similar to: > > > > 30566 [kworker/u16:1] D wait_rcu_gp+0x48, synchronize_sched+0x2f, cleanup_net+0xdb, process_one_work+0x175, worker_thread+0x119, kthread+0xbb, ret_from_fork+0x7c, 0xffffffffffffffff > > > > 32220 ip D copy_net_ns+0x68, create_new_namespaces+0xfc, unshare_nsproxy_namespaces+0x66, SyS_unshare+0x159, system_call_fastpath+0x16, 0xffffffffffffffff > > > > copy_net_ns() is waiting on net_mutex which is held by cleanup_net(). > > > > vsftpd uses CLONE_NEWNET to set up privsep processes. There is a comment > > about it being really slow before 2.6.35 (it avoids CLONE_NEWNET in that > > case). I didn't find anything that makes 2.6.35 any faster, but on Debian > > 2.6.36-5-amd64, I notice it does seem to be a bit faster than 3.2, 3.10, > > 3.16, though still not anything I'd ever want to rely on per connection. 
> > > > C implementation of the above: http://0x.ca/sim/ref/tools/netnsloop.c > > > > Kernel stack "top": http://0x.ca/sim/ref/tools/pstack > > > > What's going on here? > > That is a bit slow for many configurations, but there are some exceptions. > > So, what is your kernel's .config? I was unable to find a config (or stock kernel) that was any different, but here's the one we're using: http://0x.ca/sim/ref/3.10/config-3.10.53 How fast does the above test run for you? We've been running with the attached, which has helped a little, but it's still quite slow in our particular use case (vsftpd), and with the above test. Should I enable RCU_TRACE or STALL_INFO with a low timeout or something? Simon- -- >8 -- Subject: [PATCH] netns: use synchronize_rcu_expedited instead of synchronize_rcu Similar to ef323088, with synchronize_rcu(), we are only able to create and destroy about 4 or 7 net namespaces per second, which really puts a dent in the performance of programs attempting to use CLONE_NEWNET for privilege separation (vsftpd, chromium). --- net/core/net_namespace.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c index 85b6269..6dcb4b3 100644 --- a/net/core/net_namespace.c +++ b/net/core/net_namespace.c @@ -296,7 +296,7 @@ static void cleanup_net(struct work_struct *work) * This needs to be before calling the exit() notifiers, so * the rcu_barrier() below isn't sufficient alone. */ - synchronize_rcu(); + synchronize_rcu_expedited(); /* Run all of the network namespace exit methods */ list_for_each_entry_reverse(ops, &pernet_list, list) -- 1.7.10.4 ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: net_ns cleanup / RCU overhead 2014-08-28 19:44 ` Simon Kirby @ 2014-08-28 20:33 ` Eric W. Biederman 2014-08-28 20:46 ` Paul E. McKenney 0 siblings, 1 reply; 12+ messages in thread From: Eric W. Biederman @ 2014-08-28 20:33 UTC (permalink / raw) To: Simon Kirby; +Cc: Paul E. McKenney, linux-kernel, netdev Simon Kirby <sim@hostway.ca> writes: > On Thu, Aug 28, 2014 at 12:24:31PM -0700, Paul E. McKenney wrote: > >> On Tue, Aug 19, 2014 at 10:58:55PM -0700, Simon Kirby wrote: >> > Hello! >> > >> > In trying to figure out what happened to a box running lots of vsftpd >> > since we deployed a CONFIG_NET_NS=y kernel to it, we found that the >> > (wall) time needed for cleanup_net() to complete, even on an idle box, >> > can be quite long: >> > >> > #!/bin/bash >> > >> > ip netns delete test >&/dev/null >> > while ip netns add test; do >> > echo hi >> > ip netns delete test >> > done >> > >> > On my desktop and typical hosts, this prints at only around 4 or 6 per >> > second. While this is happening, "vmstat 1" reports 100% idle, and there >> > there are D-state processes with stacks similar to: >> > >> > 30566 [kworker/u16:1] D wait_rcu_gp+0x48, synchronize_sched+0x2f, cleanup_net+0xdb, process_one_work+0x175, worker_thread+0x119, kthread+0xbb, ret_from_fork+0x7c, 0xffffffffffffffff >> > >> > 32220 ip D copy_net_ns+0x68, create_new_namespaces+0xfc, unshare_nsproxy_namespaces+0x66, SyS_unshare+0x159, system_call_fastpath+0x16, 0xffffffffffffffff >> > >> > copy_net_ns() is waiting on net_mutex which is held by cleanup_net(). >> > >> > vsftpd uses CLONE_NEWNET to set up privsep processes. There is a comment >> > about it being really slow before 2.6.35 (it avoids CLONE_NEWNET in that >> > case). I didn't find anything that makes 2.6.35 any faster, but on Debian >> > 2.6.36-5-amd64, I notice it does seem to be a bit faster than 3.2, 3.10, >> > 3.16, though still not anything I'd ever want to rely on per connection. 
>> > >> > C implementation of the above: http://0x.ca/sim/ref/tools/netnsloop.c >> > >> > Kernel stack "top": http://0x.ca/sim/ref/tools/pstack >> > >> > What's going on here? >> >> That is a bit slow for many configurations, but there are some exceptions. >> >> So, what is your kernel's .config? > > I was unable to find a config (or stock kernel) that was any different, > but here's the one we're using: http://0x.ca/sim/ref/3.10/config-3.10.53 > > How fast does the above test run for you? > > We've been running with the attached, which has helped a little, but it's > still quite slow in our particular use case (vsftpd), and with the above > test. Should I enable RCU_TRACE or STALL_INFO with a low timeout or > something? I just want to add a little bit more analysis to this. What we desire to be fast is copy_net_ns; cleanup_net is batched and asynchronous, and nothing really cares how long it takes except that cleanup_net holds the net_mutex and thus blocks copy_net_ns. The puzzle is why and which rcu delays Simon is seeing in the network namespace cleanup path, as it seems like the synchronize_rcu is not the only one, and in the case of vsftpd with trivial network namespaces, where nothing has been done, we should not need to delay. Eric > Simon- > > -- >8 -- > Subject: [PATCH] netns: use synchronize_rcu_expedited instead of > synchronize_rcu > > Similar to ef323088, with synchronize_rcu(), we are only able to create > and destroy about 4 or 7 net namespaces per second, which really puts a > dent in the performance of programs attempting to use CLONE_NEWNET for > privilege separation (vsftpd, chromium). 
> --- > net/core/net_namespace.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c > index 85b6269..6dcb4b3 100644 > --- a/net/core/net_namespace.c > +++ b/net/core/net_namespace.c > @@ -296,7 +296,7 @@ static void cleanup_net(struct work_struct *work) > * This needs to be before calling the exit() notifiers, so > * the rcu_barrier() below isn't sufficient alone. > */ > - synchronize_rcu(); > + synchronize_rcu_expedited(); > > /* Run all of the network namespace exit methods */ > list_for_each_entry_reverse(ops, &pernet_list, list) ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: net_ns cleanup / RCU overhead 2014-08-28 20:33 ` Eric W. Biederman @ 2014-08-28 20:46 ` Paul E. McKenney 2014-08-29 0:40 ` Simon Kirby 0 siblings, 1 reply; 12+ messages in thread From: Paul E. McKenney @ 2014-08-28 20:46 UTC (permalink / raw) To: Eric W. Biederman; +Cc: Simon Kirby, linux-kernel, netdev On Thu, Aug 28, 2014 at 03:33:42PM -0500, Eric W. Biederman wrote: > Simon Kirby <sim@hostway.ca> writes: > > > On Thu, Aug 28, 2014 at 12:24:31PM -0700, Paul E. McKenney wrote: > > > >> On Tue, Aug 19, 2014 at 10:58:55PM -0700, Simon Kirby wrote: > >> > Hello! > >> > > >> > In trying to figure out what happened to a box running lots of vsftpd > >> > since we deployed a CONFIG_NET_NS=y kernel to it, we found that the > >> > (wall) time needed for cleanup_net() to complete, even on an idle box, > >> > can be quite long: > >> > > >> > #!/bin/bash > >> > > >> > ip netns delete test >&/dev/null > >> > while ip netns add test; do > >> > echo hi > >> > ip netns delete test > >> > done > >> > > >> > On my desktop and typical hosts, this prints at only around 4 or 6 per > >> > second. While this is happening, "vmstat 1" reports 100% idle, and there > >> > there are D-state processes with stacks similar to: > >> > > >> > 30566 [kworker/u16:1] D wait_rcu_gp+0x48, synchronize_sched+0x2f, cleanup_net+0xdb, process_one_work+0x175, worker_thread+0x119, kthread+0xbb, ret_from_fork+0x7c, 0xffffffffffffffff > >> > > >> > 32220 ip D copy_net_ns+0x68, create_new_namespaces+0xfc, unshare_nsproxy_namespaces+0x66, SyS_unshare+0x159, system_call_fastpath+0x16, 0xffffffffffffffff > >> > > >> > copy_net_ns() is waiting on net_mutex which is held by cleanup_net(). > >> > > >> > vsftpd uses CLONE_NEWNET to set up privsep processes. There is a comment > >> > about it being really slow before 2.6.35 (it avoids CLONE_NEWNET in that > >> > case). 
I didn't find anything that makes 2.6.35 any faster, but on Debian > >> > 2.6.36-5-amd64, I notice it does seem to be a bit faster than 3.2, 3.10, > >> > 3.16, though still not anything I'd ever want to rely on per connection. > >> > > >> > C implementation of the above: http://0x.ca/sim/ref/tools/netnsloop.c > >> > > >> > Kernel stack "top": http://0x.ca/sim/ref/tools/pstack > >> > > >> > What's going on here? > >> > >> That is a bit slow for many configurations, but there are some exceptions. > >> > >> So, what is your kernel's .config? > > > > I was unable to find a config (or stock kernel) that was any different, > > but here's the one we're using: http://0x.ca/sim/ref/3.10/config-3.10.53 > > > > How fast does the above test run for you? > > > > We've been running with the attached, which has helped a little, but it's > > still quite slow in our particular use case (vsftpd), and with the above > > test. Should I enable RCU_TRACE or STALL_INFO with a low timeout or > > something? > > I just want to add a little bit more analysis to this. > > What we desire to be fast is the copy_net_ns, cleanup_net is batched and > asynchronous which nothing really cares how long it takes except that > cleanup_net holds the net_mutex and thus blocks copy_net_ns. > > The puzzle is why and which rcu delays Simon is seeing in the network > namespace cleanup path, as it seems like the synchronize_rcu is not > the only one, and in the case of vsftp with trivial network namespaces > where nothing has been done we should not need to delay. Indeed, given the version and .config, I can't see why any individual RCU grace-period operation would be particularly slow. I suggest using ftrace on synchronize_rcu() and friends. 
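For example, something along these lines (a hypothetical session, untested here; it assumes root, CONFIG_FUNCTION_GRAPH_TRACER=y, and debugfs mounted at /sys/kernel/debug):

```
cd /sys/kernel/debug/tracing
# Restrict the graph tracer to the grace-period waits so the
# trace shows a per-call duration for each one:
echo synchronize_rcu synchronize_sched > set_graph_function
echo function_graph > current_tracer
echo 1 > tracing_on
# ... run the netns add/delete loop in another terminal ...
echo 0 > tracing_on
cat trace
```

That should show whether the time is going into one unusually long grace period or into many back-to-back ones.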
Thanx, Paul > Eric > > > > Simon- > > > > -- >8 -- > > Subject: [PATCH] netns: use synchronize_rcu_expedited instead of > > synchronize_rcu > > > > Similar to ef323088, with synchronize_rcu(), we are only able to create > > and destroy about 4 or 7 net namespaces per second, which really puts a > > dent in the performance of programs attempting to use CLONE_NEWNET for > > privilege separation (vsftpd, chromium). > > --- > > net/core/net_namespace.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c > > index 85b6269..6dcb4b3 100644 > > --- a/net/core/net_namespace.c > > +++ b/net/core/net_namespace.c > > @@ -296,7 +296,7 @@ static void cleanup_net(struct work_struct *work) > > * This needs to be before calling the exit() notifiers, so > > * the rcu_barrier() below isn't sufficient alone. > > */ > > - synchronize_rcu(); > > + synchronize_rcu_expedited(); > > > > /* Run all of the network namespace exit methods */ > > list_for_each_entry_reverse(ops, &pernet_list, list) > ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: net_ns cleanup / RCU overhead 2014-08-28 20:46 ` Paul E. McKenney @ 2014-08-29 0:40 ` Simon Kirby 2014-08-29 3:57 ` Julian Anastasov 2014-08-30 2:52 ` Paul E. McKenney 0 siblings, 2 replies; 12+ messages in thread From: Simon Kirby @ 2014-08-29 0:40 UTC (permalink / raw) To: Paul E. McKenney; +Cc: Eric W. Biederman, linux-kernel, netdev On Thu, Aug 28, 2014 at 01:46:58PM -0700, Paul E. McKenney wrote: > On Thu, Aug 28, 2014 at 03:33:42PM -0500, Eric W. Biederman wrote: > > > I just want to add a little bit more analysis to this. > > > > What we desire to be fast is the copy_net_ns, cleanup_net is batched and > > asynchronous which nothing really cares how long it takes except that > > cleanup_net holds the net_mutex and thus blocks copy_net_ns. > > > > The puzzle is why and which rcu delays Simon is seeing in the network > > namespace cleanup path, as it seems like the synchronize_rcu is not > > the only one, and in the case of vsftp with trivail network namespaces > > where nothing has been done we should not need to delay. > > Indeed, given the version and .config, I can't see why any individual > RCU grace-period operation would be particularly slow. > > I suggest using ftrace on synchronize_rcu() and friends. 
I made a parallel net namespace create/destroy benchmark that prints the progress and time to create and cleanup 32 unshare()d child processes: http://0x.ca/sim/ref/tools/netnsbench.c I noticed that if I haven't run it for a while, the first batch often is fast, followed by slowness from then on: ++++++++++++++++++++++++++++++++-------------------------------- 0.039478s ++++++++++++++++++++-----+----------------+++++++++---------++-- 4.463837s +++++++++++++++++++++++++------+--------------------++++++------ 3.011882s +++++++++++++++---+-------------++++++++++++++++---------------- 2.283993s Fiddling around on a stock kernel, "echo 1 > /sys/kernel/rcu_expedited" makes behaviour change as it did with my patch: ++-++-+++-+-----+-+-++-+-++--++-+--+-+-++--++-+-+-+-++-+--++---- 0.801406s +-+-+-++-+-+-+-+-++--+-+-++-+--++-+-+-+-+-+-+-+-+-+-+-+--++-+--- 0.872011s ++--+-++--+-++--+-++--+-+-+-+-++-+--++--+-++-+-+-+-+--++-+-+-+-- 0.946745s How would I use ftrace on synchronize_rcu() here? As Eric said, cleanup_net() is batched, but while it is cleaning up, net_mutex is held. Isn't the issue just that net_mutex is held while some other things are going on that are meant to be lazy / batched? What is net_mutex protecting in cleanup_net()? 
I noticed that [kworker/u16:0]'s stack is often: [<ffffffff810942a6>] wait_rcu_gp+0x46/0x50 [<ffffffff8109607e>] synchronize_sched+0x2e/0x50 [<ffffffffa00385ac>] nf_nat_net_exit+0x2c/0x50 [nf_nat] [<ffffffff81720339>] ops_exit_list.isra.4+0x39/0x60 [<ffffffff817209e0>] cleanup_net+0xf0/0x1a0 [<ffffffff81062997>] process_one_work+0x157/0x440 [<ffffffff81063303>] worker_thread+0x63/0x520 [<ffffffff81068b96>] kthread+0xd6/0xf0 [<ffffffff818d412c>] ret_from_fork+0x7c/0xb0 [<ffffffffffffffff>] 0xffffffffffffffff and [<ffffffff81095364>] _rcu_barrier+0x154/0x1f0 [<ffffffff81095450>] rcu_barrier+0x10/0x20 [<ffffffff81102c2c>] kmem_cache_destroy+0x6c/0xb0 [<ffffffffa0089e97>] nf_conntrack_cleanup_net_list+0x167/0x1c0 [nf_conntrack] [<ffffffffa008aab5>] nf_conntrack_pernet_exit+0x65/0x70 [nf_conntrack] [<ffffffff81720353>] ops_exit_list.isra.4+0x53/0x60 [<ffffffff817209e0>] cleanup_net+0xf0/0x1a0 [<ffffffff81062997>] process_one_work+0x157/0x440 [<ffffffff81063303>] worker_thread+0x63/0x520 [<ffffffff81068b96>] kthread+0xd6/0xf0 [<ffffffff818d412c>] ret_from_fork+0x7c/0xb0 [<ffffffffffffffff>] 0xffffffffffffffff So I tried flushing iptables rules and rmmoding netfilter bits: ++++++++++++++++++++-+--------------------+++++++++++----------- 0.179940s ++++++++++++++--+-------------+++++++++++++++++----------------- 0.151988s ++++++++++++++++++++++++++++---+--------------------------+++--- 0.159967s ++++++++++++++++++++++----------------------++++++++++---------- 0.175964s Expedited: ++-+--++-+-+-+-+-+-+--++-+-+-++-+-+-+--++-+-+-+-+-+-+-+-+-+-+--- 0.079988s ++-+-+-+-+-+-+-+-+-+-+-+--++-+--++-+--+-++-+-+--++-+-+-+-+-+-+-- 0.089347s ++++--+++--++--+-+++++++-+++++--------------++-+-+--++-+-+--++-- 0.081566s +++++-+++-------++-+-+-+-+-+-+-+-+-+-+-++-+-+-+-+-+-+-+-+-+-+--- 0.089026s So, much faster. It seems that just loading nf_conntrack_ipv4 (like by running iptables -t nat -nvL) is enough to slow it way down. But it is still capable of being fast, as above. 
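(For the record, the flush/unload step was along these lines — an approximate reconstruction; the exact module chain depends on the kernel config:)

```
iptables -F && iptables -t nat -F && iptables -X
rmmod iptable_nat nf_nat_ipv4 nf_nat
rmmod nf_conntrack_ipv4 nf_conntrack
```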
Simon- ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: net_ns cleanup / RCU overhead 2014-08-29 0:40 ` Simon Kirby @ 2014-08-29 3:57 ` Julian Anastasov 2014-08-29 21:57 ` Eric W. Biederman 2014-08-30 2:52 ` Paul E. McKenney 1 sibling, 1 reply; 12+ messages in thread From: Julian Anastasov @ 2014-08-29 3:57 UTC (permalink / raw) To: Simon Kirby; +Cc: Paul E. McKenney, Eric W. Biederman, linux-kernel, netdev Hello, On Thu, 28 Aug 2014, Simon Kirby wrote: > I noticed that [kworker/u16:0]'s stack is often: > > [<ffffffff810942a6>] wait_rcu_gp+0x46/0x50 > [<ffffffff8109607e>] synchronize_sched+0x2e/0x50 > [<ffffffffa00385ac>] nf_nat_net_exit+0x2c/0x50 [nf_nat] I guess the problem is in nf_nat_net_exit, may be other nf exit handlers too. pernet-exit handlers should avoid synchronize_rcu and rcu_barrier. A RCU callback and rcu_barrier in module-exit is the way to go. cleanup_net includes rcu_barrier, so pernet-exit does not need such calls. > [<ffffffff81720339>] ops_exit_list.isra.4+0x39/0x60 > [<ffffffff817209e0>] cleanup_net+0xf0/0x1a0 > [<ffffffff81062997>] process_one_work+0x157/0x440 > [<ffffffff81063303>] worker_thread+0x63/0x520 > [<ffffffff81068b96>] kthread+0xd6/0xf0 > [<ffffffff818d412c>] ret_from_fork+0x7c/0xb0 > [<ffffffffffffffff>] 0xffffffffffffffff Regards -- Julian Anastasov <ja@ssi.bg> ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: net_ns cleanup / RCU overhead 2014-08-29 3:57 ` Julian Anastasov @ 2014-08-29 21:57 ` Eric W. Biederman 2014-08-29 23:52 ` Florian Westphal 2014-08-30 8:20 ` Julian Anastasov 0 siblings, 2 replies; 12+ messages in thread From: Eric W. Biederman @ 2014-08-29 21:57 UTC (permalink / raw) To: Julian Anastasov Cc: Simon Kirby, Paul E. McKenney, linux-kernel, netdev, Florian Westphal, Pablo Neira Ayuso Julian Anastasov <ja@ssi.bg> writes: > Hello, > > On Thu, 28 Aug 2014, Simon Kirby wrote: > >> I noticed that [kworker/u16:0]'s stack is often: >> >> [<ffffffff810942a6>] wait_rcu_gp+0x46/0x50 >> [<ffffffff8109607e>] synchronize_sched+0x2e/0x50 >> [<ffffffffa00385ac>] nf_nat_net_exit+0x2c/0x50 [nf_nat] > > I guess the problem is in nf_nat_net_exit, > may be other nf exit handlers too. pernet-exit handlers > should avoid synchronize_rcu and rcu_barrier. > A RCU callback and rcu_barrier in module-exit is the way > to go. cleanup_net includes rcu_barrier, so pernet-exit > does not need such calls. In principle I agree, however in this particular case it looks a bit tricky because of the separate hash table used to track nat state per network namespace. At the same time, all of the packets should be drained before we get to nf_nat_net_exit, so it doesn't look like the synchronize_rcu in nf_nat_net_exit is actually protecting anything. Further, calling an rcu delay function in net_exit methods largely destroys the batched cleanup of network namespaces, so it is very unpleasant. Could someone who knows nf_nat_core.c better than I do look and see if we can just remove the synchronize_rcu in nf_nat_net_exit? >> [<ffffffff81720339>] ops_exit_list.isra.4+0x39/0x60 >> [<ffffffff817209e0>] cleanup_net+0xf0/0x1a0 >> [<ffffffff81062997>] process_one_work+0x157/0x440 >> [<ffffffff81063303>] worker_thread+0x63/0x520 >> [<ffffffff81068b96>] kthread+0xd6/0xf0 >> [<ffffffff818d412c>] ret_from_fork+0x7c/0xb0 >> [<ffffffffffffffff>] 0xffffffffffffffff Eric ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: net_ns cleanup / RCU overhead 2014-08-29 21:57 ` Eric W. Biederman @ 2014-08-29 23:52 ` Florian Westphal 2014-08-30 2:56 ` Paul E. McKenney 2014-08-30 8:20 ` Julian Anastasov 1 sibling, 1 reply; 12+ messages in thread From: Florian Westphal @ 2014-08-29 23:52 UTC (permalink / raw) To: Eric W. Biederman Cc: Julian Anastasov, Simon Kirby, Paul E. McKenney, linux-kernel, netdev, Florian Westphal, Pablo Neira Ayuso Eric W. Biederman <ebiederm@xmission.com> wrote: > Julian Anastasov <ja@ssi.bg> writes: > > > Hello, > > > > On Thu, 28 Aug 2014, Simon Kirby wrote: > > > >> I noticed that [kworker/u16:0]'s stack is often: > >> > >> [<ffffffff810942a6>] wait_rcu_gp+0x46/0x50 > >> [<ffffffff8109607e>] synchronize_sched+0x2e/0x50 > >> [<ffffffffa00385ac>] nf_nat_net_exit+0x2c/0x50 [nf_nat] > > > > I guess the problem is in nf_nat_net_exit, > > may be other nf exit handlers too. pernet-exit handlers > > should avoid synchronize_rcu and rcu_barrier. > > A RCU callback and rcu_barrier in module-exit is the way > > to go. cleanup_net includes rcu_barrier, so pernet-exit > > does not need such calls. > > In principle I agree, however in this particular case it looks a bit > tricky because a separate hash table to track nat state per network > namespace. > > At the same time all of the packets should be drained before > we get to nf_nat_net_exit so it doesn't look the synchronize_rcu > in nf_nat_exit is actually protecting anything. Hmm, the problem is with the conntrack entries living in the netns being destroyed. I don't think they are guaranteed to be removed by the time the nat netns exit function runs. > Further calling a rcu delay function in net_exit methods largely > destroys the batched cleanup of network namespaces, so it is very > unpleasant. > > Could someone who knows nf_nat_core.c better than I do look and > see if we can just remove the synchronize_rcu in nf_nat_exit? 
If I remember correctly its needed to ensure that all conntracks with nat extensions that might still be referenced on other cpu have finished (i.e., nf_conntrack_destroy() has been called, which calls nf_nat_cleanup_conntrack() which deletes the extension from the hash table). As we remove the ct from that table ourselves EXCEPT in the case where we cannot steal the timers' reference we should be able to avoid that call virtually every time. Perhaps this is worth a shot (not even compile tested): diff --git a/net/netfilter/nf_nat_core.c b/net/netfilter/nf_nat_core.c index 4e0b478..80cfe10 100644 --- a/net/netfilter/nf_nat_core.c +++ b/net/netfilter/nf_nat_core.c @@ -508,6 +508,7 @@ EXPORT_SYMBOL_GPL(nf_nat_packet); struct nf_nat_proto_clean { u8 l3proto; u8 l4proto; + bool need_sync_rcu; }; /* kill conntracks with affected NAT section */ @@ -528,23 +529,32 @@ static int nf_nat_proto_remove(struct nf_conn *i, void *data) static int nf_nat_proto_clean(struct nf_conn *ct, void *data) { + struct nf_nat_proto_clean *clean = data; struct nf_conn_nat *nat = nfct_nat(ct); - if (nf_nat_proto_remove(ct, data)) - return 1; - if (!nat || !nat->ct) return 0; - /* This netns is being destroyed, and conntrack has nat null binding. + /* This netns is being destroyed, and conntrack has nat binding. * Remove it from bysource hash, as the table will be freed soon. * - * Else, when the conntrack is destoyed, nf_nat_cleanup_conntrack() + * Else, when the conntrack is destroyed, nf_nat_cleanup_conntrack() * will delete entry from already-freed table. */ - if (!del_timer(&ct->timeout)) + if (!del_timer(&ct->timeout)) { + /* We have nat binding, but destruction + * might already be in progress. + * + * nat entry is removed only after last + * nf_ct_put(). + */ + clean->need_sync_rcu = true; return 1; + } + /* We stole refcount owned by timer; + * conntrack cannot go away. 
+ */ spin_lock_bh(&nf_nat_lock); hlist_del_rcu(&nat->bysource); ct->status &= ~IPS_NAT_DONE_MASK; @@ -553,6 +563,9 @@ static int nf_nat_proto_clean(struct nf_conn *ct, void *data) add_timer(&ct->timeout); + if (nf_nat_proto_remove(ct, data)) + return 1; + /* don't delete conntrack. Although that would make things a lot * simpler, we'd end up flushing all conntracks on nat rmmod. */ @@ -830,7 +843,8 @@ static void __net_exit nf_nat_net_exit(struct net *net) struct nf_nat_proto_clean clean = {}; nf_ct_iterate_cleanup(net, nf_nat_proto_clean, &clean, 0, 0); - synchronize_rcu(); + if (clean.need_sync_rcu) + synchronize_rcu(); nf_ct_free_hashtable(net->ct.nat_bysource, net->ct.nat_htable_size); } ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: net_ns cleanup / RCU overhead 2014-08-29 23:52 ` Florian Westphal @ 2014-08-30 2:56 ` Paul E. McKenney 0 siblings, 0 replies; 12+ messages in thread From: Paul E. McKenney @ 2014-08-30 2:56 UTC (permalink / raw) To: Florian Westphal Cc: Eric W. Biederman, Julian Anastasov, Simon Kirby, linux-kernel, netdev, Pablo Neira Ayuso On Sat, Aug 30, 2014 at 01:52:06AM +0200, Florian Westphal wrote: > Eric W. Biederman <ebiederm@xmission.com> wrote: > > Julian Anastasov <ja@ssi.bg> writes: > > > > > Hello, > > > > > > On Thu, 28 Aug 2014, Simon Kirby wrote: > > > > > >> I noticed that [kworker/u16:0]'s stack is often: > > >> > > >> [<ffffffff810942a6>] wait_rcu_gp+0x46/0x50 > > >> [<ffffffff8109607e>] synchronize_sched+0x2e/0x50 > > >> [<ffffffffa00385ac>] nf_nat_net_exit+0x2c/0x50 [nf_nat] > > > > > > I guess the problem is in nf_nat_net_exit, > > > may be other nf exit handlers too. pernet-exit handlers > > > should avoid synchronize_rcu and rcu_barrier. > > > A RCU callback and rcu_barrier in module-exit is the way > > > to go. cleanup_net includes rcu_barrier, so pernet-exit > > > does not need such calls. > > > > In principle I agree, however in this particular case it looks a bit > > tricky because a separate hash table to track nat state per network > > namespace. > > > > At the same time all of the packets should be drained before > > we get to nf_nat_net_exit so it doesn't look the synchronize_rcu > > in nf_nat_exit is actually protecting anything. > > Hmm, the problem is with the conntrack entries living in the netns being > destroyed. > > I don't think they are guaranteed to be removed by the time > the nat netns exit function runs. > > > Further calling a rcu delay function in net_exit methods largely > > destroys the batched cleanup of network namespaces, so it is very > > unpleasant. > > > > Could someone who knows nf_nat_core.c better than I do look and > > see if we can just remove the synchronize_rcu in nf_nat_exit? 
> > If I remember correctly its needed to ensure that > all conntracks with nat extensions that might still be referenced > on other cpu have finished (i.e., nf_conntrack_destroy() has been > called, which calls nf_nat_cleanup_conntrack() which deletes > the extension from the hash table). > > As we remove the ct from that table ourselves EXCEPT in the > case where we cannot steal the timers' reference we should > be able to avoid that call virtually every time. > > Perhaps this is worth a shot (not even compile tested): > > diff --git a/net/netfilter/nf_nat_core.c b/net/netfilter/nf_nat_core.c > index 4e0b478..80cfe10 100644 > --- a/net/netfilter/nf_nat_core.c > +++ b/net/netfilter/nf_nat_core.c > @@ -508,6 +508,7 @@ EXPORT_SYMBOL_GPL(nf_nat_packet); > struct nf_nat_proto_clean { > u8 l3proto; > u8 l4proto; > + bool need_sync_rcu; > }; > > /* kill conntracks with affected NAT section */ > @@ -528,23 +529,32 @@ static int nf_nat_proto_remove(struct nf_conn *i, void *data) > > static int nf_nat_proto_clean(struct nf_conn *ct, void *data) > { > + struct nf_nat_proto_clean *clean = data; > struct nf_conn_nat *nat = nfct_nat(ct); > > - if (nf_nat_proto_remove(ct, data)) > - return 1; > - > if (!nat || !nat->ct) > return 0; > > - /* This netns is being destroyed, and conntrack has nat null binding. > + /* This netns is being destroyed, and conntrack has nat binding. > * Remove it from bysource hash, as the table will be freed soon. > * > - * Else, when the conntrack is destoyed, nf_nat_cleanup_conntrack() > + * Else, when the conntrack is destroyed, nf_nat_cleanup_conntrack() > * will delete entry from already-freed table. > */ > - if (!del_timer(&ct->timeout)) > + if (!del_timer(&ct->timeout)) { > + /* We have nat binding, but destruction > + * might already be in progress. > + * > + * nat entry is removed only after last > + * nf_ct_put(). > + */ > + clean->need_sync_rcu = true; So this happens only if we race with the timer handler? 
If so, this patch might give good speedups.

(Can't comment on any other correctness issues due to unfamiliarity
with the code and what it is trying to do.)

							Thanx, Paul

>  		return 1;
> +	}
>  
> +	/* We stole refcount owned by timer;
> +	 * conntrack cannot go away.
> +	 */
>  	spin_lock_bh(&nf_nat_lock);
>  	hlist_del_rcu(&nat->bysource);
>  	ct->status &= ~IPS_NAT_DONE_MASK;
> @@ -553,6 +563,9 @@ static int nf_nat_proto_clean(struct nf_conn *ct, void *data)
>  
>  	add_timer(&ct->timeout);
>  
> +	if (nf_nat_proto_remove(ct, data))
> +		return 1;
> +
>  	/* don't delete conntrack.  Although that would make things a lot
>  	 * simpler, we'd end up flushing all conntracks on nat rmmod.
>  	 */
> @@ -830,7 +843,8 @@ static void __net_exit nf_nat_net_exit(struct net *net)
>  	struct nf_nat_proto_clean clean = {};
>  
>  	nf_ct_iterate_cleanup(net, nf_nat_proto_clean, &clean, 0, 0);
> -	synchronize_rcu();
> +	if (clean.need_sync_rcu)
> +		synchronize_rcu();
>  	nf_ct_free_hashtable(net->ct.nat_bysource, net->ct.nat_htable_size);
>  }

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: net_ns cleanup / RCU overhead
  2014-08-29 21:57 ` Eric W. Biederman
  2014-08-29 23:52   ` Florian Westphal
@ 2014-08-30  8:20   ` Julian Anastasov
  1 sibling, 0 replies; 12+ messages in thread
From: Julian Anastasov @ 2014-08-30 8:20 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Simon Kirby, Paul E. McKenney, linux-kernel, netdev,
	Florian Westphal, Pablo Neira Ayuso

	Hello,

On Fri, 29 Aug 2014, Eric W. Biederman wrote:

> > I guess the problem is in nf_nat_net_exit,
> > may be other nf exit handlers too. pernet-exit handlers
> > should avoid synchronize_rcu and rcu_barrier.
> > A RCU callback and rcu_barrier in module-exit is the way
> > to go. cleanup_net includes rcu_barrier, so pernet-exit
> > does not need such calls.
> 
> In principle I agree, however in this particular case it looks a bit
> tricky because a separate hash table to track nat state per network
> namespace.

	It is still possible for the module's pernet-init handler to
attach to net->ct... a special structure holding all pointers that
should be freed by an RCU callback for the module, like the hash table.
For example:

	struct netns_ct_nat_rcu_allocs {
		struct rcu_head		rcu_head;
		struct hlist_head	*nat_bysource;
		unsigned int		nat_htable_size;
	};

- pernet-init:
	- allocate the structure, attach it to net->ct.nat_rcu_allocs
	- the original nat_bysource place remains because we want to
	  avoid a net->ct.nat_rcu_allocs dereference.
- pernet-exit:
	- copy nat_bysource and nat_htable_size to net->ct.nat_rcu_allocs;
	  this can be done even in the above pernet-init function
	- call_rcu(&net->ct.nat_rcu_allocs->rcu_head, nat_rcu_free);

- cleanup_net:
	- rcu_barrier()

- RCU callback (nat_rcu_free):
	- call nf_ct_free_hashtable
	- kfree the structure

- cleanup_net:
	- drop netns after rcu_barrier

	Due to the rcu_barrier in cleanup_net it is even possible to
provide a per-module rcu_head instead of using an allocated structure,
for example:

	call_rcu(&net->ct.nat_rcu_head, nat_rcu_free);

	Then the nat_rcu_free function will just call nf_ct_free_hashtable
before cleanup_net drops the netns struct. In this case the memory price
is just one rcu_head for every module that uses an RCU callback.

> At the same time all of the packets should be drained before
> we get to nf_nat_net_exit so it doesn't look the synchronize_rcu
> in nf_nat_exit is actually protecting anything.

	It is true for cleanup_net. I don't remember: can we see packets
while the particular module-exit calls unregister_pernet_subsys()?
Maybe yes?

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: net_ns cleanup / RCU overhead
  2014-08-29  0:40 ` Simon Kirby
  2014-08-29  3:57   ` Julian Anastasov
@ 2014-08-30  2:52   ` Paul E. McKenney
  1 sibling, 0 replies; 12+ messages in thread
From: Paul E. McKenney @ 2014-08-30 2:52 UTC (permalink / raw)
  To: Simon Kirby; +Cc: Eric W. Biederman, linux-kernel, netdev

On Thu, Aug 28, 2014 at 05:40:29PM -0700, Simon Kirby wrote:
> On Thu, Aug 28, 2014 at 01:46:58PM -0700, Paul E. McKenney wrote:
> 
> > On Thu, Aug 28, 2014 at 03:33:42PM -0500, Eric W. Biederman wrote:
> > 
> > > I just want to add a little bit more analysis to this.
> > > 
> > > What we desire to be fast is the copy_net_ns, cleanup_net is batched and
> > > asynchronous which nothing really cares how long it takes except that
> > > cleanup_net holds the net_mutex and thus blocks copy_net_ns.
> > > 
> > > The puzzle is why and which rcu delays Simon is seeing in the network
> > > namespace cleanup path, as it seems like the synchronize_rcu is not
> > > the only one, and in the case of vsftp with trivial network namespaces
> > > where nothing has been done we should not need to delay.
> > 
> > Indeed, given the version and .config, I can't see why any individual
> > RCU grace-period operation would be particularly slow.
> > 
> > I suggest using ftrace on synchronize_rcu() and friends.
> 
> I made a parallel net namespace create/destroy benchmark that prints the
> progress and time to create and cleanup 32 unshare()d child processes:
> 
> http://0x.ca/sim/ref/tools/netnsbench.c
> 
> I noticed that if I haven't run it for a while, the first batch often is
> fast, followed by slowness from then on:
> 
> ++++++++++++++++++++++++++++++++-------------------------------- 0.039478s
> ++++++++++++++++++++-----+----------------+++++++++---------++-- 4.463837s
> +++++++++++++++++++++++++------+--------------------++++++------ 3.011882s
> +++++++++++++++---+-------------++++++++++++++++---------------- 2.283993s
> 
> Fiddling around on a stock kernel, "echo 1 > /sys/kernel/rcu_expedited"
> makes behaviour change as it did with my patch:
> 
> ++-++-+++-+-----+-+-++-+-++--++-+--+-+-++--++-+-+-+-++-+--++---- 0.801406s
> +-+-+-++-+-+-+-+-++--+-+-++-+--++-+-+-+-+-+-+-+-+-+-+-+--++-+--- 0.872011s
> ++--+-++--+-++--+-++--+-+-+-+-++-+--++--+-++-+-+-+-+--++-+-+-+-- 0.946745s
> 
> How would I use ftrace on synchronize_rcu() here?

http://lwn.net/Articles/370423/ is your friend here.  If your kernel
is built with the needed configuration, you give the command
"echo synchronize_rcu > set_ftrace_filter".

http://lwn.net/Articles/365835/ and http://lwn.net/Articles/366796/
have background info.

> As Eric said, cleanup_net() is batched, but while it is cleaning up,
> net_mutex is held. Isn't the issue just that net_mutex is held while
> some other things are going on that are meant to be lazy / batched?
> 
> What is net_mutex protecting in cleanup_net()?
> 
> I noticed that [kworker/u16:0]'s stack is often:
> 
> [<ffffffff810942a6>] wait_rcu_gp+0x46/0x50
> [<ffffffff8109607e>] synchronize_sched+0x2e/0x50
> [<ffffffffa00385ac>] nf_nat_net_exit+0x2c/0x50 [nf_nat]
> [<ffffffff81720339>] ops_exit_list.isra.4+0x39/0x60
> [<ffffffff817209e0>] cleanup_net+0xf0/0x1a0
> [<ffffffff81062997>] process_one_work+0x157/0x440
> [<ffffffff81063303>] worker_thread+0x63/0x520
> [<ffffffff81068b96>] kthread+0xd6/0xf0
> [<ffffffff818d412c>] ret_from_fork+0x7c/0xb0
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> and
> 
> [<ffffffff81095364>] _rcu_barrier+0x154/0x1f0
> [<ffffffff81095450>] rcu_barrier+0x10/0x20
> [<ffffffff81102c2c>] kmem_cache_destroy+0x6c/0xb0
> [<ffffffffa0089e97>] nf_conntrack_cleanup_net_list+0x167/0x1c0 [nf_conntrack]
> [<ffffffffa008aab5>] nf_conntrack_pernet_exit+0x65/0x70 [nf_conntrack]
> [<ffffffff81720353>] ops_exit_list.isra.4+0x53/0x60
> [<ffffffff817209e0>] cleanup_net+0xf0/0x1a0
> [<ffffffff81062997>] process_one_work+0x157/0x440
> [<ffffffff81063303>] worker_thread+0x63/0x520
> [<ffffffff81068b96>] kthread+0xd6/0xf0
> [<ffffffff818d412c>] ret_from_fork+0x7c/0xb0
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> So I tried flushing iptables rules and rmmoding netfilter bits:
> 
> ++++++++++++++++++++-+--------------------+++++++++++----------- 0.179940s
> ++++++++++++++--+-------------+++++++++++++++++----------------- 0.151988s
> ++++++++++++++++++++++++++++---+--------------------------+++--- 0.159967s
> ++++++++++++++++++++++----------------------++++++++++---------- 0.175964s
> 
> Expedited:
> 
> ++-+--++-+-+-+-+-+-+--++-+-+-++-+-+-+--++-+-+-+-+-+-+-+-+-+-+--- 0.079988s
> ++-+-+-+-+-+-+-+-+-+-+-+--++-+--++-+--+-++-+-+--++-+-+-+-+-+-+-- 0.089347s
> ++++--+++--++--+-+++++++-+++++--------------++-+-+--++-+-+--++-- 0.081566s
> +++++-+++-------++-+-+-+-+-+-+-+-+-+-+-++-+-+-+-+-+-+-+-+-+-+--- 0.089026s
> 
> So, much faster.
> It seems that just loading nf_conntrack_ipv4 (like by
> running iptables -t nat -nvL) is enough to slow it way down. But it is
> still capable of being fast, as above.

My first guess is that this code sequence is calling synchronize_rcu()
quite often.  Would it be possible to consolidate these?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 12+ messages in thread
end of thread, other threads:[~2014-08-30  8:21 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-08-20  5:58 net_ns cleanup / RCU overhead Simon Kirby
2014-08-28 19:24 ` Paul E. McKenney
2014-08-28 19:44   ` Simon Kirby
2014-08-28 20:33     ` Eric W. Biederman
2014-08-28 20:46       ` Paul E. McKenney
2014-08-29  0:40         ` Simon Kirby
2014-08-29  3:57           ` Julian Anastasov
2014-08-29 21:57             ` Eric W. Biederman
2014-08-29 23:52               ` Florian Westphal
2014-08-30  2:56                 ` Paul E. McKenney
2014-08-30  8:20               ` Julian Anastasov
2014-08-30  2:52           ` Paul E. McKenney