From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1751041AbaH3Cwc (ORCPT );
	Fri, 29 Aug 2014 22:52:32 -0400
Received: from e39.co.us.ibm.com ([32.97.110.160]:50327 "EHLO e39.co.us.ibm.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1750878AbaH3Cwb (ORCPT );
	Fri, 29 Aug 2014 22:52:31 -0400
Date: Fri, 29 Aug 2014 19:52:25 -0700
From: "Paul E. McKenney"
To: Simon Kirby
Cc: "Eric W. Biederman", linux-kernel@vger.kernel.org, netdev@vger.kernel.org
Subject: Re: net_ns cleanup / RCU overhead
Message-ID: <20140830025225.GC5001@linux.vnet.ibm.com>
Reply-To: paulmck@linux.vnet.ibm.com
References: <20140820055855.GB5579@hostway.ca>
	<20140828192431.GF5001@linux.vnet.ibm.com>
	<20140828194422.GB8867@hostway.ca>
	<87oav4l5g9.fsf@x220.int.ebiederm.org>
	<20140828204658.GL5001@linux.vnet.ibm.com>
	<20140829004029.GA18300@hostway.ca>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20140829004029.GA18300@hostway.ca>
User-Agent: Mutt/1.5.21 (2010-09-15)
X-TM-AS-MML: disable
X-Content-Scanned: Fidelis XPS MAILER
x-cbid: 14083002-9332-0000-0000-000001DA9C0B
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Aug 28, 2014 at 05:40:29PM -0700, Simon Kirby wrote:
> On Thu, Aug 28, 2014 at 01:46:58PM -0700, Paul E. McKenney wrote:
> 
> > On Thu, Aug 28, 2014 at 03:33:42PM -0500, Eric W. Biederman wrote:
> > 
> > > I just want to add a little bit more analysis to this.
> > > 
> > > What we desire to be fast is copy_net_ns; cleanup_net is batched and
> > > asynchronous, and nothing really cares how long it takes, except that
> > > cleanup_net holds the net_mutex and thus blocks copy_net_ns.
> > > 
> > > The puzzle is why and which RCU delays Simon is seeing in the network
> > > namespace cleanup path, as it seems like the synchronize_rcu is not
> > > the only one, and in the case of vsftp with trivial network namespaces
> > > where nothing has been done we should not need to delay.
> > 
> > Indeed, given the version and .config, I can't see why any individual
> > RCU grace-period operation would be particularly slow.
> > 
> > I suggest using ftrace on synchronize_rcu() and friends.
> 
> I made a parallel net-namespace create/destroy benchmark that prints the
> progress and the time to create and clean up 32 unshare()d child
> processes:
> 
> http://0x.ca/sim/ref/tools/netnsbench.c
> 
> I noticed that if I haven't run it for a while, the first batch is often
> fast, followed by slowness from then on:
> 
> ++++++++++++++++++++++++++++++++-------------------------------- 0.039478s
> ++++++++++++++++++++-----+----------------+++++++++---------++-- 4.463837s
> +++++++++++++++++++++++++------+--------------------++++++------ 3.011882s
> +++++++++++++++---+-------------++++++++++++++++---------------- 2.283993s
> 
> Fiddling around on a stock kernel, "echo 1 > /sys/kernel/rcu_expedited"
> makes the behaviour change as it did with my patch:
> 
> ++-++-+++-+-----+-+-++-+-++--++-+--+-+-++--++-+-+-+-++-+--++---- 0.801406s
> +-+-+-++-+-+-+-+-++--+-+-++-+--++-+-+-+-+-+-+-+-+-+-+-+--++-+--- 0.872011s
> ++--+-++--+-++--+-++--+-+-+-+-++-+--++--+-++-+-+-+-+--++-+-+-+-- 0.946745s
> 
> How would I use ftrace on synchronize_rcu() here?

http://lwn.net/Articles/370423/ is your friend here.  If your kernel is
built with the needed configuration, you give the command
"echo synchronize_rcu > set_ftrace_filter".

http://lwn.net/Articles/365835/ and http://lwn.net/Articles/366796/ have
background info.

> As Eric said, cleanup_net() is batched, but while it is cleaning up,
> net_mutex is held.
Isn't the issue just that net_mutex is held while
> some other things are going on that are meant to be lazy / batched?
> 
> What is net_mutex protecting in cleanup_net()?
> 
> I noticed that [kworker/u16:0]'s stack is often:
> 
> [] wait_rcu_gp+0x46/0x50
> [] synchronize_sched+0x2e/0x50
> [] nf_nat_net_exit+0x2c/0x50 [nf_nat]
> [] ops_exit_list.isra.4+0x39/0x60
> [] cleanup_net+0xf0/0x1a0
> [] process_one_work+0x157/0x440
> [] worker_thread+0x63/0x520
> [] kthread+0xd6/0xf0
> [] ret_from_fork+0x7c/0xb0
> [] 0xffffffffffffffff
> 
> and
> 
> [] _rcu_barrier+0x154/0x1f0
> [] rcu_barrier+0x10/0x20
> [] kmem_cache_destroy+0x6c/0xb0
> [] nf_conntrack_cleanup_net_list+0x167/0x1c0 [nf_conntrack]
> [] nf_conntrack_pernet_exit+0x65/0x70 [nf_conntrack]
> [] ops_exit_list.isra.4+0x53/0x60
> [] cleanup_net+0xf0/0x1a0
> [] process_one_work+0x157/0x440
> [] worker_thread+0x63/0x520
> [] kthread+0xd6/0xf0
> [] ret_from_fork+0x7c/0xb0
> [] 0xffffffffffffffff
> 
> So I tried flushing the iptables rules and rmmoding the netfilter bits:
> 
> ++++++++++++++++++++-+--------------------+++++++++++----------- 0.179940s
> ++++++++++++++--+-------------+++++++++++++++++----------------- 0.151988s
> ++++++++++++++++++++++++++++---+--------------------------+++--- 0.159967s
> ++++++++++++++++++++++----------------------++++++++++---------- 0.175964s
> 
> Expedited:
> 
> ++-+--++-+-+-+-+-+-+--++-+-+-++-+-+-+--++-+-+-+-+-+-+-+-+-+-+--- 0.079988s
> ++-+-+-+-+-+-+-+-+-+-+-+--++-+--++-+--+-++-+-+--++-+-+-+-+-+-+-- 0.089347s
> ++++--+++--++--+-+++++++-+++++--------------++-+-+--++-+-+--++-- 0.081566s
> +++++-+++-------++-+-+-+-+-+-+-+-+-+-+-++-+-+-+-+-+-+-+-+-+-+--- 0.089026s
> 
> So, much faster. It seems that just loading nf_conntrack_ipv4 (for
> example by running iptables -t nat -nvL) is enough to slow it way down.
> But it is still capable of being fast, as above.

My first guess is that this code sequence is calling synchronize_rcu()
quite often.  Would it be possible to consolidate these calls?

							Thanx, Paul
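[Editor's note: the ftrace suggestion in the message above can be sketched as a tracefs configuration fragment. This is an assumption-laden sketch, not from the thread itself: it assumes a kernel built with CONFIG_FUNCTION_GRAPH_TRACER, tracefs available at /sys/kernel/debug/tracing, root privileges, and that netnsbench is the benchmark Simon linked. It is system configuration rather than a portable script, so it is not meant to run outside a suitable kernel.]

```shell
# Sketch: time each synchronize_rcu() call during the netns benchmark.
# Assumes root, CONFIG_FUNCTION_GRAPH_TRACER=y, and tracefs mounted here.
cd /sys/kernel/debug/tracing

# Restrict tracing to synchronize_rcu, as suggested in the reply above.
echo synchronize_rcu > set_ftrace_filter

# The function_graph tracer records entry/exit, so the trace output
# includes the duration of each traced call.
echo function_graph > current_tracer
echo 1 > tracing_on

# ... run the workload (e.g. the netnsbench.c binary) in another shell ...

echo 0 > tracing_on
cat trace        # per-call durations for synchronize_rcu()

# Clean up: restore the default tracer and clear the filter.
echo nop > current_tracer
echo > set_ftrace_filter
```

The per-call durations in `trace` should make it clear whether the slowdown comes from many grace-period waits in cleanup_net()'s ops_exit_list pass (consistent with the stack traces above) or from a few unusually long ones.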