From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Fri, 11 Nov 2016 16:23:47 -0800
From: "Paul E. McKenney"
Reply-To: paulmck@linux.vnet.ibm.com
To: Rolf Neugebauer
Cc: Cong Wang, LKML, Linux Kernel Network Developers, Justin Cormack, Ian Campbell
Subject: Re: Long delays creating a netns after deleting one (possibly RCU related)
References: <20161110212404.GB4127@linux.vnet.ibm.com>
Message-Id: <20161112002347.GL4127@linux.vnet.ibm.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Nov 11, 2016 at 01:11:01PM +0000,
Rolf Neugebauer wrote:
> On Thu, Nov 10, 2016 at 9:24 PM, Paul E. McKenney wrote:
> > On Thu, Nov 10, 2016 at 09:37:47AM -0800, Cong Wang wrote:
> >> (Cc'ing Paul)
> >>
> >> On Wed, Nov 9, 2016 at 7:42 AM, Rolf Neugebauer wrote:
> >> > Hi
> >> >
> >> > We noticed some long delays starting Docker containers on some newer
> >> > kernels (starting with 4.5.x and still present in 4.9-rc4; 4.4.x is
> >> > fine). We narrowed this down to the creation of a network namespace
> >> > being delayed directly after removing another one (details and
> >> > reproduction below). We have seen delays of up to 60s on some systems.
> >> >
> >> > - The delay is proportional to the number of CPUs (online or offline).
> >> >   We first discovered it with a Hyper-V Linux VM. Hyper-V advertises up
> >> >   to 240 offline vCPUs even if one configures the VM with only, say, 2
> >> >   vCPUs. We see a linear increase in delay when we change NR_CPUS in
> >> >   the kernel config.
> >> >
> >> > - The delay also depends on some tunnel network interfaces being
> >> >   present (which we had compiled in in one of our kernel configs).
> >> >
> >> > - We can reproduce this issue with stock kernels from
> >> >   http://kernel.ubuntu.com/~kernel-ppa/mainline/ running in Hyper-V
> >> >   VMs as well as on other hypervisors like qemu and hyperkit, where
> >> >   we have good control over the number of CPUs.
> >> >
> >> > A simple test is:
> >> >
> >> >     modprobe ipip
> >> >     modprobe ip_gre
> >> >     modprobe ip_vti
> >> >     echo -n "add netns foo ===> "; /usr/bin/time -f "%E" ip netns add foo
> >> >     echo -n "del netns foo ===> "; /usr/bin/time -f "%E" ip netns delete foo
> >> >     echo -n "add netns bar ===> "; /usr/bin/time -f "%E" ip netns add bar
> >> >     echo -n "del netns bar ===> "; /usr/bin/time -f "%E" ip netns delete bar
> >> >
> >> > with output like:
> >> >
> >> >     add netns foo ===> 0:00.00
> >> >     del netns foo ===> 0:00.01
> >> >     add netns bar ===> 0:08.53
> >> >     del netns bar ===> 0:00.01
> >> >
> >> > This is on a 4.9-rc4 kernel from the above URL configured with
> >> > NR_CPUS=256 running in a Hyper-V VM (kernel config attached).
> >> >
> >> > Below is a dump of the work queues while the second 'ip netns add' is
> >> > hanging. The state of the work queues does not seem to change while
> >> > the command is delayed, and the pattern shown is consistent across
> >> > different kernel versions.
> >> >
> >> > Is this a known issue and/or is someone working on a fix?
> >>
> >> Not known to me.
> >>
> >> > [  610.356272] sysrq: SysRq : Show Blocked State
> >> > [  610.356742]   task                        PC stack   pid father
> >> > [  610.357252] kworker/u480:1  D    0  1994      2 0x00000000
> >> > [  610.357752] Workqueue: netns cleanup_net
> >> > [  610.358239] ffff9892f1065800 0000000000000000 ffff9892ee1e1e00 ffff9892f8e59340
> >> > [  610.358705] ffff9892f4526900 ffffbf0104b5ba88 ffffffffbe486df3 ffffbf0104b5ba60
> >> > [  610.359168] 00ffffffbdcbe663 ffff9892f8e59340 0000000100012e70 ffff9892ee1e1e00
> >> > [  610.359677] Call Trace:
> >> > [  610.360169]  [] ? __schedule+0x233/0x6e0
> >> > [  610.360723]  [] schedule+0x36/0x80
> >> > [  610.361194]  [] schedule_timeout+0x22a/0x3f0
> >> > [  610.361789]  [] ? __schedule+0x23b/0x6e0
> >> > [  610.362260]  [] wait_for_completion+0xb4/0x140
> >> > [  610.362736]  [] ? wake_up_q+0x80/0x80
> >> > [  610.363306]  [] __wait_rcu_gp+0xc8/0xf0
> >> > [  610.363782]  [] synchronize_sched+0x5c/0x80
> >> > [  610.364137]  [] ? call_rcu_bh+0x20/0x20
> >> > [  610.364742]  [] ? trace_raw_output_rcu_utilization+0x60/0x60
> >> > [  610.365337]  [] synchronize_net+0x1c/0x30
> >>
> >> This is a worker which holds the net_mutex and is waiting for
> >> an RCU grace period to elapse.

Ah!  This net_mutex is different from RTNL.  Should synchronize_net()
be modified to check for net_mutex being held in addition to the
current checks for RTNL being held?

							Thanx, Paul

> >> > [  610.365846]  [] netif_napi_del+0x23/0x80
> >> > [  610.367494]  [] ip_tunnel_dev_free+0x68/0xf0 [ip_tunnel]
> >> > [  610.368007]  [] netdev_run_todo+0x230/0x330
> >> > [  610.368454]  [] rtnl_unlock+0xe/0x10
> >> > [  610.369001]  [] ip_tunnel_delete_net+0xdf/0x120 [ip_tunnel]
> >> > [  610.369500]  [] ipip_exit_net+0x2c/0x30 [ipip]
> >> > [  610.369997]  [] ops_exit_list.isra.4+0x38/0x60
> >> > [  610.370636]  [] cleanup_net+0x1c4/0x2b0
> >> > [  610.371130]  [] process_one_work+0x1fc/0x4b0
> >> > [  610.371812]  [] worker_thread+0x4b/0x500
> >> > [  610.373074]  [] ? process_one_work+0x4b0/0x4b0
> >> > [  610.373622]  [] ? process_one_work+0x4b0/0x4b0
> >> > [  610.374100]  [] kthread+0xd9/0xf0
> >> > [  610.374574]  [] ? kthread_park+0x60/0x60
> >> > [  610.375198]  [] ret_from_fork+0x25/0x30
> >> > [  610.375678] ip              D    0  2149   2148 0x00000000
> >> > [  610.376185] ffff9892f0a99000 0000000000000000 ffff9892f0a66900
> >> > [  610.376185] ffff9892f8e59340
> >> > [  610.376185] ffff9892f4526900 ffffbf0101173db8 ffffffffbe486df3
> >> > [  610.376753] 00000005fecffd76
> >> > [  610.376762] 00ff9892f11d9820 ffff9892f8e59340 ffff989200000000 ffff9892f0a66900
> >> > [  610.377274] Call Trace:
> >> > [  610.377789]  [] ? __schedule+0x233/0x6e0
> >> > [  610.378306]  [] schedule+0x36/0x80
> >> > [  610.378992]  [] schedule_preempt_disabled+0xe/0x10
> >> > [  610.379514]  [] __mutex_lock_slowpath+0xb9/0x130
> >> > [  610.380031]  [] ? __kmalloc+0x162/0x1e0
> >> > [  610.380556]  [] mutex_lock+0x1f/0x30
> >> > [  610.381135]  [] copy_net_ns+0x9f/0x170
> >> > [  610.381647]  [] create_new_namespaces+0x11b/0x200
> >> > [  610.382249]  [] unshare_nsproxy_namespaces+0x5a/0xb0
> >> > [  610.382818]  [] SyS_unshare+0x1cd/0x360
> >> > [  610.383319]  [] entry_SYSCALL_64_fastpath+0x1e/0xad
> >>
> >> This process is apparently waiting for the net_mutex held by the
> >> previous one.
> >>
> >> Either the RCU implementation is broken or something else is missing.
> >> Do you have more stack traces of related processes?  For example,
> >> rcu_tasks_kthread.  Anything you can do to help narrow down the
> >> problem would be great.
> >
> > Did you set the rcu_normal boot parameter?  Doing so would have this
> > effect.
> >
> > (It is intended for real-time users who don't like expedited grace
> > periods.)
>
> rcu_normal is not set on the kernel command line, and
> /sys/kernel/rcu_normal and /sys/kernel/rcu_expedited both show 0.
>
> > 							Thanx, Paul
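
For context, the change Paul asks about above might look roughly like the
following. This is an illustrative, uncompiled sketch against the 4.9-era
synchronize_net() in net/core/dev.c, not an actual patch from this thread;
it assumes the net_mutex definition from net/core/net_namespace.c is
visible at this point.

```c
/* Sketch only -- not the actual upstream fix.  The 4.9-era
 * synchronize_net() already expedites the RCU grace period when the
 * caller holds RTNL; the added mutex_is_locked(&net_mutex) test is the
 * modification Paul is asking about.  Note that mutex_is_locked()
 * reports whether *any* task holds the mutex, not whether the current
 * task does, so this is only an approximation of "held by us".
 */
void synchronize_net(void)
{
	might_sleep();
	if (rtnl_is_locked() || mutex_is_locked(&net_mutex))
		synchronize_rcu_expedited();
	else
		synchronize_rcu();
}
```

The intent is that a cleanup_net() worker holding net_mutex would get an
expedited grace period instead of blocking every other netns operation
behind a full synchronize_rcu(), which on a box advertising hundreds of
(offline) CPUs can take the seconds observed above.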