From mboxrd@z Thu Jan  1 00:00:00 1970
From: "Paul E. McKenney"
Subject: Re: [PATCH] team: add rescheduling jiffy delay on !rtnl_trylock
Date: Wed, 1 Oct 2014 23:43:08 -0700
Message-ID: <20141002064308.GN5015@linux.vnet.ibm.com>
References: <20140929115445.40221d8e@jlaw-desktop.mno.stratus.com> <20140929160601.GD15925@htj.dyndns.org>
Reply-To: paulmck@linux.vnet.ibm.com
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20140929160601.GD15925@htj.dyndns.org>
To: Tejun Heo
Cc: Joe Lawrence, netdev@vger.kernel.org, Jiri Pirko
Sender: netdev-owner@vger.kernel.org

On Mon, Sep 29, 2014 at 12:06:01PM -0400, Tejun Heo wrote:
> (cc'ing Paul and quoting the whole body)
>
> Paul, this is a fix for an RCU sched stall observed w/ a work item
> requeueing itself while waiting for an RCU grace period.  As the
> self-requeueing work item ends up being executed by the same kworker,
> the worker task never stops running in the absence of a higher
> priority task, and it seems to delay the RCU grace period for a very
> long time on !PREEMPT kernels.  As each work item denotes a boundary
> which no synchronization construct stretches across, I wonder whether
> it'd be a good idea to add a notification for the end of the RCU
> critical section between executions of work items.

It sounds like a great idea to me!  I suggest invoking
rcu_note_context_switch() between executions of work items.

							Thanx, Paul

> Thanks.
>
> On Mon, Sep 29, 2014 at 11:54:45AM -0400, Joe Lawrence wrote:
> > Hello Jiri,
> >
> > I've been debugging a hang on RHEL7 that seems to originate in the
> > teaming driver and the team_notify_peers_work/team_mcast_rejoin_work
> > rtnl_trylock rescheduling logic.  Running a stand-alone minimal
> > driver mimicking the same schedule_delayed_work(.., 0) reproduces
> > the problem on RHEL7 and upstream kernels [1].
> >
> > A quick summary of the hang:
> >
> > 1 - systemd-udevd issues an ioctl that heads down dev_ioctl (grabs
> >     the rtnl_mutex), dev_ifsioc, dev_change_name, and finally
> >     synchronize_sched.  In every vmcore I've taken of the hang,
> >     this thread is waiting on RCU.
> >
> > 2 - A kworker thread goes to 100% CPU.
> >
> > 3 - Inspecting the running thread on the CPU that rcusched reported
> >     as holding up the RCU grace period usually shows it in either
> >     team_notify_peers_work, team_mcast_rejoin_work, or somewhere in
> >     the workqueue code (process_one_work).  This is the same
> >     CPU/thread as #2.
> >
> > 4 - team_notify_peers_work and team_mcast_rejoin_work want the
> >     rtnl_lock that systemd-udevd in #1 holds, so they try to play
> >     nice by calling rtnl_trylock and rescheduling on failure.
> >     Unfortunately, with a 0 jiffy delay, process_one_work will
> >     "execute immediately" (i.e., after others already in the queue,
> >     but before the next tick).  With the stock RHEL7
> >     !CONFIG_PREEMPT at least, this creates a tight loop on
> >     process_one_work + rtnl_trylock that spins the CPU in #2.
> >
> > 5 - Sometime minutes later, RCU seems to be kicked by a side effect
> >     of a smp_apic_timer_interrupt.  (This was the only other
> >     interesting function reported by the ftrace function tracer.)
> >
> > See the patch below for a potential workaround.  Giving at least 1
> > jiffy should give process_one_work some breathing room before
> > calling back into team_notify_peers_work/team_mcast_rejoin_work and
> > attempting to acquire the rtnl_lock mutex.
> >
> > Regards,
> >
> > -- Joe
> >
> > [1] http://marc.info/?l=linux-kernel&m=141192244232345&w=2
> >
> > -->8--- -->8--- -->8--- -->8---
> >
> > From fc5bbf5771b5732f7479ac6e84bbfdde05710023 Mon Sep 17 00:00:00 2001
> > From: Joe Lawrence
> > Date: Mon, 29 Sep 2014 11:09:05 -0400
> > Subject: [PATCH] team: add rescheduling jiffy delay on !rtnl_trylock
> >
> > Give the CPU running the kworker handling the team_notify_peers_work
> > and team_mcast_rejoin_work functions some scheduling air by
> > specifying a non-zero delay.
> >
> > Signed-off-by: Joe Lawrence
> > ---
> >  drivers/net/team/team.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
> > index ef10302..d46df38 100644
> > --- a/drivers/net/team/team.c
> > +++ b/drivers/net/team/team.c
> > @@ -633,7 +633,7 @@ static void team_notify_peers_work(struct work_struct *work)
> >  	team = container_of(work, struct team, notify_peers.dw.work);
> >
> >  	if (!rtnl_trylock()) {
> > -		schedule_delayed_work(&team->notify_peers.dw, 0);
> > +		schedule_delayed_work(&team->notify_peers.dw, 1);
> >  		return;
> >  	}
> >  	call_netdevice_notifiers(NETDEV_NOTIFY_PEERS, team->dev);
> > @@ -673,7 +673,7 @@ static void team_mcast_rejoin_work(struct work_struct *work)
> >  	team = container_of(work, struct team, mcast_rejoin.dw.work);
> >
> >  	if (!rtnl_trylock()) {
> > -		schedule_delayed_work(&team->mcast_rejoin.dw, 0);
> > +		schedule_delayed_work(&team->mcast_rejoin.dw, 1);
> >  		return;
> >  	}
> >  	call_netdevice_notifiers(NETDEV_RESEND_IGMP, team->dev);
> > --
> > 1.7.10.4
> >
>
> --
> tejun
>