From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: 
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1752500AbbCKUlQ (ORCPT );
	Wed, 11 Mar 2015 16:41:16 -0400
Received: from e35.co.us.ibm.com ([32.97.110.153]:49195 "EHLO e35.co.us.ibm.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1751307AbbCKUlO (ORCPT );
	Wed, 11 Mar 2015 16:41:14 -0400
Date: Wed, 11 Mar 2015 13:41:08 -0700
From: "Paul E. McKenney" 
To: Sasha Levin 
Cc: Josh Triplett , Steven Rostedt ,
	Mathieu Desnoyers , Lai Jiangshan ,
	LKML , Dave Jones 
Subject: Re: rcu: frequent rcu lockups
Message-ID: <20150311204108.GM5412@linux.vnet.ibm.com>
Reply-To: paulmck@linux.vnet.ibm.com
References: <55009E2C.7070100@oracle.com>
 <20150311201704.GL5412@linux.vnet.ibm.com>
 <5500A329.9060509@oracle.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <5500A329.9060509@oracle.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
X-TM-AS-MML: disable
X-Content-Scanned: Fidelis XPS MAILER
x-cbid: 15031120-0013-0000-0000-0000095B1DB7
Sender: linux-kernel-owner@vger.kernel.org
List-ID: 
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Mar 11, 2015 at 04:18:49PM -0400, Sasha Levin wrote:
> On 03/11/2015 04:17 PM, Paul E. McKenney wrote:
> > On Wed, Mar 11, 2015 at 03:57:32PM -0400, Sasha Levin wrote:
> >> Hi all,
> >> 
> >> I've started seeing the following hang pretty often during my fuzzing.
> >> The system proceeds to lock up after that.
> >> 
> >> [ 3209.655703] INFO: rcu_preempt detected stalls on CPUs/tasks:
> >> [ 3209.655703] Tasks blocked on level-1 rcu_node (CPUs 16-31):
> >> [ 3209.655703] (detected by 0, t=30502 jiffies, g=48799, c=48798, q=1730)
> >> [ 3209.655703] All QSes seen, last rcu_preempt kthread activity 1 (4295246069-4295246068), jiffies_till_next_fqs=1, root ->qsmask 0x2
> >> [ 3209.655703] trinity-c24     R  running task    26944  9338   9110 0x10080000
> >> [ 3209.655703]  0000000000002396 00000000e68fa48e ffff880050607dc8 ffffffffa427679b
> >> [ 3209.655703]  ffff880050607d98 ffffffffb1b36000 0000000000000001 00000001000440f4
> >> [ 3209.655703]  ffffffffb1b351c8 dffffc0000000000 ffff880050622000 ffffffffb1721000
> >> [ 3209.655703] Call Trace:
> >> [ 3209.655703]  sched_show_task (kernel/sched/core.c:4542)
> >> [ 3209.655703]  rcu_check_callbacks (kernel/rcu/tree.c:1225 kernel/rcu/tree.c:1331 kernel/rcu/tree.c:3389 kernel/rcu/tree.c:3453 kernel/rcu/tree.c:2683)
> >> [ 3209.655703]  ? acct_account_cputime (kernel/tsacct.c:168)
> >> [ 3209.655703]  update_process_times (./arch/x86/include/asm/preempt.h:22 kernel/time/timer.c:1386)
> >> [ 3209.655703]  tick_periodic (kernel/time/tick-common.c:92)
> >> [ 3209.655703]  ? tick_handle_periodic (kernel/time/tick-common.c:105)
> >> [ 3209.655703]  tick_handle_periodic (kernel/time/tick-common.c:105)
> >> [ 3209.655703]  local_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:891)
> >> [ 3209.655703]  smp_trace_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:934 include/linux/jump_label.h:114 ./arch/x86/include/asm/trace/irq_vectors.h:45 arch/x86/kernel/apic/apic.c:935)
> >> [ 3209.655703]  trace_apic_timer_interrupt (arch/x86/kernel/entry_64.S:920)
> >> [ 3209.655703]  ? add_wait_queue (include/linux/wait.h:116 kernel/sched/wait.c:29)
> >> [ 3209.655703]  ? _raw_spin_unlock_irqrestore (./arch/x86/include/asm/paravirt.h:809 include/linux/spinlock_api_smp.h:162 kernel/locking/spinlock.c:191)
> >> [ 3209.655703]  add_wait_queue (kernel/sched/wait.c:31)
> >> [ 3209.655703]  do_wait (kernel/exit.c:1473)
> >> [ 3209.655703]  ? trace_rcu_dyntick (include/trace/events/rcu.h:363 (discriminator 19))
> >> [ 3209.655703]  ? wait_consider_task (kernel/exit.c:1465)
> >> [ 3209.655703]  ? find_get_pid (kernel/pid.c:490)
> >> [ 3209.655703]  SyS_wait4 (kernel/exit.c:1618 kernel/exit.c:1586)
> >> [ 3209.655703]  ? perf_syscall_exit (kernel/trace/trace_syscalls.c:549)
> >> [ 3209.655703]  ? SyS_waitid (kernel/exit.c:1586)
> >> [ 3209.655703]  ? kill_orphaned_pgrp (kernel/exit.c:1444)
> >> [ 3209.655703]  ? syscall_trace_enter_phase2 (arch/x86/kernel/ptrace.c:1592)
> >> [ 3209.655703]  ? trace_hardirqs_on_thunk (arch/x86/lib/thunk_64.S:42)
> >> [ 3209.655703]  tracesys_phase2 (arch/x86/kernel/entry_64.S:347)
> > 
> > OK, that is not good.
> > 
> > What version are you running and what is your .config?
> 
> latest -next. .config attached.

Aha, I forgot to update rcu/next.  I have now updated it, so it should
make it there today or tomorrow.

In the meantime, does the following commit help?  Also, how quickly does
your test setup reproduce this?

							Thanx, Paul

------------------------------------------------------------------------

rcu: Yet another fix for preemption and CPU hotplug

As noted earlier, the following sequence of events can occur when
running PREEMPT_RCU and HOTPLUG_CPU on a system with a multi-level
rcu_node combining tree:

1.	A group of tasks block on CPUs corresponding to a given leaf
	rcu_node structure while within RCU read-side critical sections.
2.	All CPUs corresponding to that rcu_node structure go offline.
3.	The next grace period starts, but because there are still tasks
	blocked, the upper-level bits corresponding to this leaf
	rcu_node structure remain set.
4.	All the tasks exit their RCU read-side critical sections and
	remove themselves from the leaf rcu_node structure's list,
	leaving it empty.
5.	But because there now is code to check for this condition at
	force-quiescent-state time, the upper bits are cleared and the
	grace period completes.

However, there is another complication that can occur following step 4
above:

4a.	The grace period starts, and the leaf rcu_node structure's
	gp_tasks pointer is set to NULL because there are no tasks
	blocked on this structure.
4b.	One of the CPUs corresponding to the leaf rcu_node structure
	comes back online.
4c.	An endless stream of tasks are preempted within RCU read-side
	critical sections on this CPU, such that the ->blkd_tasks list
	is always non-empty.

The grace period will never end.

This commit therefore makes the force-quiescent-state processing check
only for absence of tasks blocking the current grace period rather than
absence of tasks altogether.  This will cause a quiescent state to be
reported if the current leaf rcu_node structure is not blocking the
current grace period and its parent thinks that it is, regardless of
how RCU managed to get itself into this state.

Signed-off-by: Paul E. McKenney 
Cc:  # 4.0.x

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index cc2e9bebf585..f9a56523e8dd 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2212,8 +2212,8 @@ static void rcu_report_unblock_qs_rnp(struct rcu_state *rsp,
 	unsigned long mask;
 	struct rcu_node *rnp_p;
 
-	WARN_ON_ONCE(rsp == &rcu_bh_state || rsp == &rcu_sched_state);
-	if (rnp->qsmask != 0 || rcu_preempt_blocked_readers_cgp(rnp)) {
+	if (rcu_state_p == &rcu_sched_state || rsp != rcu_state_p ||
+	    rnp->qsmask != 0 || rcu_preempt_blocked_readers_cgp(rnp)) {
 		raw_spin_unlock_irqrestore(&rnp->lock, flags);
 		return;  /* Still need more quiescent states! */
 	}
@@ -2221,9 +2221,8 @@ static void rcu_report_unblock_qs_rnp(struct rcu_state *rsp,
 	rnp_p = rnp->parent;
 	if (rnp_p == NULL) {
 		/*
-		 * Either there is only one rcu_node in the tree,
-		 * or tasks were kicked up to root rcu_node due to
-		 * CPUs going offline.
+		 * Only one rcu_node structure in the tree, so don't
+		 * try to report up to its nonexistent parent!
 		 */
 		rcu_report_qs_rsp(rsp, flags);
 		return;
@@ -2715,8 +2714,29 @@ static void force_qs_rnp(struct rcu_state *rsp,
 			return;
 		}
 		if (rnp->qsmask == 0) {
-			rcu_initiate_boost(rnp, flags); /* releases rnp->lock */
-			continue;
+			if (rcu_state_p == &rcu_sched_state ||
+			    rsp != rcu_state_p ||
+			    rcu_preempt_blocked_readers_cgp(rnp)) {
+				/*
+				 * No point in scanning bits because they
+				 * are all zero.  But we might need to
+				 * priority-boost blocked readers.
+				 */
+				rcu_initiate_boost(rnp, flags);
+				/* rcu_initiate_boost() releases rnp->lock */
+				continue;
+			}
+			if (rnp->parent &&
+			    (rnp->parent->qsmask & rnp->grpmask)) {
+				/*
+				 * Race between grace-period
+				 * initialization and task exiting RCU
+				 * read-side critical section: Report.
+				 */
+				rcu_report_unblock_qs_rnp(rsp, rnp, flags);
+				/* rcu_report_unblock_qs_rnp() rlses ->lock */
+				continue;
+			}
 		}
 		cpu = rnp->grplo;
 		bit = 1;
@@ -2731,15 +2751,6 @@ static void force_qs_rnp(struct rcu_state *rsp,
 		if (mask != 0) {
 			/* Idle/offline CPUs, report. */
 			rcu_report_qs_rnp(mask, rsp, rnp, flags);
-		} else if (rnp->parent &&
-			   list_empty(&rnp->blkd_tasks) &&
-			   !rnp->qsmask &&
-			   (rnp->parent->qsmask & rnp->grpmask)) {
-			/*
-			 * Race between grace-period initialization and task
-			 * existing RCU read-side critical section, report.
-			 */
-			rcu_report_unblock_qs_rnp(rsp, rnp, flags);
 		} else {
 			/* Nothing to do here, so just drop the lock. */
 			raw_spin_unlock_irqrestore(&rnp->lock, flags);