From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: 
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1752500AbbCKUlQ (ORCPT );
	Wed, 11 Mar 2015 16:41:16 -0400
Received: from e35.co.us.ibm.com ([32.97.110.153]:49195 "EHLO e35.co.us.ibm.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1751307AbbCKUlO (ORCPT );
	Wed, 11 Mar 2015 16:41:14 -0400
Date: Wed, 11 Mar 2015 13:41:08 -0700
From: "Paul E. McKenney" 
To: Sasha Levin 
Cc: Josh Triplett , Steven Rostedt ,
	Mathieu Desnoyers , Lai Jiangshan ,
	LKML , Dave Jones 
Subject: Re: rcu: frequent rcu lockups
Message-ID: <20150311204108.GM5412@linux.vnet.ibm.com>
Reply-To: paulmck@linux.vnet.ibm.com
References: <55009E2C.7070100@oracle.com>
 <20150311201704.GL5412@linux.vnet.ibm.com>
 <5500A329.9060509@oracle.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <5500A329.9060509@oracle.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
X-TM-AS-MML: disable
X-Content-Scanned: Fidelis XPS MAILER
x-cbid: 15031120-0013-0000-0000-0000095B1DB7
Sender: linux-kernel-owner@vger.kernel.org
List-ID: 
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Mar 11, 2015 at 04:18:49PM -0400, Sasha Levin wrote:
> On 03/11/2015 04:17 PM, Paul E. McKenney wrote:
> > On Wed, Mar 11, 2015 at 03:57:32PM -0400, Sasha Levin wrote:
> >> Hi all,
> >> 
> >> I've started seeing the following hang pretty often during my fuzzing.
> >> The system proceeds to lock up after that.
> >> 
> >> [ 3209.655703] INFO: rcu_preempt detected stalls on CPUs/tasks:
> >> [ 3209.655703] Tasks blocked on level-1 rcu_node (CPUs 16-31):
> >> [ 3209.655703] (detected by 0, t=30502 jiffies, g=48799, c=48798, q=1730)
> >> [ 3209.655703] All QSes seen, last rcu_preempt kthread activity 1 (4295246069-4295246068), jiffies_till_next_fqs=1, root ->qsmask 0x2
> >> [ 3209.655703] trinity-c24     R  running task    26944  9338   9110 0x10080000
> >> [ 3209.655703]  0000000000002396 00000000e68fa48e ffff880050607dc8 ffffffffa427679b
> >> [ 3209.655703]  ffff880050607d98 ffffffffb1b36000 0000000000000001 00000001000440f4
> >> [ 3209.655703]  ffffffffb1b351c8 dffffc0000000000 ffff880050622000 ffffffffb1721000
> >> [ 3209.655703] Call Trace:
> >> [ 3209.655703]  sched_show_task (kernel/sched/core.c:4542)
> >> [ 3209.655703]  rcu_check_callbacks (kernel/rcu/tree.c:1225 kernel/rcu/tree.c:1331 kernel/rcu/tree.c:3389 kernel/rcu/tree.c:3453 kernel/rcu/tree.c:2683)
> >> [ 3209.655703]  ? acct_account_cputime (kernel/tsacct.c:168)
> >> [ 3209.655703]  update_process_times (./arch/x86/include/asm/preempt.h:22 kernel/time/timer.c:1386)
> >> [ 3209.655703]  tick_periodic (kernel/time/tick-common.c:92)
> >> [ 3209.655703]  ? tick_handle_periodic (kernel/time/tick-common.c:105)
> >> [ 3209.655703]  tick_handle_periodic (kernel/time/tick-common.c:105)
> >> [ 3209.655703]  local_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:891)
> >> [ 3209.655703]  smp_trace_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:934 include/linux/jump_label.h:114 ./arch/x86/include/asm/trace/irq_vectors.h:45 arch/x86/kernel/apic/apic.c:935)
> >> [ 3209.655703]  trace_apic_timer_interrupt (arch/x86/kernel/entry_64.S:920)
> >> [ 3209.655703]  ? add_wait_queue (include/linux/wait.h:116 kernel/sched/wait.c:29)
> >> [ 3209.655703]  ? _raw_spin_unlock_irqrestore (./arch/x86/include/asm/paravirt.h:809 include/linux/spinlock_api_smp.h:162 kernel/locking/spinlock.c:191)
> >> [ 3209.655703]  add_wait_queue (kernel/sched/wait.c:31)
> >> [ 3209.655703]  do_wait (kernel/exit.c:1473)
> >> [ 3209.655703]  ? trace_rcu_dyntick (include/trace/events/rcu.h:363 (discriminator 19))
> >> [ 3209.655703]  ? wait_consider_task (kernel/exit.c:1465)
> >> [ 3209.655703]  ? find_get_pid (kernel/pid.c:490)
> >> [ 3209.655703]  SyS_wait4 (kernel/exit.c:1618 kernel/exit.c:1586)
> >> [ 3209.655703]  ? perf_syscall_exit (kernel/trace/trace_syscalls.c:549)
> >> [ 3209.655703]  ? SyS_waitid (kernel/exit.c:1586)
> >> [ 3209.655703]  ? kill_orphaned_pgrp (kernel/exit.c:1444)
> >> [ 3209.655703]  ? syscall_trace_enter_phase2 (arch/x86/kernel/ptrace.c:1592)
> >> [ 3209.655703]  ? trace_hardirqs_on_thunk (arch/x86/lib/thunk_64.S:42)
> >> [ 3209.655703]  tracesys_phase2 (arch/x86/kernel/entry_64.S:347)
> > 
> > OK, that is not good.
> > 
> > What version are you running and what is your .config?
> 
> latest -next. .config attached.

Aha, I forgot to update rcu/next.  I have now updated it, so it should
make it there today or tomorrow.

In the meantime, does the following commit help?  Also, how quickly does
your test setup reproduce this?

							Thanx, Paul

------------------------------------------------------------------------

rcu: Yet another fix for preemption and CPU hotplug

As noted earlier, the following sequence of events can occur when
running PREEMPT_RCU and HOTPLUG_CPU on a system with a multi-level
rcu_node combining tree:

1.	A group of tasks block on CPUs corresponding to a given leaf
	rcu_node structure while within RCU read-side critical sections.
2.	All CPUs corresponding to that rcu_node structure go offline.
3.	The next grace period starts, but because there are still tasks
	blocked, the upper-level bits corresponding to this leaf
	rcu_node structure remain set.
4.	All the tasks exit their RCU read-side critical sections and
	remove themselves from the leaf rcu_node structure's list,
	leaving it empty.
5.	But because there now is code to check for this condition at
	force-quiescent-state time, the upper bits are cleared and the
	grace period completes.

However, there is another complication that can occur following step 4
above:

4a.	The grace period starts, and the leaf rcu_node structure's
	gp_tasks pointer is set to NULL because there are no tasks
	blocked on this structure.
4b.	One of the CPUs corresponding to the leaf rcu_node structure
	comes back online.
4c.	An endless stream of tasks are preempted within RCU read-side
	critical sections on this CPU, such that the ->blkd_tasks list
	is always non-empty.

The grace period will never end.

This commit therefore makes the force-quiescent-state processing check
only for absence of tasks blocking the current grace period rather than
absence of tasks altogether.  This will cause a quiescent state to be
reported if the current leaf rcu_node structure is not blocking the
current grace period and its parent thinks that it is, regardless of
how RCU managed to get itself into this state.

Signed-off-by: Paul E. McKenney 
Cc:  # 4.0.x

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index cc2e9bebf585..f9a56523e8dd 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2212,8 +2212,8 @@ static void rcu_report_unblock_qs_rnp(struct rcu_state *rsp,
 	unsigned long mask;
 	struct rcu_node *rnp_p;
 
-	WARN_ON_ONCE(rsp == &rcu_bh_state || rsp == &rcu_sched_state);
-	if (rnp->qsmask != 0 || rcu_preempt_blocked_readers_cgp(rnp)) {
+	if (rcu_state_p == &rcu_sched_state || rsp != rcu_state_p ||
+	    rnp->qsmask != 0 || rcu_preempt_blocked_readers_cgp(rnp)) {
 		raw_spin_unlock_irqrestore(&rnp->lock, flags);
 		return;  /* Still need more quiescent states! */
 	}
@@ -2221,9 +2221,8 @@ static void rcu_report_unblock_qs_rnp(struct rcu_state *rsp,
 	rnp_p = rnp->parent;
 	if (rnp_p == NULL) {
 		/*
-		 * Either there is only one rcu_node in the tree,
-		 * or tasks were kicked up to root rcu_node due to
-		 * CPUs going offline.
+		 * Only one rcu_node structure in the tree, so don't
+		 * try to report up to its nonexistent parent!
 		 */
 		rcu_report_qs_rsp(rsp, flags);
 		return;
@@ -2715,8 +2714,29 @@ static void force_qs_rnp(struct rcu_state *rsp,
 			return;
 		}
 		if (rnp->qsmask == 0) {
-			rcu_initiate_boost(rnp, flags); /* releases rnp->lock */
-			continue;
+			if (rcu_state_p == &rcu_sched_state ||
+			    rsp != rcu_state_p ||
+			    rcu_preempt_blocked_readers_cgp(rnp)) {
+				/*
+				 * No point in scanning bits because they
+				 * are all zero.  But we might need to
+				 * priority-boost blocked readers.
+				 */
+				rcu_initiate_boost(rnp, flags);
+				/* rcu_initiate_boost() releases rnp->lock */
+				continue;
+			}
+			if (rnp->parent &&
+			    (rnp->parent->qsmask & rnp->grpmask)) {
+				/*
+				 * Race between grace-period
+				 * initialization and task exiting RCU
+				 * read-side critical section: Report.
+				 */
+				rcu_report_unblock_qs_rnp(rsp, rnp, flags);
+				/* rcu_report_unblock_qs_rnp() rlses ->lock */
+				continue;
+			}
 		}
 		cpu = rnp->grplo;
 		bit = 1;
@@ -2731,15 +2751,6 @@ static void force_qs_rnp(struct rcu_state *rsp,
 		if (mask != 0) {
 			/* Idle/offline CPUs, report. */
 			rcu_report_qs_rnp(mask, rsp, rnp, flags);
-		} else if (rnp->parent &&
-			   list_empty(&rnp->blkd_tasks) &&
-			   !rnp->qsmask &&
-			   (rnp->parent->qsmask & rnp->grpmask)) {
-			/*
-			 * Race between grace-period initialization and task
-			 * existing RCU read-side critical section, report.
-			 */
-			rcu_report_unblock_qs_rnp(rsp, rnp, flags);
 		} else {
 			/* Nothing to do here, so just drop the lock. */
 			raw_spin_unlock_irqrestore(&rnp->lock, flags);