Date: Wed, 25 Jan 2012 06:00:29 -0800
From: "Paul E. McKenney"
Reply-To: paulmck@linux.vnet.ibm.com
To: Daniel J Blueman
Cc: Dipankar Sarma, Steffen Persvold, linux-kernel@vger.kernel.org, x86@kernel.org
Subject: Re: RCU qsmask !=0 warnings on large-SMP...
Message-ID: <20120125140029.GA2534@linux.vnet.ibm.com>
In-Reply-To: <4F1FCF02.9060209@numascale-asia.com>

On Wed, Jan 25, 2012 at 05:44:34PM +0800, Daniel J Blueman wrote:
> During validation of an interconnect which extends the AMD Opteron
> HyperTransport to many servers (Numascale NumaConnect), we observe
> RCU warnings [1] when larger systems are under specific workloads,
> and subsequent hanging.
>
> This is reproducible via RCU-intensive activities, e.g. the rcutorture
> module (let me know if interested in another reproducer 'bigmap'
> that hits the spot) on, e.g., stock 3.1.8 and 3.2.1. Reproducing on a
> subset of cores for clarity and with further debug [2], we observe
> struct rcu_node's qsmask bitfield not being cleared on typically
> three tree nodes when expected to be [3], and we see the RCU grace
> periods stop incrementing.

I do test three-level trees regularly, but on relatively small systems
by limiting CONFIG_RCU_FANOUT.
If you are running 64-bit systems with default values of
CONFIG_RCU_FANOUT, a 1024-CPU system fits into a two-level tree, so I
would guess that you either have more than 1024 CPUs (at least as far
as NR_CPUS is concerned) or gaps in your CPU-numbering scheme.

> Has this been encountered previously?

I haven't seen that warning since I was originally developing
TREE_PREEMPT_RCU some years back.  Numascale NumaConnect is
cache-coherent, right?

> Many thanks,
>   Daniel
>
> --- [1]
>
> WARNING: at kernel/rcutree_plugin.h:990

OK, if I have the right version, RCU-sched is trying to start a new
grace period, but finds that one of the rcu_node structures' ->qsmask
fields indicates that a CPU (or group of CPUs) failed to check into
the previous grace period.

> rcu_preempt_check_blocked_tasks+0x27/0x30()
> Hardware name: ProLiant DL165 G6
> Modules linked in:
> Pid: 21880, comm: bigmap Not tainted 3.1.8-numaconnect10+ #5
> Call Trace:
>  [] ? rcu_preempt_check_blocked_tasks+0x27/0x30
>  [] warn_slowpath_common+0x8b/0xc0
>  [] ? perf_event_task_tick+0x71/0xa0
>  [] warn_slowpath_null+0x15/0x20
>  [] rcu_preempt_check_blocked_tasks+0x27/0x30
>  [] rcu_start_gp+0x163/0x200
>  [] __rcu_process_callbacks+0x8b/0xd0
>  [] rcu_process_callbacks+0x20/0x50
>  [] __do_softirq+0x9d/0x140
>  [] call_softirq+0x1c/0x30
>  [] do_softirq+0x4a/0x80
>  [] irq_exit+0xa5/0xb0
>  [] smp_apic_timer_interrupt+0x45/0x60
>  [] apic_timer_interrupt+0x6b/0x70
>  [] ? call_rcu_sched+0x10/0x20
>  [] ? fsnotify+0x70/0x2d0
>  [] ? iput_final+0xe3/0x150
>  [] ? kmem_cache_free+0x15/0xa0
>  [] ? vfsmount_lock_global_lock_online+0x4a/0xc0
>  [] __fput+0x17d/0x1f0
>  [] fput+0x17/0x20
>  [] remove_vma+0x36/0x70
>  [] exit_mmap+0xee/0x120
>  [] mmput+0x44/0xf0
>  [] exit_mm+0x109/0x140
>  [] do_exit+0x1b6/0x3d0
>  [] ? __dequeue_signal+0x5b/0xa0
>  [] do_group_exit+0x3c/0xa0
>  [] get_signal_to_deliver+0x1a0/0x320
>  [] do_signal+0x4c/0x120
>  [] ? kill_something_info+0x3d/0x130
>  [] ?
sys_kill+0x7b/0x90
>  [] do_notify_resume+0x6d/0x80
>  [] int_signal+0x12/0x17
> ---[ end trace 826a49848cffd17f ]---
>
> --- [2]
>
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index 6b76d81..7875427 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -828,6 +828,39 @@ rcu_start_gp_per_cpu(struct rcu_state *rsp, struct rcu_node *rnp, struct rcu_dat
>  	__note_new_gpnum(rsp, rnp, rdp);
>  }
>
> +static void
> +_debug_print(struct rcu_data *rdp)
> +{
> +	printk(KERN_ERR "%3d %p c=%lu g=%lu pq=%d pgp=%lu qp=%d of=%lu ri=%lu ql=%ld qs=%c%c%c%c b=%ld ci=%lu co=%lu ca=%lu\n",
> +		rdp->cpu, rdp,
> +		rdp->completed, rdp->gpnum,
> +		rdp->passed_quiesce, rdp->passed_quiesce_gpnum,
> +		rdp->qs_pending,
> +		rdp->offline_fqs, rdp->resched_ipi,
> +		rdp->qlen,
> +		".N"[rdp->nxttail[RCU_NEXT_READY_TAIL] !=
> +		     rdp->nxttail[RCU_NEXT_TAIL]],
> +		".R"[rdp->nxttail[RCU_WAIT_TAIL] !=
> +		     rdp->nxttail[RCU_NEXT_READY_TAIL]],
> +		".W"[rdp->nxttail[RCU_DONE_TAIL] !=
> +		     rdp->nxttail[RCU_WAIT_TAIL]],
> +		".D"[&rdp->nxtlist != rdp->nxttail[RCU_DONE_TAIL]],
> +		rdp->blimit, rdp->n_cbs_invoked, rdp->n_cbs_orphaned, rdp->n_cbs_adopted);
> +}
> +
> +static void
> +rcu_debug_print(struct rcu_state *rsp, struct rcu_node *rnp)
> +{
> +	int cpu;
> +
> +	if (rnp->qsmask) {
> +		printk(KERN_ERR "CPU %d, treason uncloaked, rsp @ %p (%s), rsp->gpnum is %lu, rnp @ %p, qsmask is 0x%lx\n",
> +			smp_processor_id(), rsp, rsp->name, rsp->gpnum, rnp, rnp->qsmask);
> +		for_each_online_cpu(cpu)
> +			_debug_print(per_cpu_ptr(rsp->rda, cpu));
> +	}
> +}
> +
>  /*
>   * Start a new RCU grace period if warranted, re-initializing the hierarchy
>   * in preparation for detecting the next grace period.  The caller must hold
> @@ -873,6 +906,7 @@ rcu_start_gp(struct rcu_state *rsp, unsigned long flags)
>
>  	/* Special-case the common single-level case.
*/
>  	if (NUM_RCU_NODES == 1) {
> +		rcu_debug_print(rsp, rnp);
>  		rcu_preempt_check_blocked_tasks(rnp);
>  		rnp->qsmask = rnp->qsmaskinit;
>  		rnp->gpnum = rsp->gpnum;
> @@ -912,6 +946,7 @@ rcu_start_gp(struct rcu_state *rsp, unsigned long flags)
>  	 */
>  	rcu_for_each_node_breadth_first(rsp, rnp) {
>  		raw_spin_lock(&rnp->lock); /* irqs already disabled. */
> +		rcu_debug_print(rsp, rnp);
>  		rcu_preempt_check_blocked_tasks(rnp);
>  		rnp->qsmask = rnp->qsmaskinit;
>  		rnp->gpnum = rsp->gpnum;
>
> --- [3]
>
> CPU 48, treason uncloaked, rsp @ ffffffff81a1d000 (rcu_sched),
> rsp->gpnum is 4058, rsp->completed is 4057, rnp @ ffffffff81a1d000,
> qsmask is 0x1
>   0 ffff8803f840d4e0 c=4057 g=4058 pq=1 pgp=4058 qp=0 of=0 ri=279 ql=1 qs=..W. b=10 ci=145949 co=0 ca=0
>  12 ffff880bd040d4e0 c=4057 g=4058 pq=1 pgp=4058 qp=0 of=7 ri=10 ql=2 qs=.R.. b=10 ci=561 co=0 ca=0
>  24 ffff8813d040d4e0 c=4057 g=4058 pq=1 pgp=4058 qp=0 of=6 ri=14 ql=0 qs=.... b=10 ci=406 co=0 ca=0
>  36 ffff881bd040d4e0 c=4057 g=4058 pq=1 pgp=4058 qp=0 of=5 ri=22 ql=0 qs=.... b=10 ci=215 co=0 ca=0
>  48 ffff8823d040d4e0 c=4057 g=4057 pq=1 pgp=4057 qp=0 of=6 ri=14 ql=26 qs=.RWD b=10 ci=7345 co=0 ca=0

And it looks like CPU 48 is the one that did not check in.

What is CONFIG_NO_HZ set to?  If CONFIG_NO_HZ=y, this can legitimately
happen, and printing the values of rdp->dynticks->dynticks,
rdp->dynticks->dynticks_nesting, rdp->dynticks->dynticks_nmi_nesting,
and rdp->dynticks_fqs will determine whether or not the situation is
legitimate.  If CONFIG_NO_HZ=n, then this should not happen.  In that
case, I should be able to provide you with a more focused debug print.

> 60 ffff882bd040d4e0 c=4057 g=4058 pq=1 pgp=4058 qp=0 of=6 ri=16 ql=0 qs=.... b=10 ci=50 co=0 ca=0
> 72 ffff8833d040d4e0 c=4057 g=4058 pq=1 pgp=4058 qp=0 of=6 ri=26 ql=0 qs=....
b=10 ci=22 co=0 ca=0
> ------------[ cut here ]------------
> WARNING: at kernel/rcutree_plugin.h:1011 rcu_preempt_check_blocked_tasks+0x27/0x30()
> Hardware name: H8QI6
> Modules linked in: rcutorture
> Pid: 4562, comm: rcu_torture_rea Not tainted 3.2.1-numaconnect10+ #59
> Call Trace:
>  [] ? rcu_preempt_check_blocked_tasks+0x27/0x30
>  [] warn_slowpath_common+0x8b/0xc0
>  [] warn_slowpath_null+0x15/0x20
>  [] rcu_preempt_check_blocked_tasks+0x27/0x30
>  [] rcu_start_gp+0x102/0x1b0
>  [] __rcu_process_callbacks+0x8b/0xd0
>  [] rcu_process_callbacks+0x20/0x40
>  [] __do_softirq+0x9d/0x140
>  [] ? rcu_torture_shuffle+0x80/0x80 [rcutorture]
>  [] call_softirq+0x1c/0x30
>  [] do_softirq+0x4a/0x80
>  [] irq_exit+0x43/0x60
>  [] smp_apic_timer_interrupt+0x45/0x60
>  [] apic_timer_interrupt+0x6b/0x70
>  [] ? __schedule+0x34f/0x710
>  [] ? update_curr+0x85/0xd0
>  [] ? sched_clock_local+0x15/0x80
>  [] ? sched_clock_cpu+0x65/0x90
>  [] ? rcu_torture_shuffle+0x80/0x80 [rcutorture]
>  [] schedule+0x3a/0x60
>  [] rcu_torture_reader+0x130/0x230 [rcutorture]
>  [] ? rcu_torture_writer+0x160/0x160 [rcutorture]
>  [] ? rcu_torture_shuffle+0x80/0x80 [rcutorture]
>  [] kthread+0x96/0xa0
>  [] kernel_thread_helper+0x4/0x10
>  [] ? kthread_stop+0x70/0x70
>  [] ? gs_change+0xb/0xb
> ---[ end trace 19f7bd7f233f1506 ]---
> CPU 48, treason uncloaked, rsp @ ffffffff81a1d000 (rcu_sched),
> rsp->gpnum is 4058, rsp->completed is 4057, rnp @ ffffffff81a1d100,
> qsmask is 0x8
>   0 ffff8803f840d4e0 c=4057 g=4058 pq=1 pgp=4058 qp=0 of=0 ri=279 ql=1 qs=..W. b=10 ci=145949 co=0 ca=0
>  12 ffff880bd040d4e0 c=4057 g=4058 pq=1 pgp=4058 qp=0 of=7 ri=10 ql=3 qs=NR.. b=10 ci=561 co=0 ca=0
>  24 ffff8813d040d4e0 c=4057 g=4058 pq=1 pgp=4058 qp=0 of=6 ri=14 ql=0 qs=.... b=10 ci=406 co=0 ca=0
>  36 ffff881bd040d4e0 c=4057 g=4058 pq=1 pgp=4058 qp=0 of=5 ri=22 ql=0 qs=.... b=10 ci=215 co=0 ca=0
>  48 ffff8823d040d4e0 c=4057 g=4057 pq=1 pgp=4057 qp=0 of=6 ri=19 ql=26 qs=.RWD b=10 ci=7345 co=0 ca=0

Hmmmm...
Same CPU and grace-period number.  This is a repeat of the above
printout, correct?

> 60 ffff882bd040d4e0 c=4057 g=4058 pq=1 pgp=4058 qp=0 of=6 ri=16 ql=0 qs=.... b=10 ci=50 co=0 ca=0
> 72 ffff8833d040d4e0 c=4057 g=4058 pq=1 pgp=4058 qp=0 of=6 ri=26 ql=0 qs=.... b=10 ci=22 co=0 ca=0
> CPU 48, treason uncloaked, rsp @ ffffffff81a1d000 (rcu_sched),
> rsp->gpnum is 4058, rsp->completed is 4057, rnp @ ffffffff81a1d800,
> qsmask is 0x1
>   0 ffff8803f840d4e0 c=4057 g=4058 pq=1 pgp=4058 qp=0 of=0 ri=280 ql=1 qs=..W. b=10 ci=145949 co=0 ca=0
>  12 ffff880bd040d4e0 c=4057 g=4058 pq=1 pgp=4058 qp=0 of=7 ri=11 ql=3 qs=NR.. b=10 ci=561 co=0 ca=0
>  24 ffff8813d040d4e0 c=4057 g=4058 pq=1 pgp=4058 qp=0 of=6 ri=15 ql=0 qs=.... b=10 ci=406 co=0 ca=0
>  36 ffff881bd040d4e0 c=4057 g=4058 pq=1 pgp=4058 qp=0 of=5 ri=23 ql=0 qs=.... b=10 ci=215 co=0 ca=0
>  48 ffff8823d040d4e0 c=4057 g=4057 pq=1 pgp=4057 qp=0 of=6 ri=21 ql=26 qs=.RWD b=10 ci=7345 co=0 ca=0

Same here, but most of the ri= values are larger.  Later printout?

							Thanx, Paul

> 60 ffff882bd040d4e0 c=4057 g=4058 pq=1 pgp=4058 qp=0 of=6 ri=16 ql=0 qs=.... b=10 ci=50 co=0 ca=0
> 72 ffff8833d040d4e0 c=4057 g=4058 pq=1 pgp=4058 qp=0 of=6 ri=26 ql=0 qs=.... b=10 ci=22 co=0 ca=0
>
> --
> Daniel J Blueman
> Principal Software Engineer, Numascale Asia