Re: RCU qsmask !=0 warnings on large-SMP...

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Steffen Persvold <sp@numascale.com>
Cc: Daniel J Blueman <daniel@numascale-asia.com>,
	Dipankar Sarma <dipankar@in.ibm.com>,
	linux-kernel@vger.kernel.org, x86@kernel.org
Subject: Re: RCU qsmask !=0 warnings on large-SMP...
Date: Wed, 25 Jan 2012 13:51:21 -0800	[thread overview]
Message-ID: <20120125215121.GK2849@linux.vnet.ibm.com> (raw)
In-Reply-To: <4F206783.2050901@numascale.com>

On Wed, Jan 25, 2012 at 09:35:15PM +0100, Steffen Persvold wrote:
> On 1/25/2012 19:14, Paul E. McKenney wrote:
> []
> >>CONFIG_NO_HZ is not set, so it should not happen. We see that the behavior is the same with CONFIG_NO_HZ=y though, but it takes longer to reproduce usually.
> >
> >OK, the CONFIG_NO_HZ=n has the least code involved, so it would be best
> >for debugging.
> 
> Good, that was my thought also when looking at the code. I'm
> reducing NR_CPUS to 512 now to get two levels just for simpler
> debugging (the issue is still present).

OK.  The two-level structure is very heavily tested -- any system with
NR_CPUS > 16 uses two levels.

So you do have something interesting going on.

> []
> >>Because the RCU tree is 3 levels, the printout function we added in the patch gets called 3 times each time with the same RDP but with different RNPs (in rcu_start_gp()).
> >
> >Ah, good point.  Hmmm...
> >
> >Looking back at Daniel's original email, we have the root rcu_node
> >structure with ->qsmask=0x1 (indicating first descendent), the next
> >level having ->qsmask=0x8 (indicating fourth descendent) and the last
> >level having ->qsmask=0x1, again indicating first descendent.  So,
> >zero, 16, 32, 48.  Which agrees with the CPU number below that has not
> >yet caught up to the current grace period.
> >
> >Another really odd thing...  If we are starting the new grace period,
> >we should have incremented rsp->gpnum.  And in fact, we do have
> >rsp->gpnum being one larger than rsp->completed, as expected.  But
> >if we have only initialized the root rcu_node structures, how can
> >the per-CPU rcudata structures know about the new grace period yet?
> >
> >There was a time when the CPUs could find out early, but I think that
> >was a long time ago.  Yes, check_for_new_grace_period() does compare
> >rdp->gpnum against rsp->gpnum, but it calls note_new_gpnum() which
> >acquires the rnp->lock(), and nothing happens unless __note_new_gpnum()
> >sees that rnp->gpnum is different rdp->gpnum.
> >
> >So, it would be very interesting to add the values rdp->mynode->gpnum
> >and rdp->mynode->completed to your list, perhaps labeling them something
> >like "rng" and "rnc" respectively.
> 
> I will add this to the printout.

I see that you did, thank you!

> >Of course, CPU 48 should not have decided that it was done with the
> >old grace period before clearing its bit.  For that matter, some
> >CPU somewhere decided that the grace period was done despite the
> >root rcu_node's ->qsmask being non-zero, which should be prevented
> >by the:
> >
> >	if (rnp->qsmask != 0 || rcu_preempt_blocked_readers_cgp(rnp)) {
> >
> >line in rcu_report_qs_rnp().
> >
> >3.2 has some event tracing that would be extremely helpful in tracking
> >this down.  Are you able to run 3.2?
> 
> Yes, 3.2.1 is our debug target right now.

OK, good!  Could you please add an "ftrace_dump(DUMP_ALL)" before you
print the first set of error messages?  (Preferably set up so that
you only dump once!)

Then before you start testing, could you please enable the
rcu_grace_period and rcu_grace_period_init trace events?  This should
get a good picture of the sequence of grace-period-related events leading
up to the failure.

You will need to build the kernel with CONFIG_RCU_TRACE=y.

The usual commands suffice to enable tracing:

	echo 1 > /sys/kernel/debug/tracing/events/rcu_grace_period/enable
	echo 1 > /sys/kernel/debug/tracing/events/rcu_grace_period_init/enable

This should give some history that might help understand why the previous
grace period ended before the CPUs had all checked in.  Maybe even why
the rcu_node structures got advance notice of the new grace period...

> []
> >>>Same here, but most of the ql= values are larger.  Later printout?
> >>
> >>The loop in rcu_start_gp() releases the node lock between each time it gets a new level in the RCU tree (it has to) :
> >>
> >>	rcu_for_each_node_breadth_first(rsp, rnp) {
> >>		raw_spin_lock(&rnp->lock);	/* irqs already disabled. */
> >>		rcu_debug_print(rsp, rnp);
> >>
> >>so I guess this allows ql= values to increase maybe, no ?
> >
> >The ql= levels can increase anyway -- those queues are only accessed by
> >the corresponding CPU or from stop_machine context.  The small increases
> >are entirely consistent with your having bits set at all levels of the
> >rcu_node tree.  The reason I was surprised is that my earlier bugs (as
> >in before the code hit mainline) only ever resulted in a single level
> >having a stray bit.
> 
> Ok.
> 
> []
> >>
> >>Thanks for looking into this Paul, we'd be more than happy to test out theories and patches.
> >
> >The event tracing, particularly the "rcu_grace_period" set, would be
> >very helpful.
> 
> Are you talking about the data from /sys/kernel/debug/rcu/ ? I have
> CONFIG_RCU_TRACE (and consequently CONFIG_TREE_RCU_TRACE) set, is
> this enough to get the event data you want ?

Yep, if you have CONFIG_RCU_TRACE=y, then the two tracepoints should
be available.

							Thanx, Paul

next prev parent reply	other threads:[~2012-01-25 21:52 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-01-25  9:44 RCU qsmask !=0 warnings on large-SMP Daniel J Blueman
2012-01-25 14:00 ` Paul E. McKenney
2012-01-25 14:18   ` Steffen Persvold
2012-01-25 18:14     ` Paul E. McKenney
2012-01-25 20:35       ` Steffen Persvold
2012-01-25 21:51         ` Paul E. McKenney [this message]
2012-01-25 22:51           ` Steffen Persvold
2012-01-26  1:57             ` Paul E. McKenney
2012-01-25 21:14       ` Steffen Persvold
2012-01-25 21:34         ` Paul E. McKenney
2012-01-25 22:48           ` Steffen Persvold
2012-01-26  1:58             ` Paul E. McKenney
2012-01-26 15:04               ` Steffen Persvold
2012-01-26 19:26                 ` Paul E. McKenney
2012-01-27 11:09                   ` Steffen Persvold
2012-01-29  6:09                     ` Paul E. McKenney
2012-01-30 16:15                       ` Paul E. McKenney
2012-01-31 17:33                         ` Steffen Persvold
2012-01-31 17:38                           ` Paul E. McKenney

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20120125215121.GK2849@linux.vnet.ibm.com \
    --to=paulmck@linux.vnet.ibm.com \
    --cc=daniel@numascale-asia.com \
    --cc=dipankar@in.ibm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=sp@numascale.com \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).