From: Steffen Persvold <sp@numascale.com>
To: paulmck@linux.vnet.ibm.com
Cc: Daniel J Blueman <daniel@numascale-asia.com>,
Dipankar Sarma <dipankar@in.ibm.com>,
linux-kernel@vger.kernel.org, x86@kernel.org
Subject: Re: RCU qsmask !=0 warnings on large-SMP...
Date: Wed, 25 Jan 2012 21:35:15 +0100
Message-ID: <4F206783.2050901@numascale.com>
In-Reply-To: <20120125181441.GD2849@linux.vnet.ibm.com>
On 1/25/2012 19:14, Paul E. McKenney wrote:
[]
>> CONFIG_NO_HZ is not set, so it should not happen. We see the same behavior with CONFIG_NO_HZ=y though, but it usually takes longer to reproduce.
>
> OK, the CONFIG_NO_HZ=n has the least code involved, so it would be best
> for debugging.
Good, that was my thought as well when looking at the code. I'm reducing
NR_CPUS to 512 now to get a two-level tree, just to simplify debugging
(the issue is still present).
[]
>> Because the RCU tree is 3 levels deep, the printout function we added in the patch gets called 3 times, each time with the same rdp but a different rnp (in rcu_start_gp()).
>
> Ah, good point. Hmmm...
>
> Looking back at Daniel's original email, we have the root rcu_node
> structure with ->qsmask=0x1 (indicating the first descendant), the next
> level having ->qsmask=0x8 (indicating the fourth descendant), and the last
> level having ->qsmask=0x1, again indicating the first descendant. So:
> 0, 16, 32, 48, which agrees with the CPU number below that has not
> yet caught up to the current grace period.
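To make that arithmetic concrete, here is a tiny stand-alone sketch (not
kernel code; the fanout of 16 per level is an assumption taken from the
0/16/32/48 reasoning above) of how the three set bits decode to CPU 48:

    #include <stdio.h>

    int main(void)
    {
            /* qsmask values from the report, root level first. */
            unsigned long qsmask[3] = { 0x1, 0x8, 0x1 };
            int fanout = 16;        /* assumed per-level fanout */
            int cpu = 0, level;

            /*
             * The lowest set bit at each level selects one subtree;
             * at the leaf level it selects the CPU itself.
             */
            for (level = 0; level < 3; level++)
                    cpu = cpu * fanout + __builtin_ctzl(qsmask[level]);

            printf("stuck CPU = %d\n", cpu);  /* 0*256 + 3*16 + 0 = 48 */
            return 0;
    }

(__builtin_ctzl() is just the gcc "index of the lowest set bit" builtin.)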
>
> Another really odd thing... If we are starting the new grace period,
> we should have incremented rsp->gpnum. And in fact, we do have
> rsp->gpnum being one larger than rsp->completed, as expected. But
> if we have only initialized the root rcu_node structures, how can
> the per-CPU rcudata structures know about the new grace period yet?
>
> There was a time when the CPUs could find out early, but I think that
> was a long time ago. Yes, check_for_new_grace_period() does compare
> rdp->gpnum against rsp->gpnum, but it calls note_new_gpnum() which
> acquires rnp->lock, and nothing happens unless __note_new_gpnum()
> sees that rnp->gpnum differs from rdp->gpnum.
>
> So, it would be very interesting to add the values rdp->mynode->gpnum
> and rdp->mynode->completed to your list, perhaps labeling them something
> like "rng" and "rnc" respectively.
I will add this to the printout.
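Roughly like this, then -- the surrounding printout is our local debug
patch (not mainline), so take the exact set of fields as a sketch:

    /* rdp is the per-CPU rcu_data; rng/rnc come from its leaf rcu_node. */
    printk(KERN_INFO
           "cpu=%d gp=%lu rng=%lu rnc=%lu ql=%ld qsmask=0x%lx\n",
           rdp->cpu, rdp->gpnum,
           rdp->mynode->gpnum, rdp->mynode->completed,
           rdp->qlen, rnp->qsmask);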
>
> Of course, CPU 48 should not have decided that it was done with the
> old grace period before clearing its bit. For that matter, some
> CPU somewhere decided that the grace period was done despite the
> root rcu_node's ->qsmask being non-zero, which should be prevented
> by the:
>
> if (rnp->qsmask != 0 || rcu_preempt_blocked_readers_cgp(rnp)) {
>
> line in rcu_report_qs_rnp().
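For reference, paraphrasing the 3.2 code from memory (abridged, so treat
it as a sketch rather than a verbatim quote), that check sits in the loop
that clears our bit and then walks toward the root, stopping as soon as
any bit at the current level is still set:

    for (;;) {
            if (!(rnp->qsmask & mask)) {
                    /* Our bit was already cleared, nothing to do. */
                    raw_spin_unlock_irqrestore(&rnp->lock, flags);
                    return;
            }
            rnp->qsmask &= ~mask;
            if (rnp->qsmask != 0 || rcu_preempt_blocked_readers_cgp(rnp)) {
                    /* Other CPUs in this subtree still pending. */
                    raw_spin_unlock_irqrestore(&rnp->lock, flags);
                    return;
            }
            mask = rnp->grpmask;
            if (rnp->parent == NULL)
                    break;          /* Cleared all the way to the root. */
            raw_spin_unlock_irqrestore(&rnp->lock, flags);
            rnp = rnp->parent;
            raw_spin_lock_irqsave(&rnp->lock, flags);
    }
    /* Only the last CPU to report should get here and end the GP. */
    rcu_report_qs_rsp(rsp, flags);  /* Releases the root lock. */

So with the root still showing qsmask=0x1, nothing should have been able
to reach rcu_report_qs_rsp() and declare the grace period complete.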
>
> 3.2 has some event tracing that would be extremely helpful in tracking
> this down. Are you able to run 3.2?
Yes, 3.2.1 is our debug target right now.
[]
>>> Same here, but most of the ql= values are larger. Later printout?
>>
>> The loop in rcu_start_gp() releases the node lock each time it moves to
>> a new level of the RCU tree (it has to):
>>
>>     rcu_for_each_node_breadth_first(rsp, rnp) {
>>             raw_spin_lock(&rnp->lock); /* irqs already disabled. */
>>             rcu_debug_print(rsp, rnp);
>>             ...
>>             raw_spin_unlock(&rnp->lock); /* irqs remain disabled. */
>>     }
>>
>> so I guess this could allow the ql= values to increase, no?
>
> The ql= values can increase anyway -- those queues are only accessed by
> the corresponding CPU or from stop_machine context. The small increases
> are entirely consistent with your having bits set at all levels of the
> rcu_node tree. The reason I was surprised is that my earlier bugs (as
> in before the code hit mainline) only ever resulted in a single level
> having a stray bit.
Ok.
[]
>>
>> Thanks for looking into this, Paul; we'd be more than happy to test out theories and patches.
>
> The event tracing, particularly the "rcu_grace_period" set, would be
> very helpful.
Are you talking about the data from /sys/kernel/debug/rcu/? I have
CONFIG_RCU_TRACE (and consequently CONFIG_TREE_RCU_TRACE) set; is that
enough to get the event data you want?
Cheers,
--
Steffen Persvold, Chief Architect NumaChip
Numascale AS - www.numascale.com
Tel: +47 92 49 25 54 Skype: spersvold