False-positive RCU stall warnings on large systems...

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* False-positive RCU stall warnings on large systems...
@ 2013-02-19 16:34 Daniel J Blueman
  2013-02-19 18:16 ` Paul E. McKenney
  0 siblings, 1 reply; 6+ messages in thread
From: Daniel J Blueman @ 2013-02-19 16:34 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Steffen Persvold, LKML

Hi Paul,

On some of our larger servers with many hundreds of cores and when under 
high duress, we can see scheduler RCU stall warnings [1], so find we 
have to increase the hardcoded RCU_STALL_RAT_DELAY up from 2 and 
RCU_JIFFIES_TILL_FORCE_QS up from 3.

Is there a more sustainable way to account for this to avoid it being 
hard-coded, such as making it and dependent timeouts a fraction of 
CONFIG_RCU_CPU_STALL_TIMEOUT?

On the other hand, perhaps this is just caused by clock jitter (eg due 
to distance from a contended clock source)? So increasing these a bit 
may just be adequate in general...

Many thanks,
   Daniel

--- [1]

[ 3939.010085] INFO: rcu_sched detected stalls on CPUs/tasks: {} 
(detected by 1, t=29662 jiffies, g=3053, c=3052, q=598)
[ 3939.020008] INFO: Stall ended before state dump start
-- 
Daniel J Blueman
Principal Software Engineer, Numascale Asia

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: False-positive RCU stall warnings on large systems...
  2013-02-19 16:34 False-positive RCU stall warnings on large systems Daniel J Blueman
@ 2013-02-19 18:16 ` Paul E. McKenney
  2013-02-20  3:35   ` Daniel J Blueman
  0 siblings, 1 reply; 6+ messages in thread
From: Paul E. McKenney @ 2013-02-19 18:16 UTC (permalink / raw)
  To: Daniel J Blueman; +Cc: Paul E. McKenney, Steffen Persvold, LKML

On Wed, Feb 20, 2013 at 12:34:12AM +0800, Daniel J Blueman wrote:
> Hi Paul,
> 
> On some of our larger servers with many hundreds of cores and when
> under high duress, we can see scheduler RCU stall warnings [1], so
> find we have to increase the hardcoded RCU_STALL_RAT_DELAY up from 2
> and RCU_JIFFIES_TILL_FORCE_QS up from 3.
> 
> Is there a more sustainable way to account for this to avoid it
> being hard-coded, such as making it and dependent timeouts a
> fraction of CONFIG_RCU_CPU_STALL_TIMEOUT?
> 
> On the other hand, perhaps this is just caused by clock jitter (eg
> due to distance from a contended clock source)? So increasing these
> a bit may just be adequate in general...

Hmmm...  What version of the kernel are you running?

							Thanx, Paul

> Many thanks,
>   Daniel
> 
> --- [1]
> 
> [ 3939.010085] INFO: rcu_sched detected stalls on CPUs/tasks: {}
> (detected by 1, t=29662 jiffies, g=3053, c=3052, q=598)
> [ 3939.020008] INFO: Stall ended before state dump start
> -- 
> Daniel J Blueman
> Principal Software Engineer, Numascale Asia
> 


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: False-positive RCU stall warnings on large systems...
  2013-02-19 18:16 ` Paul E. McKenney
@ 2013-02-20  3:35   ` Daniel J Blueman
  2013-02-25 16:32     ` Paul E. McKenney
  0 siblings, 1 reply; 6+ messages in thread
From: Daniel J Blueman @ 2013-02-20  3:35 UTC (permalink / raw)
  To: paulmck; +Cc: Paul E. McKenney, Steffen Persvold, LKML

On 20/02/2013 02:16, Paul E. McKenney wrote:
> On Wed, Feb 20, 2013 at 12:34:12AM +0800, Daniel J Blueman wrote:
>> Hi Paul,
>>
>> On some of our larger servers with many hundreds of cores and when
>> under high duress, we can see scheduler RCU stall warnings [1], so
>> find we have to increase the hardcoded RCU_STALL_RAT_DELAY up from 2
>> and RCU_JIFFIES_TILL_FORCE_QS up from 3.
>>
>> Is there a more sustainable way to account for this to avoid it
>> being hard-coded, such as making it and dependent timeouts a
>> fraction of CONFIG_RCU_CPU_STALL_TIMEOUT?
>>
>> On the other hand, perhaps this is just caused by clock jitter (eg
>> due to distance from a contended clock source)? So increasing these
>> a bit may just be adequate in general...
>
> Hmmm...  What version of the kernel are you running?

The example below occurs with v3.8, but we see the same with previous 
kernels eg v3.5.

Of course, when using the local TSC, you'd see no jitter relative to 
coherent transactions (eg memory writes), but when the HPET is used 
across a large system, coherent transactions to distant cores are just 
so much faster, as there's massive congestion to the shared HPET behind 
various HT and PCIe bridges. This could be where the jitter arises from, 
if I'm guessing jitter is the problem here.

Thanks,
   Daniel

>> --- [1]
>>
>> [ 3939.010085] INFO: rcu_sched detected stalls on CPUs/tasks: {}
>> (detected by 1, t=29662 jiffies, g=3053, c=3052, q=598)
>> [ 3939.020008] INFO: Stall ended before state dump start
-- 
Daniel J Blueman
Principal Software Engineer, Numascale Asia

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: False-positive RCU stall warnings on large systems...
  2013-02-20  3:35   ` Daniel J Blueman
@ 2013-02-25 16:32     ` Paul E. McKenney
  2013-03-05  9:02       ` Daniel J Blueman
  0 siblings, 1 reply; 6+ messages in thread
From: Paul E. McKenney @ 2013-02-25 16:32 UTC (permalink / raw)
  To: Daniel J Blueman; +Cc: Paul E. McKenney, Steffen Persvold, LKML

On Wed, Feb 20, 2013 at 11:35:57AM +0800, Daniel J Blueman wrote:
> On 20/02/2013 02:16, Paul E. McKenney wrote:
> >On Wed, Feb 20, 2013 at 12:34:12AM +0800, Daniel J Blueman wrote:
> >>Hi Paul,
> >>
> >>On some of our larger servers with many hundreds of cores and when
> >>under high duress, we can see scheduler RCU stall warnings [1], so
> >>find we have to increase the hardcoded RCU_STALL_RAT_DELAY up from 2
> >>and RCU_JIFFIES_TILL_FORCE_QS up from 3.

Disabling RCU_FAST_NO_HZ will likely remove the need to adjust
RCU_JIFFIES_TILL_FORCE_QS.  Changes in my -rcu tree will likely remove the
need to adjust these two in 3.10 or 3.11, depending on how testing goes.

> >>Is there a more sustainable way to account for this to avoid it
> >>being hard-coded, such as making it and dependent timeouts a
> >>fraction of CONFIG_RCU_CPU_STALL_TIMEOUT?

Maybe...  But what this means is that your system is so heavily loaded
that the CPU in question is failing to make it to RCU's softirq handler
in two jiffies worth of time.  This is a function of workload rather
than of the number of CPUs.

> >>On the other hand, perhaps this is just caused by clock jitter (eg
> >>due to distance from a contended clock source)? So increasing these
> >>a bit may just be adequate in general...
> >
> >Hmmm...  What version of the kernel are you running?
> 
> The example below occurs with v3.8, but we see the same with
> previous kernels eg v3.5.

There is always the rcutree.rcu_cpu_stall_timeout parameter that sets
the stall timeout in seconds.  This may be specified at boot time or
via sysfs at runtime.  The default is now 21 seconds.

> Of course, when using the local TSC, you'd see no jitter relative to
> coherent transactions (eg memory writes), but when the HPET is used
> across a large system, coherent transactions to distant cores are
> just so much faster, as there's massive congestion to the shared
> HPET behind various HT and PCIe bridges. This could be where the
> jitter arises from, if I'm guessing jitter is the problem here.

Agreed, timing jitter could cause problems.  That said, the code uses
the jiffies counter to compute these timings.  Are you seeing similar
memory contention on the jiffies counter itself?

							Thanx, Paul

> Thanks,
>   Daniel
> 
> >>--- [1]
> >>
> >>[ 3939.010085] INFO: rcu_sched detected stalls on CPUs/tasks: {}
> >>(detected by 1, t=29662 jiffies, g=3053, c=3052, q=598)
> >>[ 3939.020008] INFO: Stall ended before state dump start
> -- 
> Daniel J Blueman
> Principal Software Engineer, Numascale Asia
> 


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: False-positive RCU stall warnings on large systems...
  2013-02-25 16:32     ` Paul E. McKenney
@ 2013-03-05  9:02       ` Daniel J Blueman
  2013-03-06 17:03         ` Paul E. McKenney
  0 siblings, 1 reply; 6+ messages in thread
From: Daniel J Blueman @ 2013-03-05  9:02 UTC (permalink / raw)
  To: paulmck; +Cc: Paul E. McKenney, Steffen Persvold, LKML

On 02/26/2013 12:32 AM, Paul E. McKenney wrote:
> On Wed, Feb 20, 2013 at 11:35:57AM +0800, Daniel J Blueman wrote:
>> On 20/02/2013 02:16, Paul E. McKenney wrote:
>>> On Wed, Feb 20, 2013 at 12:34:12AM +0800, Daniel J Blueman wrote:
>>>> Hi Paul,
>>>>
>>>> On some of our larger servers with many hundreds of cores and when
>>>> under high duress, we can see scheduler RCU stall warnings [1], so
>>>> find we have to increase the hardcoded RCU_STALL_RAT_DELAY up from 2
>>>> and RCU_JIFFIES_TILL_FORCE_QS up from 3.
>
> Disabling RCU_FAST_NO_HZ will likely remove the need to adjust
> RCU_JIFFIES_TILL_FORCE_QS.  Changes in my -rcu tree will likely remove the
> need to adjust these two in 3.10 or 3.11, depending on how testing goes.
>
>>>> Is there a more sustainable way to account for this to avoid it
>>>> being hard-coded, such as making it and dependent timeouts a
>>>> fraction of CONFIG_RCU_CPU_STALL_TIMEOUT?
>
> Maybe...  But what this means is that your system is so heavily loaded
> that the CPU in question is failing to make it to RCU's softirq handler
> in two jiffies worth of time.  This is a function of workload rather
> than of the number of CPUs.
>
>>>> On the other hand, perhaps this is just caused by clock jitter (eg
>>>> due to distance from a contended clock source)? So increasing these
>>>> a bit may just be adequate in general...
>>>
>>> Hmmm...  What version of the kernel are you running?
>>
>> The example below occurs with v3.8, but we see the same with
>> previous kernels eg v3.5.
>
> There is always the rcutree.rcu_cpu_stall_timeout parameter that sets
> the stall timeout in seconds.  This may be specified at boot time or
> via sysfs at runtime.  The default is now 21 seconds.
>
>> Of course, when using the local TSC, you'd see no jitter relative to
>> coherent transactions (eg memory writes), but when the HPET is used
>> across a large system, coherent transactions to distant cores are
>> just so much faster, as there's massive congestion to the shared
>> HPET behind various HT and PCIe bridges. This could be where the
>> jitter arises from, if I'm guessing jitter is the problem here.
>
> Agreed, timing jitter could cause problems.  That said, the code uses
> the jiffies counter to compute these timings.  Are you seeing similar
> memory contention on the jiffies counter itself?

The contention we see in general are when cores contend for a spinlock 
and when there are lots of concurrent HPET reads (Opterons allow only 4 
outstanding reads to the IO hub).

It's probably possible to reproduce rcu_sched stalls on a quad-socket 
box with 64 cores and the right workload with the TSC disabled.

In 3.9-rc1 with RCU_FAST_NO_HZ disabled, we've seen stalls of 4 jiffies 
[2], but without the "Stall ended" message. This is with a workload 
which allocates ~256GB of memory over 192 cores.

Thanks,
   Daniel

>>>> --- [1]
>>>>
>>>> [ 3939.010085] INFO: rcu_sched detected stalls on CPUs/tasks: {}
>>>> (detected by 1, t=29662 jiffies, g=3053, c=3052, q=598)
>>>> [ 3939.020008] INFO: Stall ended before state dump start

--- [2]

[10660.110620] INFO: rcu_sched self-detected stall on CPU { 39}  (t=4 
jiffies g=1169 c=1168 q=8)
[10660.110620] Pid: 11747, comm: sp.B Not tainted 3.9.0-rc1-advanced #6
[10660.110620] Call Trace:
[10660.110620]  <IRQ>  [<ffffffff810b2a42>] ? 
rcu_check_callbacks+0x2d2/0x5f0
[10660.110620]  [<ffffffff8107d94a>] ? run_posix_cpu_timers+0x3a/0x790
[10660.110620]  [<ffffffff8106c86f>] ? update_process_times+0x3f/0x80
[10660.110620]  [<ffffffff81098280>] ? tick_sched_handle.isra.8+0x30/0x40
[10660.110620]  [<ffffffff810983b2>] ? tick_sched_timer+0x42/0x70
[10660.110620]  [<ffffffff8107e66a>] ? __run_hrtimer.isra.30+0x4a/0xe0
[10660.110620]  [<ffffffff8107ef45>] ? hrtimer_interrupt+0xe5/0x220
[10660.110620]  [<ffffffff8104c5a3>] ? smp_apic_timer_interrupt+0x63/0xa0
[10660.110620]  [<ffffffff8186e887>] ? apic_timer_interrupt+0x67/0x70
-- 
Daniel J Blueman
Principal Software Engineer, Numascale Asia

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: False-positive RCU stall warnings on large systems...
  2013-03-05  9:02       ` Daniel J Blueman
@ 2013-03-06 17:03         ` Paul E. McKenney
  0 siblings, 0 replies; 6+ messages in thread
From: Paul E. McKenney @ 2013-03-06 17:03 UTC (permalink / raw)
  To: Daniel J Blueman; +Cc: Paul E. McKenney, Steffen Persvold, LKML

On Tue, Mar 05, 2013 at 05:02:37PM +0800, Daniel J Blueman wrote:
> On 02/26/2013 12:32 AM, Paul E. McKenney wrote:
> >On Wed, Feb 20, 2013 at 11:35:57AM +0800, Daniel J Blueman wrote:
> >>On 20/02/2013 02:16, Paul E. McKenney wrote:
> >>>On Wed, Feb 20, 2013 at 12:34:12AM +0800, Daniel J Blueman wrote:
> >>>>Hi Paul,
> >>>>
> >>>>On some of our larger servers with many hundreds of cores and when
> >>>>under high duress, we can see scheduler RCU stall warnings [1], so
> >>>>find we have to increase the hardcoded RCU_STALL_RAT_DELAY up from 2
> >>>>and RCU_JIFFIES_TILL_FORCE_QS up from 3.
> >
> >Disabling RCU_FAST_NO_HZ will likely remove the need to adjust
> >RCU_JIFFIES_TILL_FORCE_QS.  Changes in my -rcu tree will likely remove the
> >need to adjust these two in 3.10 or 3.11, depending on how testing goes.
> >
> >>>>Is there a more sustainable way to account for this to avoid it
> >>>>being hard-coded, such as making it and dependent timeouts a
> >>>>fraction of CONFIG_RCU_CPU_STALL_TIMEOUT?
> >
> >Maybe...  But what this means is that your system is so heavily loaded
> >that the CPU in question is failing to make it to RCU's softirq handler
> >in two jiffies worth of time.  This is a function of workload rather
> >than of the number of CPUs.
> >
> >>>>On the other hand, perhaps this is just caused by clock jitter (eg
> >>>>due to distance from a contended clock source)? So increasing these
> >>>>a bit may just be adequate in general...
> >>>
> >>>Hmmm...  What version of the kernel are you running?
> >>
> >>The example below occurs with v3.8, but we see the same with
> >>previous kernels eg v3.5.
> >
> >There is always the rcutree.rcu_cpu_stall_timeout parameter that sets
> >the stall timeout in seconds.  This may be specified at boot time or
> >via sysfs at runtime.  The default is now 21 seconds.
> >
> >>Of course, when using the local TSC, you'd see no jitter relative to
> >>coherent transactions (eg memory writes), but when the HPET is used
> >>across a large system, coherent transactions to distant cores are
> >>just so much faster, as there's massive congestion to the shared
> >>HPET behind various HT and PCIe bridges. This could be where the
> >>jitter arises from, if I'm guessing jitter is the problem here.
> >
> >Agreed, timing jitter could cause problems.  That said, the code uses
> >the jiffies counter to compute these timings.  Are you seeing similar
> >memory contention on the jiffies counter itself?
> 
> The contention we see in general are when cores contend for a
> spinlock and when there are lots of concurrent HPET reads (Opterons
> allow only 4 outstanding reads to the IO hub).

Seems to me that the RCU CPU stalls are correctly detecting that your
platform has a problem, namely excessive contention on your HPET hardware.

> It's probably possible to reproduce rcu_sched stalls on a
> quad-socket box with 64 cores and the right workload with the TSC
> disabled.
> 
> In 3.9-rc1 with RCU_FAST_NO_HZ disabled, we've seen stalls of 4
> jiffies [2], but without the "Stall ended" message. This is with a
> workload which allocates ~256GB of memory over 192 cores.

Right, the "stall ended" messages appear only when one CPU detects
another's stall.  In your case below, there is a self-detected stall,
and it does not check.

So I could add a similar check in the self-detect case, if that is what
you are getting at.

							Thanx, Paul

> Thanks,
>   Daniel
> 
> >>>>--- [1]
> >>>>
> >>>>[ 3939.010085] INFO: rcu_sched detected stalls on CPUs/tasks: {}
> >>>>(detected by 1, t=29662 jiffies, g=3053, c=3052, q=598)
> >>>>[ 3939.020008] INFO: Stall ended before state dump start
> 
> --- [2]
> 
> [10660.110620] INFO: rcu_sched self-detected stall on CPU { 39}
> (t=4 jiffies g=1169 c=1168 q=8)
> [10660.110620] Pid: 11747, comm: sp.B Not tainted 3.9.0-rc1-advanced #6
> [10660.110620] Call Trace:
> [10660.110620]  <IRQ>  [<ffffffff810b2a42>] ?
> rcu_check_callbacks+0x2d2/0x5f0
> [10660.110620]  [<ffffffff8107d94a>] ? run_posix_cpu_timers+0x3a/0x790
> [10660.110620]  [<ffffffff8106c86f>] ? update_process_times+0x3f/0x80
> [10660.110620]  [<ffffffff81098280>] ? tick_sched_handle.isra.8+0x30/0x40
> [10660.110620]  [<ffffffff810983b2>] ? tick_sched_timer+0x42/0x70
> [10660.110620]  [<ffffffff8107e66a>] ? __run_hrtimer.isra.30+0x4a/0xe0
> [10660.110620]  [<ffffffff8107ef45>] ? hrtimer_interrupt+0xe5/0x220
> [10660.110620]  [<ffffffff8104c5a3>] ? smp_apic_timer_interrupt+0x63/0xa0
> [10660.110620]  [<ffffffff8186e887>] ? apic_timer_interrupt+0x67/0x70
> -- 
> Daniel J Blueman
> Principal Software Engineer, Numascale Asia
> 


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2013-03-06 17:11 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-02-19 16:34 False-positive RCU stall warnings on large systems Daniel J Blueman
2013-02-19 18:16 ` Paul E. McKenney
2013-02-20  3:35   ` Daniel J Blueman
2013-02-25 16:32     ` Paul E. McKenney
2013-03-05  9:02       ` Daniel J Blueman
2013-03-06 17:03         ` Paul E. McKenney

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox