* offlining cpus breakage

From: Alexey Kardashevskiy @ 2015-01-07 9:37 UTC
To: linuxppc-dev@lists.ozlabs.org, Shreyas B. Prabhu
Cc: Paul Mackerras

Hi!

"ppc64_cpu --smt=off" produces multiple errors on the latest upstream
kernel (sha1 bdec419):

NMI watchdog: BUG: soft lockup - CPU#20 stuck for 23s! [swapper/20:0]

or

INFO: rcu_sched detected stalls on CPUs/tasks: { 2 7 8 9 10 11 12 13 14 15
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31} (detected by 6, t=2102
jiffies, g=1617, c=1616, q=1441)

and many others, all about lockups.

I bisected and found that reverting these helps:

77b54e9f213f76a23736940cf94bcd765fc00f40 powernv/powerpc: Add winkle support for offline cpus
7cba160ad789a3ad7e68b92bf20eaad6ed171f80 powernv/cpuidle: Redesign idle states management
8eb8ac89a364305d05ad16be983b7890eb462cc3 powerpc/powernv: Enable Offline CPUs to enter deep idle states

Btw, reverting just two of them produces a compile error.

It is pseries_le_defconfig on a POWER8 machine:

timebase : 512000000
platform : PowerNV
model    : palmetto
machine  : PowerNV palmetto
firmware : OPAL v3

Please help to fix it. Thanks.

--
Alexey
* Re: offlining cpus breakage

From: Shreyas B Prabhu @ 2015-01-14 4:20 UTC
To: Alexey Kardashevskiy, linuxppc-dev@lists.ozlabs.org
Cc: Preeti U Murthy, Paul Mackerras

Hi,

On Wednesday 07 January 2015 03:07 PM, Alexey Kardashevskiy wrote:
> "ppc64_cpu --smt=off" produces multiple errors on the latest upstream
> kernel (sha1 bdec419):
> [...]
> Please help to fix it. Thanks.

Upon investigation, we found that the cpu is stuck in the cpu_idle_poll()
loop in kernel/sched/idle.c, which leads us to believe the bug is in the
timer offload framework that fastsleep uses. Preeti and I are working on a
fix. We'll post it out as soon as possible.

Thanks,
Shreyas
* Re: offlining cpus breakage

From: Shreyas B Prabhu @ 2015-01-14 11:03 UTC
To: Alexey Kardashevskiy, linuxppc-dev@lists.ozlabs.org
Cc: Preeti U Murthy, Paul Mackerras

On Wednesday 07 January 2015 03:07 PM, Alexey Kardashevskiy wrote:
> "ppc64_cpu --smt=off" produces multiple errors on the latest upstream
> kernel (sha1 bdec419):
> [...]
> I bisected and found that reverting these helps:
> [...]

The bug scenario is as follows: in fastsleep the decrementer state is not
maintained, so a cpu entering fastsleep offloads its timer to a different
cpu (let's call it the broadcast cpu). If this broadcast cpu is then
offlined, it assigns another cpu the task of broadcasting. If that new cpu
is one of the cpus which had entered fastsleep, its decrementer will be in
an invalid state: it was woken up by a need_resched IPI (to take up the
task of broadcasting) rather than by a broadcast IPI, and the decrementer
state is fixed up only on a broadcast IPI, not on a need_resched IPI.
Because of this its timers don't fire, and consequently it cannot wake up
any cpu that is relying on the broadcast IPI.

A cpu that takes up the task of broadcasting while itself in fastsleep is
a corner case. It almost never happens on machines with a larger number of
cores, which explains why Alexey was able to hit it so easily on the
palmetto.

We'll be posting a fix for this soon.

Thanks,
Shreyas
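The sequence Shreyas describes is easier to follow as a small stand-alone
model. The sketch below is purely illustrative and is not kernel code: the
struct, the helper names and the cpu numbers are all invented, under the
assumption stated in the thread that only the broadcast-IPI wakeup path
repairs the decrementer.

/* cc -Wall -o fastsleep_model fastsleep_model.c */
#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 4

struct cpu {
	bool online;
	bool in_fastsleep;
	bool dec_valid;	/* decrementer reprogrammed since last fastsleep */
};

static struct cpu cpus[NR_CPUS];
static int broadcast_cpu;	/* cpu whose timer wakes the fastsleepers */

/* A broadcast IPI is the only wakeup that reprograms the decrementer. */
static void wake_by_broadcast_ipi(int cpu)
{
	cpus[cpu].in_fastsleep = false;
	cpus[cpu].dec_valid = true;
}

/* A need_resched IPI wakes the cpu but leaves the decrementer stale. */
static void wake_by_need_resched_ipi(int cpu)
{
	cpus[cpu].in_fastsleep = false;
}

int main(void)
{
	int i;

	for (i = 0; i < NR_CPUS; i++)
		cpus[i] = (struct cpu){ .online = true, .dec_valid = true };

	/* 1. cpu 1 enters fastsleep: its decrementer contents are lost and
	 *    its timer is offloaded to the broadcast cpu (cpu 0). */
	cpus[1].in_fastsleep = true;
	cpus[1].dec_valid = false;

	/* 2. cpu 0, the broadcast cpu, goes offline and nominates cpu 1 to
	 *    broadcast, waking it with a need_resched IPI, not a broadcast
	 *    IPI. */
	cpus[0].online = false;
	wake_by_need_resched_ipi(1);
	broadcast_cpu = 1;

	/* 3. cpu 1 is now supposed to fire wakeups for everyone else in
	 *    fastsleep, but its own decrementer was never fixed up, so no
	 *    timer fires and the sleepers stay asleep: soft lockup. */
	printf("new broadcast cpu %d, decrementer valid: %s\n",
	       broadcast_cpu, cpus[broadcast_cpu].dec_valid ? "yes" : "NO");

	/* For contrast, a broadcast IPI would have repaired the state. */
	wake_by_broadcast_ipi(1);
	printf("after a broadcast IPI, decrementer valid: %s\n",
	       cpus[1].dec_valid ? "yes" : "no");
	return 0;
}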
* Re: offlining cpus breakage

From: Preeti U Murthy @ 2015-01-15 13:22 UTC
To: Alexey Kardashevskiy, linuxppc-dev@lists.ozlabs.org, Shreyas B. Prabhu
Cc: Paul Mackerras, Anton Blanchard

Hi Alexey,

Can you let me know if the following patch fixes the issue for you?
It did for us on one of the machines we were investigating on.

Anton,
Can you let me know if the patch fixes the odd perf report that you
observed? This is the latest fix there is for it.

-------------------------------START PATCH--------------------------------------

tick/broadcast-hrtimer: Make movement of broadcast hrtimer robust against cpu offline

From: Preeti U Murthy <preeti@linux.vnet.ibm.com>

When a cpu on which the broadcast hrtimer is queued goes offline, the
hrtimer needs to be moved to another cpu. There may be potential race
conditions here which can lead to the hrtimer not being queued on any cpu.

There was a report of softlockups, with cpus being stuck in deep idle
states, when smt mode switching was being done, especially on systems with
a smaller number of cpus [1]. This is hard to reproduce on a large machine
because the chance that an offline cpu was the standby cpu handling the
broadcast is smaller. Given the current code, the only situation that looks
capable of triggering scenarios where broadcast IPIs are not delivered is
the cpu offline path, but I am at a loss to pinpoint the precise scenario.

Therefore, to avoid any possible race condition between cpu offline and
movement of the broadcast hrtimer, this patch ensures that the act of
keeping track of broadcasting wakeup IPIs is started afresh after a cpu
offline operation. This is done by resetting the expiry time of the hrtimer
to a max value. This has to be done in the CPU_DYING phase so that it is
visible to all cpus right after exiting stop_machine.

The rationale is that during cpu offline, since all cpus are woken up
anyway to run stop_machine, we are only required to keep track of
broadcasting to cpus that henceforth enter deep idle states. This ensures
that the broadcast hrtimer gets positively queued on a cpu as long as there
are cpus in deep idle states.

It has to be noted that this code is in addition to retaining the logic
that we have today in the broadcast code on an offline operation. If that
logic fails to move the broadcast hrtimer due to a race condition, this
patch handles it right.

[1] http://linuxppc.10917.n7.nabble.com/offlining-cpus-breakage-td88619.html
There is no issue in programming the decrementer as was presumed and stated
in this link.

Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---
 kernel/time/clockevents.c    | 2 +-
 kernel/time/tick-broadcast.c | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index 5544990..f3907c9 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -568,6 +568,7 @@ int clockevents_notify(unsigned long reason, void *arg)
 
 	case CLOCK_EVT_NOTIFY_CPU_DYING:
 		tick_handover_do_timer(arg);
+		tick_shutdown_broadcast_oneshot(arg);
 		break;
 
 	case CLOCK_EVT_NOTIFY_SUSPEND:
@@ -580,7 +581,6 @@ int clockevents_notify(unsigned long reason, void *arg)
 		break;
 
 	case CLOCK_EVT_NOTIFY_CPU_DEAD:
-		tick_shutdown_broadcast_oneshot(arg);
 		tick_shutdown_broadcast(arg);
 		tick_shutdown(arg);
 		/*
diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
index 066f0ec..1f5eda6 100644
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -677,6 +677,7 @@ static void broadcast_move_bc(int deadcpu)
 		return;
 	/* This moves the broadcast assignment to this cpu */
 	clockevents_program_event(bc, bc->next_event, 1);
+	bc->next_event.tv64 = KTIME_MAX;
 }
 
 /*

---------------------------------------END PATCH-------------------------

Thanks

Regards
Preeti U Murthy
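For readers not familiar with the hrtimer-based broadcast mode this patch
touches: the reason resetting bc->next_event to KTIME_MAX restarts the
bookkeeping is that the broadcast-enter path compares the idling cpu's own
next event against bc->next_event and reprograms the broadcast device when
its event is earlier; for the hrtimer-backed device, the cpu that programs
it is the one that ends up carrying the broadcast hrtimer. The fragment
below is a heavily condensed sketch of that decision, not the actual
tick_broadcast_oneshot_control() body, and the function name is invented.

/*
 * Condensed sketch of what happens when a cpu enters fastsleep, to show
 * why bc->next_event == KTIME_MAX forces a fresh nomination.
 */
static void broadcast_enter_sketch(struct clock_event_device *bc,
				   struct clock_event_device *dev, int cpu)
{
	cpumask_set_cpu(cpu, tick_broadcast_oneshot_mask);

	/*
	 * After the offline path has reset bc->next_event to KTIME_MAX,
	 * this comparison is true for the first cpu that goes into deep
	 * idle afterwards, so that cpu (re)programs the broadcast hrtimer
	 * and effectively takes over the broadcast duty.
	 */
	if (dev->next_event.tv64 < bc->next_event.tv64)
		clockevents_program_event(bc, dev->next_event, 1);
}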
* Re: offlining cpus breakage

From: Alexey Kardashevskiy @ 2015-01-16 0:28 UTC
To: Preeti U Murthy, linuxppc-dev@lists.ozlabs.org, Shreyas B. Prabhu
Cc: Paul Mackerras, Anton Blanchard

On 01/16/2015 02:22 AM, Preeti U Murthy wrote:
> Hi Alexey,
>
> Can you let me know if the following patch fixes the issue for you?
> It did for us on one of the machines we were investigating on.

This fixes the issue for me as well, thanks!

Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>

> [...]

--
Alexey
* Re: offlining cpus breakage

From: Michael Ellerman @ 2015-01-16 3:04 UTC
To: Alexey Kardashevskiy
Cc: Preeti U Murthy, Shreyas B. Prabhu, linuxppc-dev@lists.ozlabs.org, Anton Blanchard, Paul Mackerras

On Fri, 2015-01-16 at 13:28 +1300, Alexey Kardashevskiy wrote:
> On 01/16/2015 02:22 AM, Preeti U Murthy wrote:
> > Hi Alexey,
> >
> > Can you let me know if the following patch fixes the issue for you?
> > It did for us on one of the machines we were investigating on.
>
> This fixes the issue for me as well, thanks!
>
> Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>

OK, that's great.

But I really don't think we can ask upstream to merge this patch to generic
code when we don't have a good explanation for why it's necessary. At least
I'm not going to ask anyone to do that :)

So Preeti, can you either write a 100% convincing explanation of why this
patch is correct in the general case, or (preferably) do some more
investigating to work out what Alexey's bug actually is.

cheers
* Re: offlining cpus breakage

From: Preeti U Murthy @ 2015-01-16 8:56 UTC
To: Michael Ellerman, Alexey Kardashevskiy
Cc: Shreyas B. Prabhu, linuxppc-dev@lists.ozlabs.org, Anton Blanchard, Paul Mackerras

On 01/16/2015 08:34 AM, Michael Ellerman wrote:
> [...]
>
> But I really don't think we can ask upstream to merge this patch to generic
> code when we don't have a good explanation for why it's necessary. At least
> I'm not going to ask anyone to do that :)
>
> So Preeti, can you either write a 100% convincing explanation of why this
> patch is correct in the general case, or (preferably) do some more
> investigating to work out what Alexey's bug actually is.

Yes, will do so. It's better to investigate where precisely the bug is.
This patch helped me narrow down the buggy scenario.

Regards
Preeti U Murthy
* Re: offlining cpus breakage

From: Preeti U Murthy @ 2015-01-16 9:10 UTC
To: Michael Ellerman, Alexey Kardashevskiy
Cc: Shreyas B. Prabhu, linuxppc-dev@lists.ozlabs.org, Anton Blanchard, Paul Mackerras

On 01/16/2015 02:26 PM, Preeti U Murthy wrote:
> [...]
>
> Yes, will do so. It's better to investigate where precisely the bug is.
> This patch helped me narrow down the buggy scenario.

On a side note, while I was tracking the race condition, I noticed that in
the final stage of the cpu offline path, after the state of the hotplugged
cpu is set to CPU_DEAD, we check whether any interrupts were delivered
during the soft-disabled state and service them if there were. It makes
sense to check for pending interrupts in the idle path. In the offline path,
however, this did not look right to me at first glance. Am I missing
something?

Regards
Preeti U Murthy
* Re: offlining cpus breakage

From: Michael Ellerman @ 2015-01-22 5:29 UTC
To: Preeti U Murthy
Cc: Alexey Kardashevskiy, Shreyas B. Prabhu, linuxppc-dev@lists.ozlabs.org, Anton Blanchard, Paul Mackerras

On Fri, 2015-01-16 at 14:40 +0530, Preeti U Murthy wrote:
> [...]
>
> On a side note, while I was tracking the race condition, I noticed that in
> the final stage of the cpu offline path, after the state of the hotplugged
> cpu is set to CPU_DEAD, we check whether any interrupts were delivered
> during the soft-disabled state and service them if there were. It makes
> sense to check for pending interrupts in the idle path. In the offline
> path, however, this did not look right to me at first glance. Am I missing
> something?

That does sound a bit fishy.

I guess we're just assuming that all interrupts have been migrated away
prior to the offline?

cheers
* Re: offlining cpus breakage

From: Preeti U Murthy @ 2015-01-22 6:31 UTC
To: Michael Ellerman
Cc: Alexey Kardashevskiy, Shreyas B. Prabhu, linuxppc-dev@lists.ozlabs.org, Anton Blanchard, Paul Mackerras

On 01/22/2015 10:59 AM, Michael Ellerman wrote:
> [...]
>
> That does sound a bit fishy.
>
> I guess we're just assuming that all interrupts have been migrated away
> prior to the offline?

Yes, they have. Interrupts to start a guest will also not get delivered,
because the hwthread_state is not set to nap yet at that point.

Regards
Preeti U Murthy
* Re: offlining cpus breakage

From: Preeti U Murthy @ 2015-01-17 13:39 UTC
To: Michael Ellerman, Alexey Kardashevskiy
Cc: Shreyas B. Prabhu, Paul Mackerras, Anton Blanchard, Paul E. McKenney, linuxppc-dev@lists.ozlabs.org

On 01/16/2015 08:34 AM, Michael Ellerman wrote:
> [...]
>
> So Preeti, can you either write a 100% convincing explanation of why this
> patch is correct in the general case, or (preferably) do some more
> investigating to work out what Alexey's bug actually is.

On further investigation, I found that the issue lies in the latency of the
cpu hotplug operation, specifically the time taken for the offline cpu to
enter powersave mode.

The time between the beginning of the cpu hotplug operation and the
beginning of the __cpu_die() operation (which is one of the last stages of
cpu hotplug) is at most around 40ms. Although this is not what causes the
softlockups, it is quite a large duration.

The more serious issue is the time taken for the __cpu_die() operation to
complete. __cpu_die() waits for the offline cpu to set its state to
CPU_DEAD before it enters powersave. This time varies from 4s up to a
maximum of 200s! It is not always this bad, but it does happen quite a few
times, and it is during these times that we observe the softlockups. I
added trace prints throughout the cpu hotplug code to measure these
numbers. This delay is what causes the softlockup, and here is why.

If the cpu going offline is the one broadcasting wakeups to cpus in
fastsleep, it queues the broadcast timer on another cpu during the CPU_DEAD
phase. The CPU_DEAD notifiers are run only after the __cpu_die() operation
completes, which, as mentioned above, can take a long time. So between the
time irqs are migrated off the about-to-go-offline cpu and the CPU_DEAD
stage, no cpu can be woken up. The numbers above show that this can be a
horridly long time. Hence, the next time they do get woken up, the
unnaturally long idle time is detected and the softlockup triggers.

The patch I proposed on this thread covered up the problem by allowing the
remaining cpus to freshly re-evaluate their wakeups after the stop_machine
phase without having to depend on the previous broadcast state, so it did
not matter what the previously appointed broadcast cpu was up to. However,
there are still corner cases which cannot be solved with this patch, and
understandably so, because it does not address the core issue, which is how
to get around the latency of cpu hotplug.

There can be ways in which the broadcast timer is migrated in time during
hotplug to get around the softlockups, but the latency of the cpu hotplug
operation looks like a serious issue. Has anybody observed or explicitly
instrumented the cpu hotplug operation before and happened to notice the
large time duration required for its completion?

Ccing Paul.

Thanks

Regards
Preeti U Murthy
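The trace prints mentioned above are not part of any posted patch. For
anyone who wants to reproduce the measurement, something along the lines of
the fragment below, dropped into _cpu_down() in kernel/cpu.c around the
call to __cpu_die(), should show similar numbers. This is an illustrative
timing hack under the assumption that it is placed where __cpu_die() is
already called; it is not a proposed change.

	/* Illustrative timing hack only, not part of any posted patch. */
	ktime_t t0 = ktime_get();

	__cpu_die(cpu);		/* waits for the dying cpu to reach CPU_DEAD */

	pr_info("__cpu_die(%u) took %lld us\n", cpu,
		ktime_to_us(ktime_sub(ktime_get(), t0)));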
* Re: offlining cpus breakage

From: Preeti U Murthy @ 2015-01-18 16:50 UTC
To: Michael Ellerman, Alexey Kardashevskiy
Cc: Shreyas B. Prabhu, Paul Mackerras, Anton Blanchard, Paul E. McKenney, linuxppc-dev@lists.ozlabs.org

On 01/17/2015 07:09 PM, Preeti U Murthy wrote:
> On further investigation, I found that the issue lies in the latency of
> the cpu hotplug operation, specifically the time taken for the offline
> cpu to enter powersave mode.
> [...]
> There can be ways in which the broadcast timer is migrated in time during
> hotplug to get around the softlockups, but the latency of the cpu hotplug
> operation looks like a serious issue.

Ok, finally the problem is clear. The latency observed during hotplug was
the result of a bug in the tick-broadcast framework in the cpu offline
path, as was previously presumed. The problem description and the apt fix
are below.

Alexey, would you mind giving this patch a try yet again? I will post it
out to mainline as soon as you confirm it fixes your issue. This patch is a
tad bit different from the previous one.

Thanks!

Regards
Preeti U Murthy

-------------------------start patch-----------------------------------

tick/broadcast: Make movement of broadcast hrtimer robust against hotplug

From: Preeti U Murthy <preeti@linux.vnet.ibm.com>

Today, if a cpu handling broadcasting of wakeups goes offline, it hands
over the job of broadcasting to another cpu in the CPU_DEAD phase. The
CPU_DEAD notifiers are run only after the offline cpu sets its state as
CPU_DEAD. Meanwhile, the kthread doing the offline is scheduled out while
waiting for this transition, by queuing a timer. This is fatal because if
the cpu on which this kthread was running has no other work queued on it,
it can re-enter deep idle state, since it sees that a broadcast cpu still
exists. However, the broadcast wakeup will never arrive, since the cpu
which was handling it is offline, and this cpu never wakes up to see this
because it is in deep idle state.

Fix this by setting the broadcast timer to a max value so as to force the
cpus entering deep idle states henceforth to freshly nominate the broadcast
cpu. More importantly, this has to be done in the CPU_DYING phase so that
it is visible to all cpus right after exiting stop_machine, which is when
they can re-enter idle. This ensures that the handover of the broadcast
duty falls into place on offline, without having to do it explicitly.

Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---
 kernel/time/clockevents.c    | 2 +-
 kernel/time/tick-broadcast.c | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index 5544990..f3907c9 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -568,6 +568,7 @@ int clockevents_notify(unsigned long reason, void *arg)
 
 	case CLOCK_EVT_NOTIFY_CPU_DYING:
 		tick_handover_do_timer(arg);
+		tick_shutdown_broadcast_oneshot(arg);
 		break;
 
 	case CLOCK_EVT_NOTIFY_SUSPEND:
@@ -580,7 +581,6 @@ int clockevents_notify(unsigned long reason, void *arg)
 		break;
 
 	case CLOCK_EVT_NOTIFY_CPU_DEAD:
-		tick_shutdown_broadcast_oneshot(arg);
 		tick_shutdown_broadcast(arg);
 		tick_shutdown(arg);
 		/*
diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
index 066f0ec..e9c1d9b 100644
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -675,8 +675,8 @@ static void broadcast_move_bc(int deadcpu)
 
 	if (!bc || !broadcast_needs_cpu(bc, deadcpu))
 		return;
-	/* This moves the broadcast assignment to this cpu */
-	clockevents_program_event(bc, bc->next_event, 1);
+	/* This allows fresh nomination of broadcast cpu */
+	bc->next_event.tv64 = KTIME_MAX;
 }
 
 /*
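To line the fix up against the hotplug path itself: the relevant portion of
_cpu_down() in kernel/cpu.c around that time looked roughly like the
fragment below. This is condensed from memory, with error handling,
variable definitions and surrounding code omitted, so it is only meant to
show where the notifiers run relative to stop_machine and __cpu_die(), and
why a CPU_DEAD handover leaves a long window with no broadcast owner while
a CPU_DYING handover does not.

	/*
	 * stop_machine: every online cpu spins here; the dying cpu runs its
	 * CPU_DYING notifiers inside take_cpu_down().  With the patch above,
	 * this is where tick_shutdown_broadcast_oneshot() now resets
	 * bc->next_event to KTIME_MAX, so every surviving cpu sees it the
	 * moment it is allowed to go idle again.
	 */
	err = __stop_machine(take_cpu_down, &tcd_param, cpumask_of(cpu));

	/*
	 * This wait can take a very long time (the 4s..200s measured earlier
	 * in the thread): it returns only once the offline cpu has set
	 * CPU_DEAD and parked itself.
	 */
	__cpu_die(cpu);

	/*
	 * Old location of the broadcast handover.  Until this notifier chain
	 * ran, no cpu owned the broadcast hrtimer, so cpus already in
	 * fastsleep could not be woken, hence the soft lockups.
	 */
	cpu_notify_nofail(CPU_DEAD | mod, hcpu);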