* offlining cpus breakage

From: Alexey Kardashevskiy @ 2015-01-07 9:37 UTC
To: linuxppc-dev@lists.ozlabs.org, Shreyas B. Prabhu
Cc: Paul Mackerras

Hi!

"ppc64_cpu --smt=off" produces multiple errors on the latest upstream
kernel (sha1 bdec419):

NMI watchdog: BUG: soft lockup - CPU#20 stuck for 23s! [swapper/20:0]

or

INFO: rcu_sched detected stalls on CPUs/tasks: { 2 7 8 9 10 11 12 13 14 15
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31} (detected by 6, t=2102
jiffies, g=1617, c=1616, q=1441)

and many others, all about lockups.

I bisected and found that reverting these helps:

77b54e9f213f76a23736940cf94bcd765fc00f40 powernv/powerpc: Add winkle support for offline cpus
7cba160ad789a3ad7e68b92bf20eaad6ed171f80 powernv/cpuidle: Redesign idle states management
8eb8ac89a364305d05ad16be983b7890eb462cc3 powerpc/powernv: Enable Offline CPUs to enter deep idle states

Btw, reverting just two of them produces a compile error.

It is pseries_le_defconfig on a POWER8 machine:

timebase : 512000000
platform : PowerNV
model    : palmetto
machine  : PowerNV palmetto
firmware : OPAL v3

Please help to fix it. Thanks.

--
Alexey
* Re: offlining cpus breakage

From: Shreyas B Prabhu @ 2015-01-14 4:20 UTC
To: Alexey Kardashevskiy, linuxppc-dev@lists.ozlabs.org
Cc: Preeti U Murthy, Paul Mackerras

Hi,

On Wednesday 07 January 2015 03:07 PM, Alexey Kardashevskiy wrote:
> "ppc64_cpu --smt=off" produces multiple errors on the latest upstream
> kernel (sha1 bdec419):
> [...]
> Please help to fix it. Thanks.

Upon investigation, we found that the cpu is stuck in the cpu_idle_poll()
loop in kernel/sched/idle.c, which leads us to believe the bug is in the
timer offload framework that fastsleep uses. Preeti and I are working on a
fix. We'll post it out as soon as possible.

Thanks,
Shreyas
* Re: offlining cpus breakage

From: Shreyas B Prabhu @ 2015-01-14 11:03 UTC
To: Alexey Kardashevskiy, linuxppc-dev@lists.ozlabs.org
Cc: Preeti U Murthy, Paul Mackerras

On Wednesday 07 January 2015 03:07 PM, Alexey Kardashevskiy wrote:
> "ppc64_cpu --smt=off" produces multiple errors on the latest upstream
> kernel (sha1 bdec419):
> [...]
> I bisected and found that reverting these helps:
> [...]

The bug scenario is as follows: in fastsleep the decrementer state is not
maintained, so a cpu entering fastsleep offloads its timer to a different
cpu (let's call it the broadcast cpu). If this broadcast cpu is then
offlined, it assigns another cpu the task of broadcasting. If that new cpu
is one of the cpus which had entered fastsleep, its decrementer will be in
an invalid state: it was woken up by a need_resched IPI (to take up the
task of broadcasting) rather than by a broadcast IPI, and the decrementer
state is fixed up only on a broadcast IPI, not on a need_resched IPI.
Because of this its timers don't fire, and consequently it cannot wake up
any cpu that is relying on the broadcast IPI.

A cpu that takes up the task of broadcasting while itself in fastsleep is
a corner case. It almost never happens on machines with a larger number of
cores, which explains why Alexey was able to hit it so easily on the
palmetto.

We'll be posting a fix for this soon.

Thanks,
Shreyas
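The sequence Shreyas describes is easier to follow as a small stand-alone
model. The sketch below is purely illustrative and is not kernel code: the
struct, the helper names and the cpu numbers are all invented, under the
assumption stated in the thread that only the broadcast-IPI wakeup path
repairs the decrementer.

/* cc -Wall -o fastsleep_model fastsleep_model.c */
#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 4

struct cpu {
	bool online;
	bool in_fastsleep;
	bool dec_valid;	/* decrementer reprogrammed since last fastsleep */
};

static struct cpu cpus[NR_CPUS];
static int broadcast_cpu;	/* cpu whose timer wakes the fastsleepers */

/* A broadcast IPI is the only wakeup that reprograms the decrementer. */
static void wake_by_broadcast_ipi(int cpu)
{
	cpus[cpu].in_fastsleep = false;
	cpus[cpu].dec_valid = true;
}

/* A need_resched IPI wakes the cpu but leaves the decrementer stale. */
static void wake_by_need_resched_ipi(int cpu)
{
	cpus[cpu].in_fastsleep = false;
}

int main(void)
{
	int i;

	for (i = 0; i < NR_CPUS; i++)
		cpus[i] = (struct cpu){ .online = true, .dec_valid = true };

	/* 1. cpu 1 enters fastsleep: its decrementer contents are lost and
	 *    its timer is offloaded to the broadcast cpu (cpu 0). */
	cpus[1].in_fastsleep = true;
	cpus[1].dec_valid = false;

	/* 2. cpu 0, the broadcast cpu, goes offline and nominates cpu 1 to
	 *    broadcast, waking it with a need_resched IPI, not a broadcast
	 *    IPI. */
	cpus[0].online = false;
	wake_by_need_resched_ipi(1);
	broadcast_cpu = 1;

	/* 3. cpu 1 is now supposed to fire wakeups for everyone else in
	 *    fastsleep, but its own decrementer was never fixed up, so no
	 *    timer fires and the sleepers stay asleep: soft lockup. */
	printf("new broadcast cpu %d, decrementer valid: %s\n",
	       broadcast_cpu, cpus[broadcast_cpu].dec_valid ? "yes" : "NO");

	/* For contrast, a broadcast IPI would have repaired the state. */
	wake_by_broadcast_ipi(1);
	printf("after a broadcast IPI, decrementer valid: %s\n",
	       cpus[1].dec_valid ? "yes" : "no");
	return 0;
}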
* Re: offlining cpus breakage

From: Preeti U Murthy @ 2015-01-15 13:22 UTC
To: Alexey Kardashevskiy, linuxppc-dev@lists.ozlabs.org, Shreyas B. Prabhu
Cc: Paul Mackerras, Anton Blanchard

Hi Alexey,

Can you let me know if the following patch fixes the issue for you?
It did for us on one of the machines we were investigating on.

Anton,
Can you let me know if the patch fixes the odd perf report that you
observed? This is the latest fix there is for it.

-------------------------------START PATCH--------------------------------------

tick/broadcast-hrtimer: Make movement of broadcast hrtimer robust against cpu offline

From: Preeti U Murthy <preeti@linux.vnet.ibm.com>

When a cpu on which the broadcast hrtimer is queued goes offline, the
hrtimer needs to be moved to another cpu. There may be potential race
conditions here which can lead to the hrtimer not being queued on any cpu.

There was a report of softlockups, with cpus being stuck in deep idle
states, when smt mode switching was being done, especially on systems with
a smaller number of cpus [1]. This is hard to reproduce on a large machine
because the chance that an offline cpu was the standby cpu handling the
broadcast is smaller. Given the current code, the only situation that looks
capable of triggering scenarios where broadcast IPIs are not delivered is
the cpu offline path, but I am at a loss to pinpoint the precise scenario.

Therefore, to avoid any possible race condition between cpu offline and
movement of the broadcast hrtimer, this patch ensures that the act of
keeping track of broadcasting wakeup IPIs is started afresh after a cpu
offline operation. This is done by resetting the expiry time of the hrtimer
to a max value. This has to be done in the CPU_DYING phase so that it is
visible to all cpus right after exiting stop_machine.

The rationale is that during cpu offline, since all cpus are woken up
anyway to run stop_machine, we are only required to keep track of
broadcasting to cpus that henceforth enter deep idle states. This ensures
that the broadcast hrtimer gets positively queued on a cpu as long as there
are cpus in deep idle states.

It has to be noted that this code is in addition to retaining the logic
that we have today in the broadcast code on an offline operation. If that
logic fails to move the broadcast hrtimer due to a race condition, this
patch handles it right.

[1] http://linuxppc.10917.n7.nabble.com/offlining-cpus-breakage-td88619.html
There is no issue in programming the decrementer as was presumed and stated
in this link.

Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---
 kernel/time/clockevents.c    | 2 +-
 kernel/time/tick-broadcast.c | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index 5544990..f3907c9 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -568,6 +568,7 @@ int clockevents_notify(unsigned long reason, void *arg)
 
 	case CLOCK_EVT_NOTIFY_CPU_DYING:
 		tick_handover_do_timer(arg);
+		tick_shutdown_broadcast_oneshot(arg);
 		break;
 
 	case CLOCK_EVT_NOTIFY_SUSPEND:
@@ -580,7 +581,6 @@ int clockevents_notify(unsigned long reason, void *arg)
 		break;
 
 	case CLOCK_EVT_NOTIFY_CPU_DEAD:
-		tick_shutdown_broadcast_oneshot(arg);
 		tick_shutdown_broadcast(arg);
 		tick_shutdown(arg);
 		/*
diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
index 066f0ec..1f5eda6 100644
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -677,6 +677,7 @@ static void broadcast_move_bc(int deadcpu)
 		return;
 	/* This moves the broadcast assignment to this cpu */
 	clockevents_program_event(bc, bc->next_event, 1);
+	bc->next_event.tv64 = KTIME_MAX;
 }
 
 /*

---------------------------------------END PATCH-------------------------

Thanks

Regards
Preeti U Murthy
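For readers not familiar with the hrtimer-based broadcast mode this patch
touches: the reason resetting bc->next_event to KTIME_MAX restarts the
bookkeeping is that the broadcast-enter path compares the idling cpu's own
next event against bc->next_event and reprograms the broadcast device when
its event is earlier; for the hrtimer-backed device, the cpu that programs
it is the one that ends up carrying the broadcast hrtimer. The fragment
below is a heavily condensed sketch of that decision, not the actual
tick_broadcast_oneshot_control() body, and the function name is invented.

/*
 * Condensed sketch of what happens when a cpu enters fastsleep, to show
 * why bc->next_event == KTIME_MAX forces a fresh nomination.
 */
static void broadcast_enter_sketch(struct clock_event_device *bc,
				   struct clock_event_device *dev, int cpu)
{
	cpumask_set_cpu(cpu, tick_broadcast_oneshot_mask);

	/*
	 * After the offline path has reset bc->next_event to KTIME_MAX,
	 * this comparison is true for the first cpu that goes into deep
	 * idle afterwards, so that cpu (re)programs the broadcast hrtimer
	 * and effectively takes over the broadcast duty.
	 */
	if (dev->next_event.tv64 < bc->next_event.tv64)
		clockevents_program_event(bc, dev->next_event, 1);
}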
* Re: offlining cpus breakage

From: Alexey Kardashevskiy @ 2015-01-16 0:28 UTC
To: Preeti U Murthy, linuxppc-dev@lists.ozlabs.org, Shreyas B. Prabhu
Cc: Paul Mackerras, Anton Blanchard

On 01/16/2015 02:22 AM, Preeti U Murthy wrote:
> Hi Alexey,
>
> Can you let me know if the following patch fixes the issue for you?
> It did for us on one of the machines we were investigating on.

This fixes the issue for me as well, thanks!

Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>

> [...]

--
Alexey
* Re: offlining cpus breakage

From: Michael Ellerman @ 2015-01-16 3:04 UTC
To: Alexey Kardashevskiy
Cc: Preeti U Murthy, Shreyas B. Prabhu, linuxppc-dev@lists.ozlabs.org, Anton Blanchard, Paul Mackerras

On Fri, 2015-01-16 at 13:28 +1300, Alexey Kardashevskiy wrote:
> On 01/16/2015 02:22 AM, Preeti U Murthy wrote:
> > Hi Alexey,
> >
> > Can you let me know if the following patch fixes the issue for you?
> > It did for us on one of the machines we were investigating on.
>
> This fixes the issue for me as well, thanks!
>
> Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>

OK, that's great.

But I really don't think we can ask upstream to merge this patch to generic
code when we don't have a good explanation for why it's necessary. At least
I'm not going to ask anyone to do that :)

So Preeti, can you either write a 100% convincing explanation of why this
patch is correct in the general case, or (preferably) do some more
investigating to work out what Alexey's bug actually is.

cheers
* Re: offlining cpus breakage

From: Preeti U Murthy @ 2015-01-16 8:56 UTC
To: Michael Ellerman, Alexey Kardashevskiy
Cc: Shreyas B. Prabhu, linuxppc-dev@lists.ozlabs.org, Anton Blanchard, Paul Mackerras

On 01/16/2015 08:34 AM, Michael Ellerman wrote:
> [...]
>
> But I really don't think we can ask upstream to merge this patch to generic
> code when we don't have a good explanation for why it's necessary. At least
> I'm not going to ask anyone to do that :)
>
> So Preeti, can you either write a 100% convincing explanation of why this
> patch is correct in the general case, or (preferably) do some more
> investigating to work out what Alexey's bug actually is.

Yes, will do so. It's better to investigate where precisely the bug is.
This patch helped me narrow down the buggy scenario.

Regards
Preeti U Murthy
* Re: offlining cpus breakage

From: Preeti U Murthy @ 2015-01-16 9:10 UTC
To: Michael Ellerman, Alexey Kardashevskiy
Cc: Shreyas B. Prabhu, linuxppc-dev@lists.ozlabs.org, Anton Blanchard, Paul Mackerras

On 01/16/2015 02:26 PM, Preeti U Murthy wrote:
> [...]
>
> Yes, will do so. It's better to investigate where precisely the bug is.
> This patch helped me narrow down the buggy scenario.

On a side note, while I was tracking the race condition, I noticed that in
the final stage of the cpu offline path, after the state of the hotplugged
cpu is set to CPU_DEAD, we check whether any interrupts were delivered
during the soft-disabled state and service them if there were. It makes
sense to check for pending interrupts in the idle path. In the offline path,
however, this did not look right to me at first glance. Am I missing
something?

Regards
Preeti U Murthy
* Re: offlining cpus breakage

From: Michael Ellerman @ 2015-01-22 5:29 UTC
To: Preeti U Murthy
Cc: Alexey Kardashevskiy, Shreyas B. Prabhu, linuxppc-dev@lists.ozlabs.org, Anton Blanchard, Paul Mackerras

On Fri, 2015-01-16 at 14:40 +0530, Preeti U Murthy wrote:
> [...]
>
> On a side note, while I was tracking the race condition, I noticed that in
> the final stage of the cpu offline path, after the state of the hotplugged
> cpu is set to CPU_DEAD, we check whether any interrupts were delivered
> during the soft-disabled state and service them if there were. It makes
> sense to check for pending interrupts in the idle path. In the offline
> path, however, this did not look right to me at first glance. Am I missing
> something?

That does sound a bit fishy.

I guess we're just assuming that all interrupts have been migrated away
prior to the offline?

cheers
* Re: offlining cpus breakage

From: Preeti U Murthy @ 2015-01-22 6:31 UTC
To: Michael Ellerman
Cc: Alexey Kardashevskiy, Shreyas B. Prabhu, linuxppc-dev@lists.ozlabs.org, Anton Blanchard, Paul Mackerras

On 01/22/2015 10:59 AM, Michael Ellerman wrote:
> [...]
>
> That does sound a bit fishy.
>
> I guess we're just assuming that all interrupts have been migrated away
> prior to the offline?

Yes, they have. Interrupts to start a guest will also not get delivered,
because the hwthread_state is not set to nap yet at that point.

Regards
Preeti U Murthy
* Re: offlining cpus breakage

From: Preeti U Murthy @ 2015-01-17 13:39 UTC
To: Michael Ellerman, Alexey Kardashevskiy
Cc: Shreyas B. Prabhu, Paul Mackerras, Anton Blanchard, Paul E. McKenney, linuxppc-dev@lists.ozlabs.org

On 01/16/2015 08:34 AM, Michael Ellerman wrote:
> [...]
>
> So Preeti, can you either write a 100% convincing explanation of why this
> patch is correct in the general case, or (preferably) do some more
> investigating to work out what Alexey's bug actually is.

On further investigation, I found that the issue lies in the latency of the
cpu hotplug operation, specifically the time taken for the offline cpu to
enter powersave mode.

The time between the beginning of the cpu hotplug operation and the
beginning of the __cpu_die() operation (which is one of the last stages of
cpu hotplug) is at most around 40ms. Although this is not what causes the
softlockups, it is quite a large duration.

The more serious issue is the time taken for the __cpu_die() operation to
complete. __cpu_die() waits for the offline cpu to set its state to
CPU_DEAD before it enters powersave. This time varies from 4s up to a
maximum of 200s! It is not always this bad, but it does happen quite a few
times, and it is during these times that we observe the softlockups. I
added trace prints throughout the cpu hotplug code to measure these
numbers. This delay is what causes the softlockup, and here is why.

If the cpu going offline is the one broadcasting wakeups to cpus in
fastsleep, it queues the broadcast timer on another cpu during the CPU_DEAD
phase. The CPU_DEAD notifiers are run only after the __cpu_die() operation
completes, which, as mentioned above, can take a long time. So between the
time irqs are migrated off the about-to-go-offline cpu and the CPU_DEAD
stage, no cpu can be woken up. The numbers above show that this can be a
horridly long time. Hence, the next time they do get woken up, the
unnaturally long idle time is detected and the softlockup triggers.

The patch I proposed on this thread covered up the problem by allowing the
remaining cpus to freshly re-evaluate their wakeups after the stop_machine
phase without having to depend on the previous broadcast state, so it did
not matter what the previously appointed broadcast cpu was up to. However,
there are still corner cases which cannot be solved with this patch, and
understandably so, because it does not address the core issue, which is how
to get around the latency of cpu hotplug.

There can be ways in which the broadcast timer is migrated in time during
hotplug to get around the softlockups, but the latency of the cpu hotplug
operation looks like a serious issue. Has anybody observed or explicitly
instrumented the cpu hotplug operation before and happened to notice the
large time duration required for its completion?

Ccing Paul.

Thanks

Regards
Preeti U Murthy
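The trace prints mentioned above are not part of any posted patch. For
anyone who wants to reproduce the measurement, something along the lines of
the fragment below, dropped into _cpu_down() in kernel/cpu.c around the
call to __cpu_die(), should show similar numbers. This is an illustrative
timing hack under the assumption that it is placed where __cpu_die() is
already called; it is not a proposed change.

	/* Illustrative timing hack only, not part of any posted patch. */
	ktime_t t0 = ktime_get();

	__cpu_die(cpu);		/* waits for the dying cpu to reach CPU_DEAD */

	pr_info("__cpu_die(%u) took %lld us\n", cpu,
		ktime_to_us(ktime_sub(ktime_get(), t0)));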
* Re: offlining cpus breakage

From: Preeti U Murthy @ 2015-01-18 16:50 UTC
To: Michael Ellerman, Alexey Kardashevskiy
Cc: Shreyas B. Prabhu, Paul Mackerras, Anton Blanchard, Paul E. McKenney, linuxppc-dev@lists.ozlabs.org

On 01/17/2015 07:09 PM, Preeti U Murthy wrote:
> On further investigation, I found that the issue lies in the latency of
> the cpu hotplug operation, specifically the time taken for the offline
> cpu to enter powersave mode.
> [...]
> There can be ways in which the broadcast timer is migrated in time during
> hotplug to get around the softlockups, but the latency of the cpu hotplug
> operation looks like a serious issue.

Ok, finally the problem is clear. The latency observed during hotplug was
the result of a bug in the tick-broadcast framework in the cpu offline
path, as was previously presumed. The problem description and the apt fix
are below.

Alexey, would you mind giving this patch a try yet again? I will post it
out to mainline as soon as you confirm it fixes your issue. This patch is a
tad bit different from the previous one.

Thanks!

Regards
Preeti U Murthy

-------------------------start patch-----------------------------------

tick/broadcast: Make movement of broadcast hrtimer robust against hotplug

From: Preeti U Murthy <preeti@linux.vnet.ibm.com>

Today, if a cpu handling broadcasting of wakeups goes offline, it hands
over the job of broadcasting to another cpu in the CPU_DEAD phase. The
CPU_DEAD notifiers are run only after the offline cpu sets its state as
CPU_DEAD. Meanwhile, the kthread doing the offline is scheduled out while
waiting for this transition, by queuing a timer. This is fatal because if
the cpu on which this kthread was running has no other work queued on it,
it can re-enter deep idle state, since it sees that a broadcast cpu still
exists. However, the broadcast wakeup will never arrive, since the cpu
which was handling it is offline, and this cpu never wakes up to see this
because it is in deep idle state.

Fix this by setting the broadcast timer to a max value so as to force the
cpus entering deep idle states henceforth to freshly nominate the broadcast
cpu. More importantly, this has to be done in the CPU_DYING phase so that
it is visible to all cpus right after exiting stop_machine, which is when
they can re-enter idle. This ensures that the handover of the broadcast
duty falls into place on offline, without having to do it explicitly.

Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---
 kernel/time/clockevents.c    | 2 +-
 kernel/time/tick-broadcast.c | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index 5544990..f3907c9 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -568,6 +568,7 @@ int clockevents_notify(unsigned long reason, void *arg)
 
 	case CLOCK_EVT_NOTIFY_CPU_DYING:
 		tick_handover_do_timer(arg);
+		tick_shutdown_broadcast_oneshot(arg);
 		break;
 
 	case CLOCK_EVT_NOTIFY_SUSPEND:
@@ -580,7 +581,6 @@ int clockevents_notify(unsigned long reason, void *arg)
 		break;
 
 	case CLOCK_EVT_NOTIFY_CPU_DEAD:
-		tick_shutdown_broadcast_oneshot(arg);
 		tick_shutdown_broadcast(arg);
 		tick_shutdown(arg);
 		/*
diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
index 066f0ec..e9c1d9b 100644
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -675,8 +675,8 @@ static void broadcast_move_bc(int deadcpu)
 
 	if (!bc || !broadcast_needs_cpu(bc, deadcpu))
 		return;
-	/* This moves the broadcast assignment to this cpu */
-	clockevents_program_event(bc, bc->next_event, 1);
+	/* This allows fresh nomination of broadcast cpu */
+	bc->next_event.tv64 = KTIME_MAX;
 }
 
 /*
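To line the fix up against the hotplug path itself: the relevant portion of
_cpu_down() in kernel/cpu.c around that time looked roughly like the
fragment below. This is condensed from memory, with error handling,
variable definitions and surrounding code omitted, so it is only meant to
show where the notifiers run relative to stop_machine and __cpu_die(), and
why a CPU_DEAD handover leaves a long window with no broadcast owner while
a CPU_DYING handover does not.

	/*
	 * stop_machine: every online cpu spins here; the dying cpu runs its
	 * CPU_DYING notifiers inside take_cpu_down().  With the patch above,
	 * this is where tick_shutdown_broadcast_oneshot() now resets
	 * bc->next_event to KTIME_MAX, so every surviving cpu sees it the
	 * moment it is allowed to go idle again.
	 */
	err = __stop_machine(take_cpu_down, &tcd_param, cpumask_of(cpu));

	/*
	 * This wait can take a very long time (the 4s..200s measured earlier
	 * in the thread): it returns only once the offline cpu has set
	 * CPU_DEAD and parked itself.
	 */
	__cpu_die(cpu);

	/*
	 * Old location of the broadcast handover.  Until this notifier chain
	 * ran, no cpu owned the broadcast hrtimer, so cpus already in
	 * fastsleep could not be woken, hence the soft lockups.
	 */
	cpu_notify_nofail(CPU_DEAD | mod, hcpu);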