[PATCH] Fix spurious hard lockup events while in debugger

public inbox for linux-kernel@vger.kernel.org

* [PATCH] Fix spurious hard lockup events while in debugger
@ 2015-12-15  3:30 Jeff Merkey
  2015-12-16 16:57 ` Jeff Merkey
  2015-12-16 17:22 ` Jeff Merkey
  0 siblings, 2 replies; 5+ messages in thread
From: Jeff Merkey @ 2015-12-15  3:30 UTC (permalink / raw)
  To: linux-kernel; +Cc: akpm, uobergfe, dzickus, atomlin, cmetcalf, fweisbec

The current touch_nmi_watchdog() function in kernel/watchdog.c does
not catch every case where a processor is spinning in the NMI handler
inside KGDB, KDB, or MDB.  In particular, it misses the case where a
processor is being held by a debugger inside an int1 handler.

The hrtimer_interrupts_saved count can still end up matching the
current hrtimer_interrupts count in some cases, so the hard lockup
detector flags processors that are held inside a debugger and triggers
a panic.
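
As a rough userspace sketch of that comparison (not the kernel code
itself): it assumes is_hardlockup() compares hrtimer_interrupts_saved
against the current hrtimer_interrupts count, and the struct and all
model_* names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model of one CPU's watchdog counters (per-CPU variables in the kernel). */
struct model_watchdog {
	unsigned long hrtimer_interrupts;       /* bumped by the watchdog hrtimer */
	unsigned long hrtimer_interrupts_saved; /* count seen by the previous check */
};

/* Models the watchdog hrtimer firing once on this CPU. */
static void model_hrtimer_tick(struct model_watchdog *wd)
{
	wd->hrtimer_interrupts++;
}

/* Assumed check: no new ticks since the previous check means "locked up". */
static bool model_is_hardlockup(struct model_watchdog *wd)
{
	unsigned long hrint = wd->hrtimer_interrupts;

	if (wd->hrtimer_interrupts_saved == hrint)
		return true;

	wd->hrtimer_interrupts_saved = hrint;
	return false;
}

/* Models the touch from the patch below: clear the saved count. */
static void model_touch_hardlockup_watchdog(struct model_watchdog *wd)
{
	wd->hrtimer_interrupts_saved = 0;
}

/* Walk through one tick, a stall, a touch, and a second stall. */
static void model_example(void)
{
	struct model_watchdog wd = { 0, 0 };

	model_hrtimer_tick(&wd);
	assert(!model_is_hardlockup(&wd)); /* a tick arrived since the last check */
	assert(model_is_hardlockup(&wd));  /* no tick since then: flagged */

	model_touch_hardlockup_watchdog(&wd);
	assert(!model_is_hardlockup(&wd)); /* the touch hides the stall once */
	assert(model_is_hardlockup(&wd));  /* still no tick: flagged again */
}
```

In this model a touch only covers a single check, and it has no effect
while hrtimer_interrupts is still 0, so a held processor would have to
be touched again before every check.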

The patch below corrects this problem.  I did not add this to
touch_nmi_watchdog() directly because of possible effects on
timing, since that function is widely used by drivers and
modules.

I have tested this patch, and it stops the errant hard lockup events
that kernel debuggers hit when processors are spinning inside the
debugger.

Signed-off-by: Jeff Merkey <linux.mdb@gmail.com>
---
 kernel/watchdog.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 18f34cf..b682aab 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -283,6 +283,13 @@ static bool is_hardlockup(void)
 	__this_cpu_write(hrtimer_interrupts_saved, hrint);
 	return false;
 }
+
+void touch_hardlockup_watchdog(void)
+{
+	__this_cpu_write(hrtimer_interrupts_saved, 0);
+}
+EXPORT_SYMBOL_GPL(touch_hardlockup_watchdog);
+
 #endif

 static int is_softlockup(unsigned long touch_ts)
-- 
1.8.3.1


* Re: [PATCH] Fix spurious hard lockup events while in debugger
  2015-12-15  3:30 [PATCH] Fix spurious hard lockup events while in debugger Jeff Merkey
@ 2015-12-16 16:57 ` Jeff Merkey
  2015-12-16 17:22 ` Jeff Merkey
  1 sibling, 0 replies; 5+ messages in thread
From: Jeff Merkey @ 2015-12-16 16:57 UTC (permalink / raw)
  To: linux-kernel; +Cc: akpm, uobergfe, dzickus, atomlin, cmetcalf, fweisbec

I stared at the function that detects hard lockups until my eyes
turned red, and the code looks OK.  I am still trying to figure out why
this is happening.  If I call touch_nmi_watchdog() without the other
function, the hard lockup still fires.

I'll try to debug it and determine the cause.  The touch flag
operating independently of the counter may be the clue.  I'll look
into it a little more today.
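
To make the flag-versus-counter interaction concrete, here is a small
userspace model.  It assumes the NMI-side check consumes a per-CPU
touch flag first and only then compares the two counters; the struct
and the model_* names are hypothetical, not the kernel's.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model of one CPU's state; all names are illustrative only. */
struct model_cpu {
	bool nmi_touch;            /* models the per-CPU touch flag */
	unsigned long ticks;       /* models hrtimer_interrupts */
	unsigned long ticks_saved; /* models hrtimer_interrupts_saved */
};

/* Models touch_nmi_watchdog(): only the flag is set, the counters are untouched. */
static void model_touch_nmi_watchdog(struct model_cpu *c)
{
	c->nmi_touch = true;
}

/* Models one NMI-side check; returns true if a hard lockup would be reported. */
static bool model_nmi_check(struct model_cpu *c)
{
	if (c->nmi_touch) {
		/* The flag is consumed and the counter comparison is skipped. */
		c->nmi_touch = false;
		return false;
	}

	if (c->ticks_saved == c->ticks)
		return true;

	c->ticks_saved = c->ticks;
	return false;
}

/* A CPU held in a debugger: no ticks arrive, and it is touched only once. */
static void model_example(void)
{
	struct model_cpu c = { false, 5, 5 };

	model_touch_nmi_watchdog(&c);
	assert(!model_nmi_check(&c)); /* the first check is absorbed by the flag */
	assert(model_nmi_check(&c));  /* the second check sees equal counters */
}
```

Under these assumptions a single touch suppresses exactly one check,
so a processor that cannot touch itself again is reported on the next
one.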

Jeff


* Re: [PATCH] Fix spurious hard lockup events while in debugger
  2015-12-15  3:30 [PATCH] Fix spurious hard lockup events while in debugger Jeff Merkey
  2015-12-16 16:57 ` Jeff Merkey
@ 2015-12-16 17:22 ` Jeff Merkey
  2015-12-16 17:50   ` Don Zickus
  2015-12-16 17:55   ` Jeff Merkey
  1 sibling, 2 replies; 5+ messages in thread
From: Jeff Merkey @ 2015-12-16 17:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: akpm, uobergfe, dzickus, atomlin, cmetcalf, fweisbec, tglx, mingo,
	hpa, x86, peterz, luto

I got to the bottom of it.  It's related to the hardware I am using.
One of the processors is faulting and hanging due to an existing bug
in the hw_breakpoint handler not setting the resume flag (I have
previously reported it and submitted a patch).  This breaks your code,
but there's nothing you can do about it.

There is a severe bug in hw_breakpoint.c that causes int1 recursion
and this whole "lazy debug register switching" nonsense does not work
properly.  I am probably the first person to actually test this code
path robustly.  I applied the patch that fixes this bug in
hw_breakpoint.c and the problem with your code firing off and ignoring
the touch flag went away.
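
For context, the resume flag is RF, bit 16 of EFLAGS on x86; when it
is set, an instruction breakpoint is suppressed for the instruction
being resumed.  A toy model of why returning without RF set re-enters
int1 follows.  The model_* names are hypothetical, and this is not the
hw_breakpoint.c code.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_EFLAGS_RF (1UL << 16) /* x86 EFLAGS resume flag, bit 16 */

/* Minimal stand-in for the saved register state of an interrupted CPU. */
struct model_regs {
	unsigned long ip;    /* address of the instruction to resume */
	unsigned long flags; /* saved EFLAGS image */
};

/* An instruction breakpoint at bp_addr fires on resume unless RF is set. */
static bool model_insn_breakpoint_fires(const struct model_regs *regs,
					unsigned long bp_addr)
{
	return regs->ip == bp_addr && !(regs->flags & MODEL_EFLAGS_RF);
}

/* What a handler has to do before returning to the breakpointed instruction. */
static void model_set_resume_flag(struct model_regs *regs)
{
	regs->flags |= MODEL_EFLAGS_RF;
}

/* Resume at the breakpoint address with and without RF set. */
static void model_example(void)
{
	struct model_regs regs = { 0x1000UL, 0UL };

	assert(model_insn_breakpoint_fires(&regs, 0x1000UL));  /* re-enters int1 */
	model_set_resume_flag(&regs);
	assert(!model_insn_breakpoint_fires(&regs, 0x1000UL)); /* makes progress */
}
```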

Jeff


* Re: [PATCH] Fix spurious hard lockup events while in debugger
  2015-12-16 17:22 ` Jeff Merkey
@ 2015-12-16 17:50   ` Don Zickus
  2015-12-16 17:55   ` Jeff Merkey
  1 sibling, 0 replies; 5+ messages in thread
From: Don Zickus @ 2015-12-16 17:50 UTC (permalink / raw)
  To: Jeff Merkey
  Cc: linux-kernel, akpm, uobergfe, atomlin, cmetcalf, fweisbec, tglx,
	mingo, hpa, x86, peterz, luto

On Wed, Dec 16, 2015 at 10:22:18AM -0700, Jeff Merkey wrote:
> I got to the bottom of it.  It's related to the hardware I am using.
> One of the processors is faulting and hanging due to an existing bug
> in the hw_breakpoint handler not setting the resume flag (I have
> previously reported it and submitted a patch).  This breaks your code,
> but there's nothing you can do about it.
> 
> There is a severe bug in hw_breakpoint.c that causes int1 recursion
> and this whole "lazy debug register switching" nonsense does not work
> properly.  I am probably the first person to actually test this code
> path robustly.  I applied the patch that fixes this bug in
> hw_breakpoint.c and the problem with your code firing off and ignoring
> the touch flag went away.

Ah, good to know.  Thanks!  I'll drop this patch then.

Cheers,
Don


* Re: [PATCH] Fix spurious hard lockup events while in debugger
  2015-12-16 17:22 ` Jeff Merkey
  2015-12-16 17:50   ` Don Zickus
@ 2015-12-16 17:55   ` Jeff Merkey
  1 sibling, 0 replies; 5+ messages in thread
From: Jeff Merkey @ 2015-12-16 17:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: akpm, uobergfe, dzickus, atomlin, cmetcalf, fweisbec, tglx, mingo,
	hpa, x86, peterz, luto

On 12/16/15, Jeff Merkey <linux.mdb@gmail.com> wrote:
> I got to the bottom of it.  It's related to the hardware I am using.
> One of the processors is faulting and hanging due to an existing bug
> in the hw_breakpoint handler not setting the resume flag (I have
> previously reported it and submitted a patch).  This breaks your code,
> but there's nothing you can do about it.
>
> There is a severe bug in hw_breakpoint.c that causes int1 recursion
> and this whole "lazy debug register switching" nonsense does not work
> properly.  I am probably the first person to actually test this code
> path robustly.  I applied the patch that fixes this bug in
> hw_breakpoint.c and the problem with your code firing off and ignoring
> the touch flag went away.
>
> Jeff
>

Wow, I figured it out.  What's really needed here is the ability to
touch all the processors from just one processor.  That's what's
missing.  This per-processor nonsense doesn't fly here.  A debugger
needs to be able to turn off your stuff completely when needed (it
needs an on/off switch).
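
A userspace sketch of what that could look like, under the assumption
that each CPU has a touch flag that is checked before the counter
comparison.  MODEL_NR_CPUS, the global switch, and the model_* helpers
are all hypothetical; real kernel code would also have to cope with
cross-CPU writes racing against the NMI-side check.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 4 /* hypothetical CPU count for the model */

/* Per-CPU watchdog state in the model; names are illustrative only. */
struct model_cpu {
	bool touched;              /* per-CPU touch flag */
	unsigned long ticks;       /* models hrtimer_interrupts */
	unsigned long ticks_saved; /* models hrtimer_interrupts_saved */
};

static struct model_cpu model_cpus[MODEL_NR_CPUS];
static bool model_watchdog_on = true; /* hypothetical global on/off switch */

/* Touch every CPU from whichever CPU is running the debugger. */
static void model_touch_all_cpus(void)
{
	int cpu;

	for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		model_cpus[cpu].touched = true;
}

/* One check on one CPU; returns true if a hard lockup would be reported. */
static bool model_check_cpu(int cpu)
{
	struct model_cpu *c = &model_cpus[cpu];

	if (!model_watchdog_on)
		return false;

	if (c->touched) {
		c->touched = false;
		return false;
	}

	if (c->ticks_saved == c->ticks)
		return true;

	c->ticks_saved = c->ticks;
	return false;
}

/* All CPUs are stalled (equal counters); touch them all, then switch off. */
static void model_example(void)
{
	int cpu;

	model_touch_all_cpus();
	for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		assert(!model_check_cpu(cpu)); /* each touch absorbs one check */

	assert(model_check_cpu(0));            /* untouched and stalled: flagged */

	model_watchdog_on = false;
	assert(!model_check_cpu(0));           /* the switch silences the detector */
}
```

The switch is the simpler of the two for a debugger: a touch has to be
repeated before every check, while the switch stays off until the
debugger releases the processors.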

I'll submit a patch for that.  I'll just maintain the working version
in my patch for the debugger so people can get a working, stable,
debuggable kernel and not a broken one until fixes start showing up in
the tree.

Jeff
