Debugging a hard lockup with no symptoms

linux-rt-users.vger.kernel.org archive mirror

* Debugging a hard lockup with no symptoms
@ 2010-04-14  7:26 Martin Shepherd
  2010-04-15 12:59 ` Thomas Gleixner
  0 siblings, 1 reply; 5+ messages in thread
From: Martin Shepherd @ 2010-04-14  7:26 UTC
  To: linux-rt-users

I have been experiencing hard lockups running a real-time application
under preempt-rt. Having originally had this problem while running
under 2.6.29.4-rt16, today I upgraded to 2.6.31.12-rt21, but the
problem persisted. Under both kernels, the computer simply freezes,
usually after a few hours of otherwise flawless operation. Nothing
appears on the serial console or in the system log when the system
freezes. Unfortunately, turning on the NMI watchdog stops the freezes
from occurring at all, such that I can't force an Oops that way.

I have tried running memtest86 on the RAM, without detecting any
memory errors, and I have verified that the same problem occurs on two
different (but essentially identical) computers.

I wonder whether there might be a clue in the fact that turning on the
NMI watchdog stops the freezes from occurring. Turning on the watchdog
unfortunately turns off tickless mode, which I need. According to the
boot-time messages, tickless is turned off because the local APIC is
non-functional (presumably because the NMI watchdog is using it). What
kind of bugs would be more likely to be seen when running under
tickless?

Could anybody give me any ideas on how to further debug this problem?
I have been trying to figure this out for weeks, but I haven't found
any clues.

In case it is important, the CPU is a 1.8GHz Intel Celeron, on a
Foxconn motherboard with an Intel G31 chipset, and Intel GMA 3100
onboard graphics. I am running the kernel (downloaded from kernel.org)
under Ubuntu 9.10. The computer also hosts two commercial digital I/O
boards, both generating interrupts, and one commercial analog I/O
board.

Thank you,

Martin


* Re: Debugging a hard lockup with no symptoms
  2010-04-14  7:26 Debugging a hard lockup with no symptoms Martin Shepherd
@ 2010-04-15 12:59 ` Thomas Gleixner
  2010-04-15 16:08   ` Martin Shepherd
                     ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Thomas Gleixner @ 2010-04-15 12:59 UTC
  To: Martin Shepherd; +Cc: linux-rt-users

On Wed, 14 Apr 2010, Martin Shepherd wrote:

> I have been experiencing hard lockups running a real-time application
> under preempt-rt. Having originally had this problem while running
> under 2.6.29.4-rt16, today I upgraded to 2.6.31.12-rt21, but the
> problem persisted. Under both kernels, the computer simply freezes,
> usually after a few hours of otherwise flawless operation. Nothing
> appears on the serial console or in the system log when the system
> freezes. Unfortunately, turning on the NMI watchdog stops the freezes
> from occurring at all, such that I can't force an Oops that way.
> 
> I have tried running memtest86 on the RAM, without detecting any
> memory errors, and I have verified that the same problem occurs on two
> different (but essentially identical) computers.
> 
> I wonder whether there might be a clue in the fact that turning on the
> NMI watchdog stops the freezes from occurring. Turning on the watchdog
> unfortunately turns off tickless mode, which I need. According to the
> boot-time messages, tickless is turned off because the local APIC is
> non-functional (presumably because the NMI watchdog is using it). What
> kind of bugs would be more likely to be seen when running under
> tickless?

Can you try nmi_watchdog=1 ? That keeps the tickless mode alive.

> Could anybody give me any ideas on how to further debug this problem?
> I have been trying to figure this out for weeks, but I haven't found
> any clues.
> 
> In case it is important, the CPU is a 1.8GHz Intel Celeron, on a
> Foxconn motherboard with an Intel G31 chipset, and Intel GMA 3100
> onboard graphics. I am running the kernel (downloaded from kernel.org)
> under Ubuntu 9.10. The computer also hosts two commercial digital I/O
> boards, both generating interrupts, and one commercial analog I/O
> board.

Does the problem reproduce when you disable those boards ? Do you have
the source of the drivers ? Did you ever run with lockdep enabled
(CONFIG_PROVE_LOCKING=y) ?

Thanks,

	tglx


* Re: Debugging a hard lockup with no symptoms
  2010-04-15 12:59 ` Thomas Gleixner
@ 2010-04-15 16:08   ` Martin Shepherd
  2010-04-15 19:19   ` Martin Shepherd
  2010-04-15 22:47   ` Martin Shepherd
  2 siblings, 0 replies; 5+ messages in thread
From: Martin Shepherd @ 2010-04-15 16:08 UTC
  To: Thomas Gleixner; +Cc: linux-rt-users

On Thu, 15 Apr 2010, Thomas Gleixner wrote:
> Can you try nmi_watchdog=1 ? That keeps the tickless mode alive.

It was nmi_watchdog=1 that turned off tickless. Perhaps you meant
nmi_watchdog=2? I haven't tried that.

> Does the problem reproduce when you disable those boards ?

The real-time code that apparently causes the lockups is controlling
hardware continuously via those commercial boards. So unfortunately,
if I disable them, then I can't run the code that is causing the
problem. In particular, they provide the hardware interrupts that
drive the code, and servo feedback that determines what the code does
next.

> Do you have the source of the drivers ?

Yes, I wrote my own drivers for these boards. So this ought to be easy
to solve, if I knew what to look for in my code. Yesterday I found one
thing that might be a problem, and I hope to get a chance to test this
today. In one of my two interrupt threads, I was calling
wake_up_interruptible() before writing the PCI registers that clear
the interrupt on the board. When I wrote this, I assumed that the
interrupt handler would always finish before the scheduler came back
into play, but I am wondering whether this is still true with threaded
interrupts? Note that the user-land thread that is woken by the
wake-up runs at the same real-time priority as the interrupt
thread (another mistake?).
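
The ordering question above can be illustrated with a small user-space
model. This is only a sketch under stated assumptions: struct
board_model and the model_*() helpers are hypothetical stand-ins, not
the real driver or any kernel API. model_wake_waiter() plays the role
of wake_up_interruptible(), and model_clear_irq() plays the role of the
PCI register writes that clear the interrupt on the board. In a
threaded handler, the woken task may be scheduled before the handler
reaches its next statement, depending on the priorities involved, so
the model records whether the wake-up was issued while the board was
still asserting its interrupt. It does not claim that this ordering is
what caused the lockups.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one board: not the real driver or kernel API. */
struct board_model {
    bool irq_asserted;       /* board is still asserting its interrupt */
    bool woke_before_clear;  /* wake-up was issued while irq_asserted */
};

/* Stands in for the PCI register writes that clear the interrupt. */
static void model_clear_irq(struct board_model *b)
{
    b->irq_asserted = false;
}

/* Stands in for wake_up_interruptible(): from this point on, the
 * woken task may run before the caller's next statement executes. */
static void model_wake_waiter(struct board_model *b)
{
    if (b->irq_asserted)
        b->woke_before_clear = true;
}

/* Ordering described in the message: wake first, clear afterwards.
 * Returns true if the waiter was woken with the interrupt pending. */
static bool irq_thread_wake_then_clear(struct board_model *b)
{
    assert(b != NULL);
    model_wake_waiter(b);
    model_clear_irq(b);
    return b->woke_before_clear;
}

/* Alternative ordering: clear the board first, then wake the waiter. */
static bool irq_thread_clear_then_wake(struct board_model *b)
{
    assert(b != NULL);
    model_clear_irq(b);
    model_wake_waiter(b);
    return b->woke_before_clear;
}
```

With a board whose interrupt starts out asserted, the first function
reports that the wake-up happened with the interrupt still pending,
and the second reports that it did not. Both leave the interrupt
cleared on return.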

> Did you ever run with lockdep enabled
> (CONFIG_PROVE_LOCKING=y) ?

No. Sorry, I hadn't noticed that option. I will turn it on before
running the code again.

Thank you for your help,

Martin


* Re: Debugging a hard lockup with no symptoms
  2010-04-15 12:59 ` Thomas Gleixner
  2010-04-15 16:08   ` Martin Shepherd
@ 2010-04-15 19:19   ` Martin Shepherd
  2010-04-15 22:47   ` Martin Shepherd
  2 siblings, 0 replies; 5+ messages in thread
From: Martin Shepherd @ 2010-04-15 19:19 UTC
  To: Thomas Gleixner; +Cc: linux-rt-users

On Thu, 15 Apr 2010, Thomas Gleixner wrote:
> Did you ever run with lockdep enabled
> (CONFIG_PROVE_LOCKING=y) ?

I have now recompiled the kernel with the above option turned
on. However, I now get a hard lockup, without any error messages, as
soon as I run my code, instead of after a few hours. Then when I
reboot, most of the object and executable files of my application have
been replaced by empty files. These files had just been compiled
before running the application. So presumably they didn't get flushed
to disk before the crash.

Any ideas? I have tried both with and without nmi_watchdog=1, and the
result was the same; no messages on the serial console or in the
log. Just an immediate hard lockup on running my code.

Martin


* Re: Debugging a hard lockup with no symptoms
  2010-04-15 12:59 ` Thomas Gleixner
  2010-04-15 16:08   ` Martin Shepherd
  2010-04-15 19:19   ` Martin Shepherd
@ 2010-04-15 22:47   ` Martin Shepherd
  2 siblings, 0 replies; 5+ messages in thread
From: Martin Shepherd @ 2010-04-15 22:47 UTC
  To: Thomas Gleixner; +Cc: linux-rt-users

It is now looking as though the lockups may be caused by a problem in
the commercial analog I/O card that I am using, rather than in either
my code or the kernel. The dramatic increase in the frequency of the
lockups today allowed me to associate the lockups with a couple of
outw() instructions that pass values to two digital-to-analog
converters on the card. When I comment out those instructions, or put
a few microseconds of delay between them, then the lockups stop. I
have talked to the manufacturer's tech support, and they have a
suspicion of the cause, and are working on trying to reproduce the
problem.
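
The workaround described above, a short delay between the two DAC
writes, might be sketched as follows. This is a user-space model under
stated assumptions, not the actual driver code: struct port_ops,
dac_write_pair(), and the rec_*() recorders are hypothetical names. In
a real driver the write16 hook would be outw(), which takes the value
first and the port second, and the delay_us hook would be udelay(). The
required gap is not given beyond "a few microseconds", so it is left as
a parameter.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical port-I/O hooks so the sketch runs in user space.
 * In a driver these would be outw() and udelay(). */
struct port_ops {
    void (*write16)(uint16_t value, uint16_t port);
    void (*delay_us)(unsigned int usecs);
};

/* Write one value to each of two DAC ports, with an optional gap
 * between the writes. A gap of 0 gives back-to-back writes. */
static void dac_write_pair(const struct port_ops *ops,
                           uint16_t port_a, uint16_t val_a,
                           uint16_t port_b, uint16_t val_b,
                           unsigned int gap_us)
{
    assert(ops && ops->write16 && ops->delay_us);
    ops->write16(val_a, port_a);
    if (gap_us)
        ops->delay_us(gap_us);
    ops->write16(val_b, port_b);
}

/* Recording stand-ins: log 'W' for a write and 'D' for a delay so
 * the resulting order of operations can be inspected. */
enum { MAX_EVENTS = 8 };
static char events[MAX_EVENTS];
static int n_events;

static void rec_write16(uint16_t value, uint16_t port)
{
    (void)value;
    (void)port;
    if (n_events < MAX_EVENTS)
        events[n_events++] = 'W';
}

static void rec_delay_us(unsigned int usecs)
{
    (void)usecs;
    if (n_events < MAX_EVENTS)
        events[n_events++] = 'D';
}
```

Calling dac_write_pair() with the recording hooks and a non-zero gap
logs a write, a delay, and a second write, in that order. Keeping the
gap as an argument makes it easy to adjust once the manufacturer
confirms what the card actually needs.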

Martin
