* 2.6.30-rc kills my box hard - and lockdep chains
@ 2009-05-14 15:49 Jonathan Corbet
From: Jonathan Corbet @ 2009-05-14 15:49 UTC (permalink / raw)
To: LKML
So...every now and then I return to my system (a dual-core 64-bit
x86 box) only to find it totally dead. Lights are on but there's no
disk activity, no ping responses, no alternative to simply pulling the
plug. It happens fairly reliably about once a day with the 2.6.30-rc
kernels; it does not happen with 2.6.29.
I'm at a bit of a loss for how to try to track this one down. "System
disappears without a trace" isn't much to go on. I can't reproduce it
at will; even the "maintain an unsaved editor buffer with hours' worth
of work" trick doesn't seem to work this time.
One clue might be found here, perhaps: I didn't have lockdep enabled but I do
now. First time gave me:
BUG: MAX_LOCKDEP_ENTRIES too low!
So I doubled it. Now I get:
May 14 01:06:55 bike kernel: [38730.804833] BUG: MAX_LOCKDEP_CHAINS too low!
May 14 01:06:55 bike kernel: [38730.804838] turning off the locking correctness validator.
May 14 01:06:55 bike kernel: [38730.804843] Pid: 5321, comm: tar Tainted: G W 2.6.30-rc5 #11
May 14 01:06:55 bike kernel: [38730.804846] Call Trace:
May 14 01:06:55 bike kernel: [38730.804854] [<ffffffff8025df59>] __lock_acquire+0x57f/0xbc9
May 14 01:06:55 bike kernel: [38730.804860] [<ffffffff8020f3a9>] ? print_context_stack+0xfa/0x119
May 14 01:06:55 bike kernel: [38730.804866] [<ffffffff80394da9>] ? get_hash_bucket+0x28/0x34
May 14 01:06:55 bike kernel: [38730.804870] [<ffffffff8025e660>] lock_acquire+0xbd/0xe1
May 14 01:06:55 bike kernel: [38730.804874] [<ffffffff80394da9>] ? get_hash_bucket+0x28/0x34
May 14 01:06:55 bike kernel: [38730.804880] [<ffffffff804fdc24>] _spin_lock_irqsave+0x3c/0x50
May 14 01:06:55 bike kernel: [38730.804883] [<ffffffff80394da9>] ? get_hash_bucket+0x28/0x34
May 14 01:06:55 bike kernel: [38730.804888] [<ffffffff8025cfe9>] ? trace_hardirqs_on_caller+0x10f/0x154
May 14 01:06:55 bike kernel: [38730.804892] [<ffffffff80394da9>] get_hash_bucket+0x28/0x34
May 14 01:06:55 bike kernel: [38730.804896] [<ffffffff80395509>] add_dma_entry+0x15/0x3b
May 14 01:06:55 bike kernel: [38730.804900] [<ffffffff80395e26>] ? dma_entry_alloc+0xa3/0xab
May 14 01:06:55 bike kernel: [38730.804904] [<ffffffff803960d9>] debug_dma_map_page+0xe4/0xf3
May 14 01:06:55 bike kernel: [38730.804909] [<ffffffff804902d2>] skb_dma_map+0x192/0x230
May 14 01:06:55 bike kernel: [38730.804923] [<ffffffffa005c5b6>] e1000_xmit_frame+0x618/0x901 [e1000e]
May 14 01:06:55 bike kernel: [38730.804928] [<ffffffff804a006d>] ? compat_sys_getsockopt+0x10d/0x1ac
May 14 01:06:55 bike kernel: [38730.804933] [<ffffffff804935ae>] dev_hard_start_xmit+0x228/0x297
May 14 01:06:55 bike kernel: [38730.804937] [<ffffffff804a1aac>] __qdisc_run+0xf2/0x1fd
May 14 01:06:55 bike kernel: [38730.804941] [<ffffffff80493977>] dev_queue_xmit+0x241/0x382
May 14 01:06:55 bike kernel: [38730.804945] [<ffffffff804938a4>] ? dev_queue_xmit+0x16e/0x382
May 14 01:06:55 bike kernel: [38730.804950] [<ffffffff8049b1bd>] neigh_resolve_output+0x210/0x261
May 14 01:06:55 bike kernel: [38730.804954] [<ffffffff8025e594>] ? __lock_acquire+0xbba/0xbc9
May 14 01:06:55 bike kernel: [38730.804959] [<ffffffff804ade6d>] ip_finish_output2+0x1b0/0x1f2
May 14 01:06:55 bike kernel: [38730.804962] [<ffffffff804adf10>] ip_finish_output+0x61/0x63
May 14 01:06:55 bike kernel: [38730.804966] [<ffffffff804adf6c>] ip_output+0x5a/0x5c
May 14 01:06:55 bike kernel: [38730.804970] [<ffffffff804ac504>] dst_output+0x10/0x12
May 14 01:06:55 bike kernel: [38730.804974] [<ffffffff804acc56>] ip_local_out+0x61/0x63
May 14 01:06:55 bike kernel: [38730.804978] [<ffffffff804adb1e>] ip_queue_xmit+0x2c2/0x31a
May 14 01:06:55 bike kernel: [38730.804982] [<ffffffff8025d03b>] ? trace_hardirqs_on+0xd/0xf
May 14 01:06:55 bike kernel: [38730.804986] [<ffffffff8025cb02>] ? mark_lock+0x2d/0x22f
May 14 01:06:55 bike kernel: [38730.804991] [<ffffffff804be219>] tcp_transmit_skb+0x5fe/0x63c
May 14 01:06:55 bike kernel: [38730.804996] [<ffffffff804c0691>] tcp_write_xmit+0x7cd/0x8bd
May 14 01:06:55 bike kernel: [38730.805000] [<ffffffff804c07e6>] __tcp_push_pending_frames+0x2f/0x5d
May 14 01:06:55 bike kernel: [38730.805005] [<ffffffff804b3bab>] tcp_push+0x88/0x8a
May 14 01:06:55 bike kernel: [38730.805009] [<ffffffff804b5cb1>] tcp_sendmsg+0x746/0x841
May 14 01:06:55 bike kernel: [38730.805014] [<ffffffff8048458a>] __sock_sendmsg+0x61/0x6c
May 14 01:06:55 bike kernel: [38730.805018] [<ffffffff80484e2c>] sock_sendmsg+0xcc/0xe5
May 14 01:06:55 bike kernel: [38730.805024] [<ffffffff8024e49d>] ? autoremove_wake_function+0x0/0x39
May 14 01:06:55 bike kernel: [38730.805028] [<ffffffff804fd9cf>] ? _spin_unlock_irq+0x30/0x36
May 14 01:06:55 bike kernel: [38730.805032] [<ffffffff8025cb02>] ? mark_lock+0x2d/0x22f
May 14 01:06:55 bike kernel: [38730.805036] [<ffffffff8025e594>] ? __lock_acquire+0xbba/0xbc9
May 14 01:06:55 bike kernel: [38730.805041] [<ffffffff802433ae>] ? lock_timer_base+0x2c/0x52
May 14 01:06:55 bike kernel: [38730.805046] [<ffffffff804851df>] kernel_sendmsg+0x39/0x4e
May 14 01:06:55 bike kernel: [38730.805064] [<ffffffffa01e04ed>] xs_send_kvec+0x75/0x77 [sunrpc]
May 14 01:06:55 bike kernel: [38730.805068] [<ffffffff8025d003>] ? trace_hardirqs_on_caller+0x129/0x154
May 14 01:06:55 bike kernel: [38730.805083] [<ffffffffa01e0564>] xs_sendpages+0x75/0x1a6 [sunrpc]
May 14 01:06:55 bike kernel: [38730.805098] [<ffffffffa01e079a>] xs_tcp_send_request+0x52/0x149 [sunrpc]
May 14 01:06:55 bike kernel: [38730.805102] [<ffffffff80243885>] ? del_timer_sync+0x0/0x8b
May 14 01:06:55 bike kernel: [38730.805116] [<ffffffffa01df4e9>] xprt_transmit+0x169/0x22d [sunrpc]
May 14 01:06:55 bike kernel: [38730.805129] [<ffffffffa01dcdcd>] call_transmit+0x210/0x251 [sunrpc]
May 14 01:06:55 bike kernel: [38730.805144] [<ffffffffa01e358d>] __rpc_execute+0x8f/0x250 [sunrpc]
May 14 01:06:55 bike kernel: [38730.805160] [<ffffffffa01e3774>] rpc_execute+0x26/0x2a [sunrpc]
May 14 01:06:55 bike kernel: [38730.805173] [<ffffffffa01dd6c4>] rpc_run_task+0x55/0x5d [sunrpc]
May 14 01:06:55 bike kernel: [38730.805185] [<ffffffffa01dd7b9>] rpc_call_sync+0x44/0x62 [sunrpc]
May 14 01:06:55 bike kernel: [38730.805201] [<ffffffffa02394fb>] nfs3_rpc_wrapper.clone.0+0x1f/0x58 [nfs]
May 14 01:06:55 bike kernel: [38730.805215] [<ffffffffa023a0c0>] nfs3_proc_readdir+0xf6/0x137 [nfs]
May 14 01:06:55 bike kernel: [38730.805219] [<ffffffff8025cb02>] ? mark_lock+0x2d/0x22f
May 14 01:06:55 bike kernel: [38730.805223] [<ffffffff8025e594>] ? __lock_acquire+0xbba/0xbc9
May 14 01:06:55 bike kernel: [38730.805228] [<ffffffff8025cb02>] ? mark_lock+0x2d/0x22f
May 14 01:06:55 bike kernel: [38730.805232] [<ffffffff804fd9cf>] ? _spin_unlock_irq+0x30/0x36
May 14 01:06:55 bike kernel: [38730.805236] [<ffffffff8025d03b>] ? trace_hardirqs_on+0xd/0xf
May 14 01:06:55 bike kernel: [38730.805247] [<ffffffffa022ac9a>] nfs_readdir_filler+0xc9/0x172 [nfs]
May 14 01:06:55 bike kernel: [38730.805252] [<ffffffff8028c989>] ? find_get_page+0x0/0xc5
May 14 01:06:55 bike kernel: [38730.805263] [<ffffffffa022abd1>] ? nfs_readdir_filler+0x0/0x172 [nfs]
May 14 01:06:55 bike kernel: [38730.805267] [<ffffffff8028d096>] read_cache_page_async+0x9a/0x129
May 14 01:06:55 bike kernel: [38730.805271] [<ffffffff8028d137>] read_cache_page+0x12/0x49
May 14 01:06:55 bike kernel: [38730.805282] [<ffffffffa022bd18>] nfs_readdir+0x259/0x738 [nfs]
May 14 01:06:55 bike kernel: [38730.805294] [<ffffffffa022f50c>] ? nfs_open+0x15f/0x19b [nfs]
May 14 01:06:55 bike kernel: [38730.805299] [<ffffffff802c2741>] ? filldir+0x0/0xc4
May 14 01:06:55 bike kernel: [38730.805304] [<ffffffff8025e594>] ? __lock_acquire+0xbba/0xbc9
May 14 01:06:55 bike kernel: [38730.805308] [<ffffffff8025cd56>] ? mark_held_locks+0x52/0x70
May 14 01:06:55 bike kernel: [38730.805312] [<ffffffff804fc430>] ? __mutex_lock_common+0x2e1/0x336
May 14 01:06:55 bike kernel: [38730.805317] [<ffffffff8025d003>] ? trace_hardirqs_on_caller+0x129/0x154
May 14 01:06:55 bike kernel: [38730.805331] [<ffffffffa023bb89>] ? nfs3_decode_dirent+0x0/0x137 [nfs]
May 14 01:06:55 bike kernel: [38730.805336] [<ffffffff802c2741>] ? filldir+0x0/0xc4
May 14 01:06:55 bike kernel: [38730.805340] [<ffffffff802c2741>] ? filldir+0x0/0xc4
May 14 01:06:55 bike kernel: [38730.805344] [<ffffffff802c293d>] vfs_readdir+0x79/0xb6
May 14 01:06:55 bike kernel: [38730.805348] [<ffffffff802c2ac3>] sys_getdents+0x81/0xd1
May 14 01:06:55 bike kernel: [38730.805353] [<ffffffff8020bcdb>] system_call_fastpath+0x16/0x1b
That's quite the call stack... and, evidently, a lot of lock chains...
Suggestions, anybody?
Thanks,
jon
* Re: 2.6.30-rc kills my box hard - and lockdep chains
From: Peter Zijlstra @ 2009-05-15 5:32 UTC (permalink / raw)
To: Jonathan Corbet; +Cc: LKML, Ingo Molnar
On Thu, 2009-05-14 at 09:49 -0600, Jonathan Corbet wrote:
> First time gave me:
>
> BUG: MAX_LOCKDEP_ENTRIES too low!
>
> So I doubled it. Now I get:
>
> May 14 01:06:55 bike kernel: [38730.804833] BUG: MAX_LOCKDEP_CHAINS
> too low!
Ingo recently raised these numbers, from the -tip tree:
commit d80c19df5fcceb8c741e96f09f275c2da719efef
Author: Ingo Molnar <mingo@elte.hu>
Date: Tue May 12 16:29:13 2009 +0200
lockdep: increase MAX_LOCKDEP_ENTRIES and MAX_LOCKDEP_CHAINS
This will likely make its way upstream soonish.
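[Editor's note: those limits are compile-time constants in kernel/lockdep_internals.h. A doubling along the lines Jon describes looks roughly like this; the values shown are illustrative of that era's defaults, not a quote of Ingo's commit.]

```c
/* kernel/lockdep_internals.h (sketch; exact values may differ per tree) */
#define MAX_LOCKDEP_ENTRIES     16384UL  /* doubled from 8192UL */

#define MAX_LOCKDEP_CHAINS_BITS 15       /* one extra bit doubles the chains */
#define MAX_LOCKDEP_CHAINS      (1UL << MAX_LOCKDEP_CHAINS_BITS)
```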
* Re: 2.6.30-rc kills my box hard - and lockdep chains
From: Andrew Morton @ 2009-05-16 23:03 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Jonathan Corbet, LKML, Ingo Molnar
On Fri, 15 May 2009 07:32:35 +0200 Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, 2009-05-14 at 09:49 -0600, Jonathan Corbet wrote:
> > First time gave me:
> >
> > BUG: MAX_LOCKDEP_ENTRIES too low!
> >
> > So I doubled it. Now I get:
> >
> > May 14 01:06:55 bike kernel: [38730.804833] BUG: MAX_LOCKDEP_CHAINS
> > too low!
>
> Ingo recently raised these numbers, from the -tip tree:
>
> commit d80c19df5fcceb8c741e96f09f275c2da719efef
> Author: Ingo Molnar <mingo@elte.hu>
> Date: Tue May 12 16:29:13 2009 +0200
>
> lockdep: increase MAX_LOCKDEP_ENTRIES and MAX_LOCKDEP_CHAINS
>
>
> This will likely make its way upstream soonish.
That patch isn't in linux-next and I can't be bothered going off and
finding it.
Jon said that doubling MAX_LOCKDEP_CHAINS didn't help. How much did
the above patch increase things by?
* Re: 2.6.30-rc kills my box hard - and lockdep chains
From: Jonathan Corbet @ 2009-05-16 23:13 UTC (permalink / raw)
To: Andrew Morton; +Cc: Peter Zijlstra, LKML, Ingo Molnar
On Sat, 16 May 2009 16:03:45 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:
> Jon said that doubling MAX_LOCKDEP_CHAINS didn't help. How much did
> the above patch increase things by?
I doubled MAX_LOCKDEP_ENTRIES, but hadn't yet gotten around to trying
MAX_LOCKDEP_CHAINS. It felt like I was going into territory where
nobody else was complaining, so raising the limits further seemed
like it might just paper over something else. I've not had a chance
to go back and try with both parameters increased; will do.
jon
* Re: 2.6.30-rc kills my box hard - and lockdep chains
From: Andrew Morton @ 2009-05-16 23:14 UTC (permalink / raw)
To: Jonathan Corbet; +Cc: LKML
On Thu, 14 May 2009 09:49:51 -0600 Jonathan Corbet <corbet@lwn.net> wrote:
> So...every now and then I return to my system (a dual-core 64-bit
> x86 box) only to find it totally dead. Lights are on but there's no
> disk activity, no ping responses, no alternative to simply pulling the
> plug. It happens fairly reliably about once a day with the 2.6.30-rc
> kernels; it does not happen with 2.6.29.
>
> I'm at a bit of a loss for how to try to track this one down. "System
> disappears without a trace" isn't much to go on. I can't reproduce it
> at will; even the "maintain an unsaved editor buffer with hours' worth
> of work" trick doesn't seem to work this time.
>
> One clue might be found here, perhaps: I didn't have lockdep enabled but I do
> now.
So the lockup isn't due to lockdep.
Did you try all the usual sysrq-P, nmi-watchdog stuff?
Is netconsole enabled, to see if it squawked as it died?
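[Editor's note: the pieces Andrew lists are set up roughly as below. The addresses, interface name and MAC are placeholders; the netconsole parameter format follows that era's Documentation/networking/netconsole.txt.]

```shell
# Make sure magic sysrq is enabled, so sysrq-p (registers) and
# sysrq-t (tasks) have a chance of working from the console:
echo 1 > /proc/sys/kernel/sysrq

# Enable the NMI watchdog at boot (x86 kernel command line):
#     nmi_watchdog=1

# Load netconsole, logging to a listener on another machine
# (IPs, port and MAC below are placeholders):
modprobe netconsole netconsole=@/eth0,6666@192.168.1.2/00:11:22:33:44:55

# On the receiving machine, something like:
#     nc -u -l 6666
```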
> May 14 01:06:55 bike kernel: [38730.804833] BUG: MAX_LOCKDEP_CHAINS too low!
> May 14 01:06:55 bike kernel: [38730.804838] turning off the locking correctness validator.
> May 14 01:06:55 bike kernel: [38730.804843] Pid: 5321, comm: tar Tainted: G W 2.6.30-rc5 #11
> May 14 01:06:55 bike kernel: [38730.804846] Call Trace:
> May 14 01:06:55 bike kernel: [38730.804854] [<ffffffff8025df59>] __lock_acquire+0x57f/0xbc9
> May 14 01:06:55 bike kernel: [38730.804860] [<ffffffff8020f3a9>] ? print_context_stack+0xfa/0x119
> May 14 01:06:55 bike kernel: [38730.804866] [<ffffffff80394da9>] ? get_hash_bucket+0x28/0x34
>
> ...
>
> May 14 01:06:55 bike kernel: [38730.805340] [<ffffffff802c2741>] ? filldir+0x0/0xc4
> May 14 01:06:55 bike kernel: [38730.805344] [<ffffffff802c293d>] vfs_readdir+0x79/0xb6
> May 14 01:06:55 bike kernel: [38730.805348] [<ffffffff802c2ac3>] sys_getdents+0x81/0xd1
> May 14 01:06:55 bike kernel: [38730.805353] [<ffffffff8020bcdb>] system_call_fastpath+0x16/0x1b
>
> That's quite the call stack... and, evidently, a lot of lock chains...
It is a deep stack trace.
And unfortunately
a) that diagnostic didn't print the stack pointer value, from which
we can often work out if we're looking at a stack overflow.
b) I regularly think it would be useful if that stack backtrace were
to print out the actual stack address, so we could see how much
stack each function is using.
I just went in to hack these things up, but the x86 stacktrace
code which I used to understand has become stupidly complex so I
gave up.
What tools do we have to diagnose a possible kernel stack overflow?
There's CONFIG_DEBUG_STACK_USAGE but that's unlikely to be much use.
* Re: 2.6.30-rc kills my box hard - and lockdep chains
From: Frederic Weisbecker @ 2009-05-17 0:38 UTC (permalink / raw)
To: Andrew Morton; +Cc: Jonathan Corbet, LKML
On Sat, May 16, 2009 at 04:14:19PM -0700, Andrew Morton wrote:
> On Thu, 14 May 2009 09:49:51 -0600 Jonathan Corbet <corbet@lwn.net> wrote:
>
> > So...every now and then I return to my system (a dual-core 64-bit
> > x86 box) only to find it totally dead. Lights are on but there's no
> > disk activity, no ping responses, no alternative to simply pulling the
> > plug. It happens fairly reliably about once a day with the 2.6.30-rc
> > kernels; it does not happen with 2.6.29.
> >
> > I'm at a bit of a loss for how to try to track this one down. "System
> > disappears without a trace" isn't much to go on. I can't reproduce it
> > at will; even the "maintain an unsaved editor buffer with hours' worth
> > of work" trick doesn't seem to work this time.
> >
> > One clue might be found here, perhaps: I didn't have lockdep enabled but I do
> > now.
>
> So the lockup isn't due to lockdep.
>
> Did you try all the usual sysrq-P, nmi-watchdog stuff?
>
> Is netconsole enabled, to see if it squawked as it died?
>
> > May 14 01:06:55 bike kernel: [38730.804833] BUG: MAX_LOCKDEP_CHAINS too low!
> > May 14 01:06:55 bike kernel: [38730.804838] turning off the locking correctness validator.
> > May 14 01:06:55 bike kernel: [38730.804843] Pid: 5321, comm: tar Tainted: G W 2.6.30-rc5 #11
> > May 14 01:06:55 bike kernel: [38730.804846] Call Trace:
> > May 14 01:06:55 bike kernel: [38730.804854] [<ffffffff8025df59>] __lock_acquire+0x57f/0xbc9
> > May 14 01:06:55 bike kernel: [38730.804860] [<ffffffff8020f3a9>] ? print_context_stack+0xfa/0x119
> > May 14 01:06:55 bike kernel: [38730.804866] [<ffffffff80394da9>] ? get_hash_bucket+0x28/0x34
> >
> > ...
> >
> > May 14 01:06:55 bike kernel: [38730.805340] [<ffffffff802c2741>] ? filldir+0x0/0xc4
> > May 14 01:06:55 bike kernel: [38730.805344] [<ffffffff802c293d>] vfs_readdir+0x79/0xb6
> > May 14 01:06:55 bike kernel: [38730.805348] [<ffffffff802c2ac3>] sys_getdents+0x81/0xd1
> > May 14 01:06:55 bike kernel: [38730.805353] [<ffffffff8020bcdb>] system_call_fastpath+0x16/0x1b
> >
> > That's quite the call stack... and, evidently, a lot of lock chains...
>
> It is a deep stack trace.
>
> And unfortunately
>
> a) that diagnostic didn't print the stack pointer value, from which
> we can often work out if we're looking at a stack overflow.
>
> b) I regularly think it would be useful if that stack backtrace were
> to print out the actual stack address, so we could see how much
> stack each function is using.
>
> I just went in to hack these things up, but the x86 stacktrace
> code which I used to understand has become stupidly complex so I
> gave up.
>
> What tools do we have to diagnose a possible kernel stack overflow?
> There's CONFIG_DEBUG_STACK_USAGE but that's unlikely to be much use.
I'm thinking of CONFIG_STACK_TRACER. Currently this tracer only
dumps the maximum stack footprint backtrace through a file in debugfs,
so it's not that useful for debugging a stack overflow.
I'm trying to hack up a printk dump for each new maximum stack
footprint encountered. Hopefully that could help debug this.
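[Editor's note: the behaviour being described is a simple high-watermark check, sketched here in Python with hypothetical names; nothing below is the kernel's actual code.]

```python
# Sketch of the idea: a backtrace is "dumped" only when the current
# stack footprint exceeds the running maximum (illustrative names).

class StackWatermark:
    def __init__(self):
        self.max_size = 0   # deepest footprint seen so far, in bytes
        self.dumps = []     # traces recorded at each new maximum

    def check_stack(self, this_size, trace):
        """Record and 'dump' trace only on a new maximum footprint."""
        if this_size <= self.max_size:
            return False    # common case: nothing printed
        self.max_size = this_size
        self.dumps.append((this_size, trace))
        return True

wm = StackWatermark()
wm.check_stack(1024, "trace-a")   # new max: dumped
wm.check_stack(512, "trace-b")    # smaller: silent
wm.check_stack(4096, "trace-c")   # new max: dumped
```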
Frederic.
* Re: 2.6.30-rc kills my box hard - and lockdep chains
From: Frederic Weisbecker @ 2009-05-17 1:23 UTC (permalink / raw)
To: Jonathan Corbet; +Cc: LKML, Andrew Morton, Ingo Molnar, Steven Rostedt
On Sun, May 17, 2009 at 02:38:39AM +0200, Frederic Weisbecker wrote:
> On Sat, May 16, 2009 at 04:14:19PM -0700, Andrew Morton wrote:
> > On Thu, 14 May 2009 09:49:51 -0600 Jonathan Corbet <corbet@lwn.net> wrote:
> >
> > > So...every now and then I return to my system (a dual-core 64-bit
> > > x86 box) only to find it totally dead. Lights are on but there's no
> > > disk activity, no ping responses, no alternative to simply pulling the
> > > plug. It happens fairly reliably about once a day with the 2.6.30-rc
> > > kernels; it does not happen with 2.6.29.
> > >
> > > I'm at a bit of a loss for how to try to track this one down. "System
> > > disappears without a trace" isn't much to go on. I can't reproduce it
> > > at will; even the "maintain an unsaved editor buffer with hours' worth
> > > of work" trick doesn't seem to work this time.
> > >
> > > One clue might be found here, perhaps: I didn't have lockdep enabled but I do
> > > now.
> >
> > So the lockup isn't due to lockdep.
> >
> > Did you try all the usual sysrq-P, nmi-watchdog stuff?
> >
> > Is netconsole enabled, to see if it squawked as it died?
> >
> > > May 14 01:06:55 bike kernel: [38730.804833] BUG: MAX_LOCKDEP_CHAINS too low!
> > > May 14 01:06:55 bike kernel: [38730.804838] turning off the locking correctness validator.
> > > May 14 01:06:55 bike kernel: [38730.804843] Pid: 5321, comm: tar Tainted: G W 2.6.30-rc5 #11
> > > May 14 01:06:55 bike kernel: [38730.804846] Call Trace:
> > > May 14 01:06:55 bike kernel: [38730.804854] [<ffffffff8025df59>] __lock_acquire+0x57f/0xbc9
> > > May 14 01:06:55 bike kernel: [38730.804860] [<ffffffff8020f3a9>] ? print_context_stack+0xfa/0x119
> > > May 14 01:06:55 bike kernel: [38730.804866] [<ffffffff80394da9>] ? get_hash_bucket+0x28/0x34
> > >
> > > ...
> > >
> > > May 14 01:06:55 bike kernel: [38730.805340] [<ffffffff802c2741>] ? filldir+0x0/0xc4
> > > May 14 01:06:55 bike kernel: [38730.805344] [<ffffffff802c293d>] vfs_readdir+0x79/0xb6
> > > May 14 01:06:55 bike kernel: [38730.805348] [<ffffffff802c2ac3>] sys_getdents+0x81/0xd1
> > > May 14 01:06:55 bike kernel: [38730.805353] [<ffffffff8020bcdb>] system_call_fastpath+0x16/0x1b
> > >
> > > That's quite the call stack... and, evidently, a lot of lock chains...
> >
> > It is a deep stack trace.
> >
> > And unfortunately
> >
> > a) that diagnostic didn't print the stack pointer value, from which
> > we can often work out if we're looking at a stack overflow.
> >
> > b) I regularly think it would be useful if that stack backtrace were
> > to print out the actual stack address, so we could see how much
> > stack each function is using.
> >
> > I just went in to hack these things up, but the x86 stacktrace
> > code which I used to understand has become stupidly complex so I
> > gave up.
> >
> > What tools do we have to diagnose a possible kernel stack overflow?
> > There's CONFIG_DEBUG_STACK_USAGE but that's unlikely to be much use.
>
>
> I'm thinking of CONFIG_STACK_TRACER. Currently this tracer only
> dumps the maximum stack footprint backtrace through a file in debugfs,
> so it's not that useful for debugging a stack overflow.
>
> I'm trying to hack up a printk dump for each new maximum stack
> footprint encountered. Hopefully that could help debug this.
>
> Frederic.
>
Jonathan, could you try the following patch?
It will dump a stack trace each time a new maximum stack usage is hit.
If this is a stack overflow, seeing the last maximum stack usage
before the crash could be helpful.
You'll need CONFIG_STACK_TRACER, the "stacktrace" boot parameter
and some luck...
diff --git a/kernel/trace/trace_stack.c b/kernel/trace/trace_stack.c
index c750f65..fbbe312 100644
--- a/kernel/trace/trace_stack.c
+++ b/kernel/trace/trace_stack.c
@@ -67,6 +67,9 @@ static inline void check_stack(void)
save_stack_trace(&max_stack_trace);
+ printk("New max stack usage:\n");
+ print_stack_trace(&max_stack_trace, 1);
+
/*
* Now find where in the stack these are.
*/
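[Editor's note: with CONFIG_STACK_TRACER and the "stacktrace" boot parameter, the running maximum can also be inspected by hand through debugfs; the paths below assume debugfs is mounted at /sys/kernel/debug.]

```shell
# Kernel command line: enable the stack tracer at boot
#     stacktrace

mount -t debugfs nodev /sys/kernel/debug 2>/dev/null

# Deepest stack footprint seen so far, in bytes:
cat /sys/kernel/debug/tracing/stack_max_size
# The backtrace recorded at that maximum:
cat /sys/kernel/debug/tracing/stack_trace
```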
* Re: 2.6.30-rc kills my box hard - and lockdep chains
From: Ming Lei @ 2009-05-17 1:48 UTC (permalink / raw)
To: Andrew Morton; +Cc: Jonathan Corbet, LKML
2009/5/17 Andrew Morton <akpm@linux-foundation.org>:
> It is a deep stack trace.
>
> And unfortunately
>
> a) that diagnostic didn't print the stack pointer value, from which
> we can often work out if we're looking at a stack overflow.
>
> b) I regularly think it would be useful if that stack backtrace were
> to print out the actual stack address, so we could see how much
> stack each function is using.
I have the same feeling, because lockdep uses many recursive
functions, which can easily cause a stack overflow.
>
> I just went in to hack these things up, but the x86 stacktrace
> code which I used to understand has become stupidly complex so I
> gave up.
>
> What tools do we have to diagnose a possible kernel stack overflow?
> There's CONFIG_DEBUG_STACK_USAGE but that's unlikely to be much use.
--
Lei Ming
* Re: 2.6.30-rc kills my box hard - and lockdep chains
From: Jonathan Corbet @ 2009-05-18 19:02 UTC (permalink / raw)
To: Andrew Morton; +Cc: LKML
On Sat, 16 May 2009 16:14:19 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:
> So the lockup isn't due to lockdep.
It is not; I turned on lockdep in an attempt to track down the problem.
> Did you try all the usual sysrq-P, nmi-watchdog stuff?
Yep. But I've never been able to get a response out of the system when
this kind of thing happens.
> Is netconsole enabled, to see if it squawked as it died?
That I do not have; worth trying.
One interesting thing of note: I was reliably getting about one lockup
per day. Until I turned on lockdep. It's been 4-5 days now with
no repeats of the problem. I have no idea if it's coincidental or if
there is a real correlation there.
jon