public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* NFSv4 poops itself
@ 2007-07-27 12:53 Jeff Garzik
  2007-07-27 13:33 ` Trond Myklebust
  2007-07-27 13:36 ` Marc Dietrich
  0 siblings, 2 replies; 15+ messages in thread
From: Jeff Garzik @ 2007-07-27 12:53 UTC (permalink / raw)
  To: Trond Myklebust, Linux Kernel Mailing List
  Cc: Andrew Morton, Michal Piotrowski

Background:

Server: x86-64 dual core Intel, kernel 2.6.23-rc1 (my home fileserver)
	Exporting NFS/NFSv4 mounts.  Client count: 1  Uptime: 4 days

Client: x86-64 dual core Intel, kernel 2.6.23-rc1 (my main workstation)
	NFS mount setup:
	pretzel:/ on /g type nfs4 (rw,noatime,proto=tcp,addr=10.10.10.1)
	Uptime: 4 days

	Home directory mounted via NFSv4.

Problem:

My workstation has been happily talking to my file server for several 
days without incident.  An hour ago, my numeric keypad stopping working 
(unrelated problem... USB or X bug?).  The solution to the keypad 
problem is usually to log out of X and log back in, or worse case, reboot.

So, I log out, and log back in.  At first, a few shell windows open and 
successfully initialize themselves (read bash profile over NFS, etc.) 
Then, as more shell windows open, things start hanging.  I can easily 
switch to console and ssh to the fileserver, so it is clear this is an 
NFS hang.

No adverse messages at all on the client.

On the server, I see NFSv4 spamming dmesg with hundreds of thousands of 
messages:

Jul 27 08:20:53 pretzel kernel: NFSD: preprocess_seqid_op: old stateid!
Jul 27 08:21:24 pretzel last message repeated 167966 times
Jul 27 08:21:55 pretzel last message repeated 173628 times
Jul 27 08:21:55 pretzel kernel: NFSD: preprocess_seqid_op: old stateid!
Jul 27 08:22:26 pretzel last message repeated 171286 times
Jul 27 08:23:27 pretzel last message repeated 344461 times
Jul 27 08:23:30 pretzel last message repeated 18656 times

I rebooted the client, the problem disappeared, and everything is happy 
again...  but clearly NFSv4 shat itself.  And now I am worried this will 
happen again.

In all my quite-heavy use of NFSv4 I've never seen this behavior before, 
  so I would call this a regression.

I always run vanilla linux-2.6.git self-built kernels on both client and 
server.

	Jeff



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: NFSv4 poops itself
  2007-07-27 12:53 NFSv4 poops itself Jeff Garzik
@ 2007-07-27 13:33 ` Trond Myklebust
  2007-07-27 17:09   ` Trond Myklebust
  2007-07-27 13:36 ` Marc Dietrich
  1 sibling, 1 reply; 15+ messages in thread
From: Trond Myklebust @ 2007-07-27 13:33 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Linux Kernel Mailing List, Andrew Morton, Michal Piotrowski

On Fri, 2007-07-27 at 08:53 -0400, Jeff Garzik wrote:
> Background:
> 
> Server: x86-64 dual core Intel, kernel 2.6.23-rc1 (my home fileserver)
> 	Exporting NFS/NFSv4 mounts.  Client count: 1  Uptime: 4 days
> 
> Client: x86-64 dual core Intel, kernel 2.6.23-rc1 (my main workstation)
> 	NFS mount setup:
> 	pretzel:/ on /g type nfs4 (rw,noatime,proto=tcp,addr=10.10.10.1)
> 	Uptime: 4 days
> 
> 	Home directory mounted via NFSv4.
> 
> Problem:
> 
> My workstation has been happily talking to my file server for several 
> days without incident.  An hour ago, my numeric keypad stopping working 
> (unrelated problem... USB or X bug?).  The solution to the keypad 
> problem is usually to log out of X and log back in, or worse case, reboot.
> 
> So, I log out, and log back in.  At first, a few shell windows open and 
> successfully initialize themselves (read bash profile over NFS, etc.) 
> Then, as more shell windows open, things start hanging.  I can easily 
> switch to console and ssh to the fileserver, so it is clear this is an 
> NFS hang.
> 
> No adverse messages at all on the client.
> 
> On the server, I see NFSv4 spamming dmesg with hundreds of thousands of 
> messages:
> 
> Jul 27 08:20:53 pretzel kernel: NFSD: preprocess_seqid_op: old stateid!
> Jul 27 08:21:24 pretzel last message repeated 167966 times
> Jul 27 08:21:55 pretzel last message repeated 173628 times
> Jul 27 08:21:55 pretzel kernel: NFSD: preprocess_seqid_op: old stateid!
> Jul 27 08:22:26 pretzel last message repeated 171286 times
> Jul 27 08:23:27 pretzel last message repeated 344461 times
> Jul 27 08:23:30 pretzel last message repeated 18656 times
> 
> I rebooted the client, the problem disappeared, and everything is happy 
> again...  but clearly NFSv4 shat itself.  And now I am worried this will 
> happen again.
> 
> In all my quite-heavy use of NFSv4 I've never seen this behavior before, 
>   so I would call this a regression.

Yup. Bruce has reported the same bug so it is under investigation. I'll
keep you posted when I think we have a fix.

Cheers
  Trond

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: NFSv4 poops itself
  2007-07-27 12:53 NFSv4 poops itself Jeff Garzik
  2007-07-27 13:33 ` Trond Myklebust
@ 2007-07-27 13:36 ` Marc Dietrich
  2007-07-27 13:40   ` Jeff Garzik
  2007-07-27 14:58   ` Satyam Sharma
  1 sibling, 2 replies; 15+ messages in thread
From: Marc Dietrich @ 2007-07-27 13:36 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: kernel list


Hi,

Am Friday 27 July 2007 14:53 schrieben Sie:
> Background:
>
> Server: x86-64 dual core Intel, kernel 2.6.23-rc1 (my home fileserver)
> 	Exporting NFS/NFSv4 mounts.  Client count: 1  Uptime: 4 days
>
> Client: x86-64 dual core Intel, kernel 2.6.23-rc1 (my main workstation)
> 	NFS mount setup:
> 	pretzel:/ on /g type nfs4 (rw,noatime,proto=tcp,addr=10.10.10.1)
> 	Uptime: 4 days
>
> 	Home directory mounted via NFSv4.
>
> Problem:
>
> My workstation has been happily talking to my file server for several
> days without incident.  An hour ago, my numeric keypad stopping working
> (unrelated problem... USB or X bug?).  The solution to the keypad
> problem is usually to log out of X and log back in, or worse case, reboot.
>
> So, I log out, and log back in.  At first, a few shell windows open and
> successfully initialize themselves (read bash profile over NFS, etc.)
> Then, as more shell windows open, things start hanging.  I can easily
> switch to console and ssh to the fileserver, so it is clear this is an
> NFS hang.
>
> No adverse messages at all on the client.
>
> On the server, I see NFSv4 spamming dmesg with hundreds of thousands of
> messages:
>
> Jul 27 08:20:53 pretzel kernel: NFSD: preprocess_seqid_op: old stateid!
> Jul 27 08:21:24 pretzel last message repeated 167966 times
> Jul 27 08:21:55 pretzel last message repeated 173628 times
> Jul 27 08:21:55 pretzel kernel: NFSD: preprocess_seqid_op: old stateid!
> Jul 27 08:22:26 pretzel last message repeated 171286 times
> Jul 27 08:23:27 pretzel last message repeated 344461 times
> Jul 27 08:23:30 pretzel last message repeated 18656 times
>
> I rebooted the client, the problem disappeared, and everything is happy
> again...  but clearly NFSv4 shat itself.  And now I am worried this will
> happen again.
>
> In all my quite-heavy use of NFSv4 I've never seen this behavior before,
>   so I would call this a regression.
>
> I always run vanilla linux-2.6.git self-built kernels on both client and
> server.

me too, my server has 2.6.18-? (openSUSE 10.2). On the client 
(2.6.23-rc1-mm1), I also see (shortly before the hang)

Jul 26 13:09:19 fb07-iapwap2 kernel: =================================
Jul 26 13:09:19 fb07-iapwap2 kernel: [ INFO: inconsistent lock state ]
Jul 26 13:09:19 fb07-iapwap2 kernel: 2.6.23-rc1-mm1 #1
Jul 26 13:09:19 fb07-iapwap2 kernel: ---------------------------------
Jul 26 13:09:19 fb07-iapwap2 kernel: inconsistent {softirq-on-W} -> 
{in-softirq-W} usage.
Jul 26 13:09:19 fb07-iapwap2 kernel: hald/3873 [HC0[0]:SC1[1]:HE1:SE0] takes:
Jul 26 13:09:19 fb07-iapwap2 kernel:  (rpc_credcache_lock){-+..}, at: 
[<c01dc166>] _atomic_dec_and_lock+0x16/0x60
Jul 26 13:09:19 fb07-iapwap2 kernel: {softirq-on-W} state was registered at:
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c013d4f7>] mark_lock+0x77/0x630
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c013c094>] add_lock_to_list+0x44/0xc0
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c013e8af>] 
__lock_acquire+0x65f/0x1020
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c013db0e>] mark_held_locks+0x5e/0x80
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c012448d>] local_bh_enable+0x7d/0x130
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c013f2cf>] lock_acquire+0x5f/0x80
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c01dc166>] 
_atomic_dec_and_lock+0x16/0x60
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c02d156a>] _spin_lock+0x2a/0x40
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c01dc166>] 
_atomic_dec_and_lock+0x16/0x60
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c01dc166>] 
_atomic_dec_and_lock+0x16/0x60
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c02d156a>] _spin_lock+0x2a/0x40
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<dcecf770>] put_rpccred+0x60/0x110 
[sunrpc]
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<dcecf840>] 
rpcauth_unbindcred+0x20/0x60 [sunrpc]
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<dcece1f4>] rpc_put_task+0x44/0xb0 
[sunrpc]
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<dcec8ffd>] rpc_call_sync+0x2d/0x40 
[sunrpc]
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<dced680d>] rpcb_register+0x10d/0x1c0 
[sunrpc]
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<dced06ef>] svc_register+0x8f/0x160 
[sunrpc]
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c013dca5>] 
trace_hardirqs_on+0xc5/0x170
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<dced0fc5>] __svc_create+0x1d5/0x200 
[sunrpc]
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<dced1160>] svc_create+0x10/0x20 
[sunrpc]
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<dcf69602>] nfs_callback_up+0x82/0x120 
[nfs]
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<dcf43efd>] nfs_get_client+0x17d/0x390 
[nfs]
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c0175f59>] kmem_cache_alloc+0x79/0xd0
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c013dca5>] 
trace_hardirqs_on+0xc5/0x170
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<dcf44144>] nfs4_set_client+0x34/0x1a0 
[nfs]
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<dcf448d1>] 
nfs4_create_server+0x61/0x3c0 [nfs]
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c017657f>] __kmalloc+0xdf/0x120
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c013dca5>] 
trace_hardirqs_on+0xc5/0x170
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c013e987>] 
__lock_acquire+0x737/0x1020
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c01e22cd>] copy_from_user+0x3d/0x80
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c01636e2>] strndup_user+0x62/0x80
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<dcf4d647>] nfs4_get_sb+0x3c7/0x610 
[nfs]
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c013e987>] 
__lock_acquire+0x737/0x1020
Jul 26Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c015d536>] 
get_page_from_freelist+0x316/0x460
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c013dca5>] 
trace_hardirqs_on+0xc5/0x170
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c015e0be>] __alloc_pages+0x4e/0x370
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c017b380>] vfs_kern_mount+0xa0/0x120
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c0190b1d>] do_mount+0x21d/0x840
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c01667c7>] 
handle_mm_fault+0x4a7/0x690
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c02d1494>] _spin_unlock+0x14/0x20
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c0166773>] 
handle_mm_fault+0x453/0x690
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c0135c24>] up_read+0x14/0x30
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c0116ddd>] do_page_fault+0x1ed/0x6e0
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c018f597>] 
copy_mount_options+0xb7/0x150
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c01911b2>] sys_mount+0x72/0xb0
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c01040b2>] syscall_call+0x7/0xb
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<ffffffff>] 0xffffffff
Jul 26 13:09:19 fb07-iapwap2 kernel: irq event stamp: 240896
Jul 26 13:09:19 fb07-iapwap2 kernel: hardirqs last  enabled at (240896): 
[<c017592e>] kmem_cache_free+0x7e/0xb0
Jul 26 13:09:19 fb07-iapwap2 kernel: hardirqs last disabled at (240895): 
[<c01758fc>] kmem_cache_free+0x4c/0xb0
Jul 26 13:09:19 fb07-iapwap2 kernel: softirqs last  enabled at (240446): 
[<c0124255>] do_softirq+0x45/0x50
Jul 26 13:09:19 fb07-iapwap2 kernel: softirqs last disabled at (240887): 
[<c0124255>] do_softirq+0x45/0x50
Jul 26 13:09:19 fb07-iapwap2 kernel:
Jul 26 13:09:19 fb07-iapwap2 kernel: other info that might help us debug this:
Jul 26 13:09:19 fb07-iapwap2 kernel: no locks held by hald/3873.
Jul 26 13:09:19 fb07-iapwap2 kernel:
Jul 26 13:09:19 fb07-iapwap2 kernel: stack backtrace:
Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c013ccbc>] print_usage_bug+0x15c/0x170
Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c013d8da>] mark_lock+0x45a/0x630
Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c013e857>] __lock_acquire+0x607/0x1020
Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c013e987>] __lock_acquire+0x737/0x1020
Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c013f2cf>] lock_acquire+0x5f/0x80
Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c01dc166>] 
_atomic_dec_and_lock+0x16/0x60
Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c02d156a>] _spin_lock+0x2a/0x40
Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c01dc166>] 
_atomic_dec_and_lock+0x16/0x60
Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c01dc166>] 
_atomic_dec_and_lock+0x16/0x60
Jul 26 13:09:19 fb07-iapwap2 kernel:  [<dcecf770>] put_rpccred+0x60/0x110 
[sunrpc]
Jul 26 13:09:19 fb07-iapwap2 kernel:  [<dcf67e10>] 
nfs_free_delegation_callback+0x10/0x20 [nfs]
Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c012fa47>] 
__rcu_process_callbacks+0x67/0x1c0
Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c012fbaf>] 
rcu_process_callbacks+0xf/0x20
Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c0124293>] tasklet_action+0x33/0x70
Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c01241b5>] __do_softirq+0x55/0xb0
Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c0124255>] do_softirq+0x45/0x50
Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c0112490>] 
smp_apic_timer_interrupt+0x40/0x80
Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c0104ad9>] 
apic_timer_interrupt+0x29/0x38
Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c0104ae3>] 
apic_timer_interrupt+0x33/0x38
Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c02d18a2>] _spin_unlock_irq+0x22/0x30
Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c02cecb4>] 
__sched_text_start+0x1c4/0x390
Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c02cec29>] 
__sched_text_start+0x139/0x390
Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c0127feb>] __mod_timer+0x8b/0xa0
Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c02cf63a>] schedule_timeout+0x4a/0xc0
Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c0127b40>] process_timeout+0x0/0x10
Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c0186197>] do_sys_poll+0x237/0x320
Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c0186d70>] __pollwait+0x0/0xf0
Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c011aa50>] 
default_wake_function+0x0/0x10
Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c011aa50>] 
default_wake_function+0x0/0x10
Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c011aa50>] 
default_wake_function+0x0/0x10
Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c011aa50>] 
default_wake_function+0x0/0x10
Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c011aa50>] 
default_wake_function+0x0/0x10
Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c011aa50>] 
default_wake_function+0x0/0x10
Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c011aa50>] 
default_wake_function+0x0/0x10
Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c011aa50>] 
default_wake_function+0x0/0x10
Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c011aa50>] 
default_wake_function+0x0/0x10
Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c011aa50>] 
default_wake_function+0x0/0x10
Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c011aa50>] 
default_wake_function+0x0/0x10
Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c011aa50>] 
default_wake_function+0x0/0x10
Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c02d1444>] _read_unlock+0x14/0x20
Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c02cbae5>] 
unix_stream_sendmsg+0x1a5/0x350
Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c0263727>] sock_aio_write+0x117/0x140
Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c013e987>] __lock_acquire+0x737/0x1020
Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c0179874>] do_readv_writev+0x144/0x1c0
Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c0263610>] sock_aio_write+0x0/0x140
Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c0136ac1>] do_gettimeofday+0x31/0xf0
Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c0104108>] restore_nocheck+0x12/0x15
Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c013dca5>] 
trace_hardirqs_on+0xc5/0x170
Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c01e25af>] copy_to_user+0x3f/0x70
Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c01862a8>] sys_poll+0x28/0x40
Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c01040b2>] syscall_call+0x7/0xb
Jul 26 13:09:19 fb07-iapwap2 kernel:  =======================

marc

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: NFSv4 poops itself
  2007-07-27 13:36 ` Marc Dietrich
@ 2007-07-27 13:40   ` Jeff Garzik
  2007-07-27 14:45     ` Trond Myklebust
  2007-07-27 14:58   ` Satyam Sharma
  1 sibling, 1 reply; 15+ messages in thread
From: Jeff Garzik @ 2007-07-27 13:40 UTC (permalink / raw)
  To: Marc Dietrich; +Cc: kernel list, Trond.Myklebust, Andrew Morton


(please don't drop CC's when you reply to email; you are cutting 
relevant people out of the loop)


Marc Dietrich wrote:
> me too, my server has 2.6.18-? (openSUSE 10.2). On the client 
> (2.6.23-rc1-mm1), I also see (shortly before the hang)
> 
> Jul 26 13:09:19 fb07-iapwap2 kernel: =================================
> Jul 26 13:09:19 fb07-iapwap2 kernel: [ INFO: inconsistent lock state ]
> Jul 26 13:09:19 fb07-iapwap2 kernel: 2.6.23-rc1-mm1 #1
> Jul 26 13:09:19 fb07-iapwap2 kernel: ---------------------------------
> Jul 26 13:09:19 fb07-iapwap2 kernel: inconsistent {softirq-on-W} -> 
> {in-softirq-W} usage.
> Jul 26 13:09:19 fb07-iapwap2 kernel: hald/3873 [HC0[0]:SC1[1]:HE1:SE0] takes:
> Jul 26 13:09:19 fb07-iapwap2 kernel:  (rpc_credcache_lock){-+..}, at: 
> [<c01dc166>] _atomic_dec_and_lock+0x16/0x60
> Jul 26 13:09:19 fb07-iapwap2 kernel: {softirq-on-W} state was registered at:
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c013d4f7>] mark_lock+0x77/0x630
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c013c094>] add_lock_to_list+0x44/0xc0
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c013e8af>] 
> __lock_acquire+0x65f/0x1020
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c013db0e>] mark_held_locks+0x5e/0x80
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c012448d>] local_bh_enable+0x7d/0x130
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c013f2cf>] lock_acquire+0x5f/0x80
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c01dc166>] 
> _atomic_dec_and_lock+0x16/0x60
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c02d156a>] _spin_lock+0x2a/0x40
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c01dc166>] 
> _atomic_dec_and_lock+0x16/0x60
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c01dc166>] 
> _atomic_dec_and_lock+0x16/0x60
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c02d156a>] _spin_lock+0x2a/0x40
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<dcecf770>] put_rpccred+0x60/0x110 
> [sunrpc]
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<dcecf840>] 
> rpcauth_unbindcred+0x20/0x60 [sunrpc]
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<dcece1f4>] rpc_put_task+0x44/0xb0 
> [sunrpc]
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<dcec8ffd>] rpc_call_sync+0x2d/0x40 
> [sunrpc]
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<dced680d>] rpcb_register+0x10d/0x1c0 
> [sunrpc]
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<dced06ef>] svc_register+0x8f/0x160 
> [sunrpc]
[continues]

That's quite interesting.  I wonder if I enable lock debugging and 
tracing, if new stuff will appear.

	Jeff



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: NFSv4 poops itself
  2007-07-27 13:40   ` Jeff Garzik
@ 2007-07-27 14:45     ` Trond Myklebust
  0 siblings, 0 replies; 15+ messages in thread
From: Trond Myklebust @ 2007-07-27 14:45 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Marc Dietrich, kernel list, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 2616 bytes --]

On Fri, 2007-07-27 at 09:40 -0400, Jeff Garzik wrote:
> (please don't drop CC's when you reply to email; you are cutting 
> relevant people out of the loop)
> 
> 
> Marc Dietrich wrote:
> > me too, my server has 2.6.18-? (openSUSE 10.2). On the client 
> > (2.6.23-rc1-mm1), I also see (shortly before the hang)
> > 
> > Jul 26 13:09:19 fb07-iapwap2 kernel: =================================
> > Jul 26 13:09:19 fb07-iapwap2 kernel: [ INFO: inconsistent lock state ]
> > Jul 26 13:09:19 fb07-iapwap2 kernel: 2.6.23-rc1-mm1 #1
> > Jul 26 13:09:19 fb07-iapwap2 kernel: ---------------------------------
> > Jul 26 13:09:19 fb07-iapwap2 kernel: inconsistent {softirq-on-W} -> 
> > {in-softirq-W} usage.
> > Jul 26 13:09:19 fb07-iapwap2 kernel: hald/3873 [HC0[0]:SC1[1]:HE1:SE0] takes:
> > Jul 26 13:09:19 fb07-iapwap2 kernel:  (rpc_credcache_lock){-+..}, at: 
> > [<c01dc166>] _atomic_dec_and_lock+0x16/0x60
> > Jul 26 13:09:19 fb07-iapwap2 kernel: {softirq-on-W} state was registered at:
> > Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c013d4f7>] mark_lock+0x77/0x630
> > Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c013c094>] add_lock_to_list+0x44/0xc0
> > Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c013e8af>] 
> > __lock_acquire+0x65f/0x1020
> > Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c013db0e>] mark_held_locks+0x5e/0x80
> > Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c012448d>] local_bh_enable+0x7d/0x130
> > Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c013f2cf>] lock_acquire+0x5f/0x80
> > Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c01dc166>] 
> > _atomic_dec_and_lock+0x16/0x60
> > Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c02d156a>] _spin_lock+0x2a/0x40
> > Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c01dc166>] 
> > _atomic_dec_and_lock+0x16/0x60
> > Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c01dc166>] 
> > _atomic_dec_and_lock+0x16/0x60
> > Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c02d156a>] _spin_lock+0x2a/0x40
> > Jul 26 13:09:19 fb07-iapwap2 kernel:   [<dcecf770>] put_rpccred+0x60/0x110 
> > [sunrpc]
> > Jul 26 13:09:19 fb07-iapwap2 kernel:   [<dcecf840>] 
> > rpcauth_unbindcred+0x20/0x60 [sunrpc]
> > Jul 26 13:09:19 fb07-iapwap2 kernel:   [<dcece1f4>] rpc_put_task+0x44/0xb0 
> > [sunrpc]
> > Jul 26 13:09:19 fb07-iapwap2 kernel:   [<dcec8ffd>] rpc_call_sync+0x2d/0x40 
> > [sunrpc]
> > Jul 26 13:09:19 fb07-iapwap2 kernel:   [<dced680d>] rpcb_register+0x10d/0x1c0 
> > [sunrpc]
> > Jul 26 13:09:19 fb07-iapwap2 kernel:   [<dced06ef>] svc_register+0x8f/0x160 
> > [sunrpc]
> [continues]

That particular hang in rpciod_down we do have a fix for, but it is not
related to the issue you were seeing Jeff.

Trond


[-- Attachment #2: linux-2.6.23-001-fix_rpciod_down_race.dif --]
[-- Type: message/rfc822, Size: 3070 bytes --]

From: Trond Myklebust <Trond.Myklebust@netapp.com>
Subject: No Subject
Date: Thu, 19 Jul 2007 16:32:20 -0400
Message-ID: <1185547550.6586.25.camel@localhost>

The commit 4ada539ed77c7a2bbcb75cafbbd7bd8d2b9bef7b lead to the unpleasant
possibility of an asynchronous rpc_task being required to call
rpciod_down() when it is complete. This again means that the rpciod
workqueue may get to call destroy_workqueue on itself -> hang...

Change rpciod_up/rpciod_down to just get/put the module, and then
create/destroy the workqueues on module load/unload.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
---

 net/sunrpc/sched.c |   57 +++++++++++++++++++++-------------------------------
 1 files changed, 23 insertions(+), 34 deletions(-)

diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index b5723c2..954d7ec 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -50,8 +50,6 @@ static RPC_WAITQ(delay_queue, "delayq");
 /*
  * rpciod-related stuff
  */
-static DEFINE_MUTEX(rpciod_mutex);
-static atomic_t rpciod_users = ATOMIC_INIT(0);
 struct workqueue_struct *rpciod_workqueue;
 
 /*
@@ -961,60 +959,49 @@ void rpc_killall_tasks(struct rpc_clnt *clnt)
 	spin_unlock(&clnt->cl_lock);
 }
 
+int rpciod_up(void)
+{
+	return try_module_get(THIS_MODULE) ? 0 : -EINVAL;
+}
+
+void rpciod_down(void)
+{
+	module_put(THIS_MODULE);
+}
+
 /*
- * Start up the rpciod process if it's not already running.
+ * Start up the rpciod workqueue.
  */
-int
-rpciod_up(void)
+static int rpciod_start(void)
 {
 	struct workqueue_struct *wq;
-	int error = 0;
-
-	if (atomic_inc_not_zero(&rpciod_users))
-		return 0;
-
-	mutex_lock(&rpciod_mutex);
 
-	/* Guard against races with rpciod_down() */
-	if (rpciod_workqueue != NULL)
-		goto out_ok;
 	/*
 	 * Create the rpciod thread and wait for it to start.
 	 */
 	dprintk("RPC:       creating workqueue rpciod\n");
-	error = -ENOMEM;
 	wq = create_workqueue("rpciod");
-	if (wq == NULL)
-		goto out;
-
 	rpciod_workqueue = wq;
-	error = 0;
-out_ok:
-	atomic_inc(&rpciod_users);
-out:
-	mutex_unlock(&rpciod_mutex);
-	return error;
+	return rpciod_workqueue != NULL;
 }
 
-void
-rpciod_down(void)
+static void rpciod_stop(void)
 {
-	if (!atomic_dec_and_test(&rpciod_users))
-		return;
+	struct workqueue_struct *wq = NULL;
 
-	mutex_lock(&rpciod_mutex);
+	if (rpciod_workqueue == NULL)
+		return;
 	dprintk("RPC:       destroying workqueue rpciod\n");
 
-	if (atomic_read(&rpciod_users) == 0 && rpciod_workqueue != NULL) {
-		destroy_workqueue(rpciod_workqueue);
-		rpciod_workqueue = NULL;
-	}
-	mutex_unlock(&rpciod_mutex);
+	wq = rpciod_workqueue;
+	rpciod_workqueue = NULL;
+	destroy_workqueue(wq);
 }
 
 void
 rpc_destroy_mempool(void)
 {
+	rpciod_stop();
 	if (rpc_buffer_mempool)
 		mempool_destroy(rpc_buffer_mempool);
 	if (rpc_task_mempool)
@@ -1048,6 +1035,8 @@ rpc_init_mempool(void)
 						      rpc_buffer_slabp);
 	if (!rpc_buffer_mempool)
 		goto err_nomem;
+	if (!rpciod_start())
+		goto err_nomem;
 	return 0;
 err_nomem:
 	rpc_destroy_mempool();

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: NFSv4 poops itself
  2007-07-27 13:36 ` Marc Dietrich
  2007-07-27 13:40   ` Jeff Garzik
@ 2007-07-27 14:58   ` Satyam Sharma
  2007-07-27 17:58     ` Marc Dietrich
  1 sibling, 1 reply; 15+ messages in thread
From: Satyam Sharma @ 2007-07-27 14:58 UTC (permalink / raw)
  To: Marc Dietrich
  Cc: Jeff Garzik, kernel list, Trond Myklebust, Andrew Morton,
	Michal Piotrowski

On 7/27/07, Marc Dietrich <Marc.Dietrich@ap.physik.uni-giessen.de> wrote:
>
> Hi,
>
> Am Friday 27 July 2007 14:53 schrieben Sie:
> > Background:
> >
> > Server: x86-64 dual core Intel, kernel 2.6.23-rc1 (my home fileserver)
> >       Exporting NFS/NFSv4 mounts.  Client count: 1  Uptime: 4 days
> >
> > Client: x86-64 dual core Intel, kernel 2.6.23-rc1 (my main workstation)
> >       NFS mount setup:
> >       pretzel:/ on /g type nfs4 (rw,noatime,proto=tcp,addr=10.10.10.1)
> >       Uptime: 4 days
> >
> >       Home directory mounted via NFSv4.
> >
> > Problem:
> >
> > My workstation has been happily talking to my file server for several
> > days without incident.  An hour ago, my numeric keypad stopping working
> > (unrelated problem... USB or X bug?).  The solution to the keypad
> > problem is usually to log out of X and log back in, or worse case, reboot.
> >
> > So, I log out, and log back in.  At first, a few shell windows open and
> > successfully initialize themselves (read bash profile over NFS, etc.)
> > Then, as more shell windows open, things start hanging.  I can easily
> > switch to console and ssh to the fileserver, so it is clear this is an
> > NFS hang.
> >
> > No adverse messages at all on the client.
> >
> > On the server, I see NFSv4 spamming dmesg with hundreds of thousands of
> > messages:
> >
> > Jul 27 08:20:53 pretzel kernel: NFSD: preprocess_seqid_op: old stateid!
> > Jul 27 08:21:24 pretzel last message repeated 167966 times
> > Jul 27 08:21:55 pretzel last message repeated 173628 times
> > Jul 27 08:21:55 pretzel kernel: NFSD: preprocess_seqid_op: old stateid!
> > Jul 27 08:22:26 pretzel last message repeated 171286 times
> > Jul 27 08:23:27 pretzel last message repeated 344461 times
> > Jul 27 08:23:30 pretzel last message repeated 18656 times
> >
> > I rebooted the client, the problem disappeared, and everything is happy
> > again...  but clearly NFSv4 shat itself.  And now I am worried this will
> > happen again.
> >
> > In all my quite-heavy use of NFSv4 I've never seen this behavior before,
> >   so I would call this a regression.
> >
> > I always run vanilla linux-2.6.git self-built kernels on both client and
> > server.
>
> me too, my server has 2.6.18-? (openSUSE 10.2). On the client
> (2.6.23-rc1-mm1), I also see (shortly before the hang)
>
> Jul 26 13:09:19 fb07-iapwap2 kernel: =================================
> Jul 26 13:09:19 fb07-iapwap2 kernel: [ INFO: inconsistent lock state ]
> Jul 26 13:09:19 fb07-iapwap2 kernel: 2.6.23-rc1-mm1 #1
> Jul 26 13:09:19 fb07-iapwap2 kernel: ---------------------------------
> Jul 26 13:09:19 fb07-iapwap2 kernel: inconsistent {softirq-on-W} ->
> {in-softirq-W} usage.
> Jul 26 13:09:19 fb07-iapwap2 kernel: hald/3873 [HC0[0]:SC1[1]:HE1:SE0] takes:
> Jul 26 13:09:19 fb07-iapwap2 kernel:  (rpc_credcache_lock){-+..}, at:
> [<c01dc166>] _atomic_dec_and_lock+0x16/0x60
> Jul 26 13:09:19 fb07-iapwap2 kernel: {softirq-on-W} state was registered at:
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c013d4f7>] mark_lock+0x77/0x630
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c013c094>] add_lock_to_list+0x44/0xc0
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c013e8af>]
> __lock_acquire+0x65f/0x1020
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c013db0e>] mark_held_locks+0x5e/0x80
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c012448d>] local_bh_enable+0x7d/0x130
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c013f2cf>] lock_acquire+0x5f/0x80
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c01dc166>]
> _atomic_dec_and_lock+0x16/0x60
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c02d156a>] _spin_lock+0x2a/0x40
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c01dc166>]
> _atomic_dec_and_lock+0x16/0x60
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c01dc166>]
> _atomic_dec_and_lock+0x16/0x60
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c02d156a>] _spin_lock+0x2a/0x40
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<dcecf770>] put_rpccred+0x60/0x110
> [sunrpc]
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<dcecf840>]
> rpcauth_unbindcred+0x20/0x60 [sunrpc]
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<dcece1f4>] rpc_put_task+0x44/0xb0
> [sunrpc]
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<dcec8ffd>] rpc_call_sync+0x2d/0x40
> [sunrpc]
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<dced680d>] rpcb_register+0x10d/0x1c0
> [sunrpc]
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<dced06ef>] svc_register+0x8f/0x160
> [sunrpc]
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c013dca5>]
> trace_hardirqs_on+0xc5/0x170
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<dced0fc5>] __svc_create+0x1d5/0x200
> [sunrpc]
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<dced1160>] svc_create+0x10/0x20
> [sunrpc]
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<dcf69602>] nfs_callback_up+0x82/0x120
> [nfs]
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<dcf43efd>] nfs_get_client+0x17d/0x390
> [nfs]
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c0175f59>] kmem_cache_alloc+0x79/0xd0
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c013dca5>]
> trace_hardirqs_on+0xc5/0x170
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<dcf44144>] nfs4_set_client+0x34/0x1a0
> [nfs]
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<dcf448d1>]
> nfs4_create_server+0x61/0x3c0 [nfs]
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c017657f>] __kmalloc+0xdf/0x120
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c013dca5>]
> trace_hardirqs_on+0xc5/0x170
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c013e987>]
> __lock_acquire+0x737/0x1020
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c01e22cd>] copy_from_user+0x3d/0x80
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c01636e2>] strndup_user+0x62/0x80
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<dcf4d647>] nfs4_get_sb+0x3c7/0x610
> [nfs]
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c013e987>]
> __lock_acquire+0x737/0x1020
> Jul 26Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c015d536>]
> get_page_from_freelist+0x316/0x460
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c013dca5>]
> trace_hardirqs_on+0xc5/0x170
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c015e0be>] __alloc_pages+0x4e/0x370
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c017b380>] vfs_kern_mount+0xa0/0x120
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c0190b1d>] do_mount+0x21d/0x840
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c01667c7>]
> handle_mm_fault+0x4a7/0x690
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c02d1494>] _spin_unlock+0x14/0x20
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c0166773>]
> handle_mm_fault+0x453/0x690
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c0135c24>] up_read+0x14/0x30
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c0116ddd>] do_page_fault+0x1ed/0x6e0
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c018f597>]
> copy_mount_options+0xb7/0x150
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c01911b2>] sys_mount+0x72/0xb0
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c01040b2>] syscall_call+0x7/0xb
> Jul 26 13:09:19 fb07-iapwap2 kernel:   [<ffffffff>] 0xffffffff
> Jul 26 13:09:19 fb07-iapwap2 kernel: irq event stamp: 240896
> Jul 26 13:09:19 fb07-iapwap2 kernel: hardirqs last  enabled at (240896):
> [<c017592e>] kmem_cache_free+0x7e/0xb0
> Jul 26 13:09:19 fb07-iapwap2 kernel: hardirqs last disabled at (240895):
> [<c01758fc>] kmem_cache_free+0x4c/0xb0
> Jul 26 13:09:19 fb07-iapwap2 kernel: softirqs last  enabled at (240446):
> [<c0124255>] do_softirq+0x45/0x50
> Jul 26 13:09:19 fb07-iapwap2 kernel: softirqs last disabled at (240887):
> [<c0124255>] do_softirq+0x45/0x50
> Jul 26 13:09:19 fb07-iapwap2 kernel:
> Jul 26 13:09:19 fb07-iapwap2 kernel: other info that might help us debug this:
> Jul 26 13:09:19 fb07-iapwap2 kernel: no locks held by hald/3873.
> Jul 26 13:09:19 fb07-iapwap2 kernel:
> Jul 26 13:09:19 fb07-iapwap2 kernel: stack backtrace:
> Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c013ccbc>] print_usage_bug+0x15c/0x170
> Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c013d8da>] mark_lock+0x45a/0x630
> Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c013e857>] __lock_acquire+0x607/0x1020
> Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c013e987>] __lock_acquire+0x737/0x1020
> Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c013f2cf>] lock_acquire+0x5f/0x80
> Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c01dc166>]
> _atomic_dec_and_lock+0x16/0x60
> Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c02d156a>] _spin_lock+0x2a/0x40
> Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c01dc166>]
> _atomic_dec_and_lock+0x16/0x60
> Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c01dc166>]
> _atomic_dec_and_lock+0x16/0x60
> Jul 26 13:09:19 fb07-iapwap2 kernel:  [<dcecf770>] put_rpccred+0x60/0x110
> [sunrpc]
> Jul 26 13:09:19 fb07-iapwap2 kernel:  [<dcf67e10>]
> nfs_free_delegation_callback+0x10/0x20 [nfs]
> Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c012fa47>]
> __rcu_process_callbacks+0x67/0x1c0
> Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c012fbaf>]
> rcu_process_callbacks+0xf/0x20
> Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c0124293>] tasklet_action+0x33/0x70
> Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c01241b5>] __do_softirq+0x55/0xb0
> Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c0124255>] do_softirq+0x45/0x50
> Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c0112490>]
> smp_apic_timer_interrupt+0x40/0x80
> Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c0104ad9>]
> apic_timer_interrupt+0x29/0x38
> Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c0104ae3>]
> apic_timer_interrupt+0x33/0x38
> Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c02d18a2>] _spin_unlock_irq+0x22/0x30
> Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c02cecb4>]
> __sched_text_start+0x1c4/0x390
> Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c02cec29>]
> __sched_text_start+0x139/0x390
> Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c0127feb>] __mod_timer+0x8b/0xa0
> Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c02cf63a>] schedule_timeout+0x4a/0xc0
> Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c0127b40>] process_timeout+0x0/0x10
> Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c0186197>] do_sys_poll+0x237/0x320
> Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c0186d70>] __pollwait+0x0/0xf0
> Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c011aa50>]
> default_wake_function+0x0/0x10
> Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c011aa50>]
> default_wake_function+0x0/0x10
> Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c011aa50>]
> default_wake_function+0x0/0x10
> Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c011aa50>]
> default_wake_function+0x0/0x10
> Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c011aa50>]
> default_wake_function+0x0/0x10
> Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c011aa50>]
> default_wake_function+0x0/0x10
> Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c011aa50>]
> default_wake_function+0x0/0x10
> Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c011aa50>]
> default_wake_function+0x0/0x10
> Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c011aa50>]
> default_wake_function+0x0/0x10
> Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c011aa50>]
> default_wake_function+0x0/0x10
> Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c011aa50>]
> default_wake_function+0x0/0x10
> Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c011aa50>]
> default_wake_function+0x0/0x10
> Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c02d1444>] _read_unlock+0x14/0x20
> Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c02cbae5>]
> unix_stream_sendmsg+0x1a5/0x350
> Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c0263727>] sock_aio_write+0x117/0x140
> Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c013e987>] __lock_acquire+0x737/0x1020
> Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c0179874>] do_readv_writev+0x144/0x1c0
> Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c0263610>] sock_aio_write+0x0/0x140
> Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c0136ac1>] do_gettimeofday+0x31/0xf0
> Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c0104108>] restore_nocheck+0x12/0x15
> Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c013dca5>]
> trace_hardirqs_on+0xc5/0x170
> Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c01e25af>] copy_to_user+0x3f/0x70
> Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c01862a8>] sys_poll+0x28/0x40
> Jul 26 13:09:19 fb07-iapwap2 kernel:  [<c01040b2>] syscall_call+0x7/0xb
> Jul 26 13:09:19 fb07-iapwap2 kernel:  =======================

Doesn't appear related ... could you rebuild with CONFIG_FRAME_POINTER
and test + resend the above log? The trace up there looks quite ugly.

[ Unrelated, but I just realized from lib/Kconfig.debug that stuff in there
"select"s LOCKDEP, which itself "selects" frame pointers (on some archs).
But I thought "select" didn't chain properly like that? There's whole lots
of depending and selecting and non-user-visible options out there in
that file -- and I'm not sure all those selects and depends are really
even required to successfully build (?) ]


Satyam

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: NFSv4 poops itself
  2007-07-27 13:33 ` Trond Myklebust
@ 2007-07-27 17:09   ` Trond Myklebust
  2007-07-27 17:16     ` Jeff Garzik
  2007-07-27 18:29     ` Marc Dietrich
  0 siblings, 2 replies; 15+ messages in thread
From: Trond Myklebust @ 2007-07-27 17:09 UTC (permalink / raw)
  To: Jeff Garzik, Dr. J. Bruce Fields
  Cc: Linux Kernel Mailing List, Andrew Morton, Michal Piotrowski

On Fri, 2007-07-27 at 09:33 -0400, Trond Myklebust wrote:
> On Fri, 2007-07-27 at 08:53 -0400, Jeff Garzik wrote:
> > Background:
> > 
> > Server: x86-64 dual core Intel, kernel 2.6.23-rc1 (my home fileserver)
> > 	Exporting NFS/NFSv4 mounts.  Client count: 1  Uptime: 4 days
> > 
> > Client: x86-64 dual core Intel, kernel 2.6.23-rc1 (my main workstation)
> > 	NFS mount setup:
> > 	pretzel:/ on /g type nfs4 (rw,noatime,proto=tcp,addr=10.10.10.1)
> > 	Uptime: 4 days
> > 
> > 	Home directory mounted via NFSv4.
> > 
> > Problem:
> > 
> > My workstation has been happily talking to my file server for several 
> > days without incident.  An hour ago, my numeric keypad stopping working 
> > (unrelated problem... USB or X bug?).  The solution to the keypad 
> > problem is usually to log out of X and log back in, or worse case, reboot.
> > 
> > So, I log out, and log back in.  At first, a few shell windows open and 
> > successfully initialize themselves (read bash profile over NFS, etc.) 
> > Then, as more shell windows open, things start hanging.  I can easily 
> > switch to console and ssh to the fileserver, so it is clear this is an 
> > NFS hang.
> > 
> > No adverse messages at all on the client.
> > 
> > On the server, I see NFSv4 spamming dmesg with hundreds of thousands of 
> > messages:
> > 
> > Jul 27 08:20:53 pretzel kernel: NFSD: preprocess_seqid_op: old stateid!
> > Jul 27 08:21:24 pretzel last message repeated 167966 times
> > Jul 27 08:21:55 pretzel last message repeated 173628 times
> > Jul 27 08:21:55 pretzel kernel: NFSD: preprocess_seqid_op: old stateid!
> > Jul 27 08:22:26 pretzel last message repeated 171286 times
> > Jul 27 08:23:27 pretzel last message repeated 344461 times
> > Jul 27 08:23:30 pretzel last message repeated 18656 times

Jeff and Bruce, could you please try to reproduce the problem after
either applying patches 001 to 004 or just the single NFS_ALL patch
from 

   http://client.linux-nfs.org/Linux-2.6.x/2.6.23-rc1/

Cheers
  Trond

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: NFSv4 poops itself
  2007-07-27 17:09   ` Trond Myklebust
@ 2007-07-27 17:16     ` Jeff Garzik
  2007-07-27 17:29       ` Trond Myklebust
  2007-07-27 18:29     ` Marc Dietrich
  1 sibling, 1 reply; 15+ messages in thread
From: Jeff Garzik @ 2007-07-27 17:16 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Dr. J. Bruce Fields, Linux Kernel Mailing List, Andrew Morton,
	Michal Piotrowski

Trond Myklebust wrote:
> Jeff and Bruce, could you please try to reproduce the problem after
> either applying patches 001 to 004 or just the single NFS_ALL patch
> from 
> 
>    http://client.linux-nfs.org/Linux-2.6.x/2.6.23-rc1/


It's not easily reproducible here.  Any hints on triggering the behavior?

	Jeff



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: NFSv4 poops itself
  2007-07-27 17:16     ` Jeff Garzik
@ 2007-07-27 17:29       ` Trond Myklebust
  2007-08-03 19:13         ` J. Bruce Fields
  0 siblings, 1 reply; 15+ messages in thread
From: Trond Myklebust @ 2007-07-27 17:29 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Dr. J. Bruce Fields, Linux Kernel Mailing List, Andrew Morton,
	Michal Piotrowski

On Fri, 2007-07-27 at 13:16 -0400, Jeff Garzik wrote:
> Trond Myklebust wrote:
> > Jeff and Bruce, could you please try to reproduce the problem after
> > either applying patches 001 to 004 or just the single NFS_ALL patch
> > from 
> > 
> >    http://client.linux-nfs.org/Linux-2.6.x/2.6.23-rc1/
> 
> 
> It's not easily reproducible here.  Any hints on triggering the behavior?

I suspect that the problem may involve mixing O_RDWR with O_RDONLY
and/or O_WRONLY file access. Bruce might have a better suggestion,
though.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: NFSv4 poops itself
  2007-07-27 14:58   ` Satyam Sharma
@ 2007-07-27 17:58     ` Marc Dietrich
  2007-07-27 18:39       ` Trond Myklebust
  0 siblings, 1 reply; 15+ messages in thread
From: Marc Dietrich @ 2007-07-27 17:58 UTC (permalink / raw)
  To: Satyam Sharma
  Cc: Jeff Garzik, kernel list, Trond Myklebust, Andrew Morton,
	Michal Piotrowski


Hi again,

Am Freitag, 27. Juli 2007 16:58 schrieb Satyam Sharma:
> On 7/27/07, Marc Dietrich <Marc.Dietrich@ap.physik.uni-giessen.de> wrote:
> > Hi,
> >
> > (...)
> >
> > me too, my server has 2.6.18-? (openSUSE 10.2). On the client
> > (2.6.23-rc1-mm1), I also see (shortly before the hang)
> >
> > Jul 26 13:09:19 fb07-iapwap2 kernel: =================================
> > Jul 26 13:09:19 fb07-iapwap2 kernel: [ INFO: inconsistent lock state ]
> > Jul 26 13:09:19 fb07-iapwap2 kernel: 2.6.23-rc1-mm1 #1
> > Jul 26 13:09:19 fb07-iapwap2 kernel: ---------------------------------
> > Jul 26 13:09:19 fb07-iapwap2 kernel: inconsistent {softirq-on-W} ->
> > (cut away ugly dump)
>
> Doesn't appear related ... could you rebuild with CONFIG_FRAME_POINTER
> and test + resend the above log? The trace up there looks quite ugly.

this seems to be better ...

Jul 27 19:43:33 fb07-iapwap2 kernel: =================================
Jul 27 19:43:33 fb07-iapwap2 kernel: [ INFO: inconsistent lock state ]
Jul 27 19:43:33 fb07-iapwap2 kernel: 2.6.23-rc1-mm1 #1
Jul 27 19:43:33 fb07-iapwap2 kernel: ---------------------------------
Jul 27 19:43:33 fb07-iapwap2 kernel: inconsistent {softirq-on-W} -> 
{in-softirq-W} usage.
Jul 27 19:43:33 fb07-iapwap2 kernel: kdm/5909 [HC0[0]:SC1[1]:HE1:SE0] takes:
Jul 27 19:43:33 fb07-iapwap2 kernel:  (rpc_credcache_lock){-+..}, at: 
[<c01dba67>] _atomic_dec_and_lock+0x17/0x60
Jul 27 19:43:33 fb07-iapwap2 kernel: {softirq-on-W} state was registered at:
Jul 27 19:43:33 fb07-iapwap2 kernel:   [<c013e800>] 
__lock_acquire+0x650/0x1030
Jul 27 19:43:33 fb07-iapwap2 kernel:   [<c013f241>] lock_acquire+0x61/0x80
Jul 27 19:43:33 fb07-iapwap2 kernel:   [<c02db8dc>] _spin_lock+0x2c/0x40
Jul 27 19:43:33 fb07-iapwap2 kernel:   [<c01dba67>] 
_atomic_dec_and_lock+0x17/0x60
Jul 27 19:43:33 fb07-iapwap2 kernel:   [<dceca5fd>] put_rpccred+0x5d/0x100 
[sunrpc]
Jul 27 19:43:33 fb07-iapwap2 kernel:   [<dceca6c1>] 
rpcauth_unbindcred+0x21/0x60 [sunrpc]
Jul 27 19:43:33 fb07-iapwap2 kernel:   [<dcec8fd4>] rpc_put_task+0x44/0xa0 
[sunrpc]
Jul 27 19:43:33 fb07-iapwap2 kernel:   [<dcec3fe0>] rpc_call_sync+0x30/0x40 
[sunrpc]
Jul 27 19:43:33 fb07-iapwap2 kernel:   [<dced172b>] rpcb_register+0xdb/0x180 
[sunrpc]
Jul 27 19:43:33 fb07-iapwap2 kernel:   [<dcecb5b3>] svc_register+0x93/0x160 
[sunrpc]
Jul 27 19:43:33 fb07-iapwap2 kernel:   [<dcecbebe>] __svc_create+0x1ee/0x220 
[sunrpc]
Jul 27 19:43:33 fb07-iapwap2 kernel:   [<dcecc053>] svc_create+0x13/0x20 
[sunrpc]
Jul 27 19:43:33 fb07-iapwap2 kernel:   [<dcf62752>] nfs_callback_up+0x82/0x120 
[nfs]
Jul 27 19:43:33 fb07-iapwap2 kernel:   [<dcf3df36>] nfs_get_client+0x176/0x390 
[nfs]
Jul 27 19:43:33 fb07-iapwap2 kernel:   [<dcf3e181>] nfs4_set_client+0x31/0x190 
[nfs]
Jul 27 19:43:33 fb07-iapwap2 kernel:   [<dcf3e983>] 
nfs4_create_server+0x63/0x3b0 [nfs]
Jul 27 19:43:33 fb07-iapwap2 kernel:   [<dcf47426>] nfs4_get_sb+0x346/0x5b0 
[nfs]
Jul 27 19:43:33 fb07-iapwap2 kernel:   [<c017aa74>] vfs_kern_mount+0x94/0x110
Jul 27 19:43:33 fb07-iapwap2 kernel:   [<c01900c2>] do_mount+0x1f2/0x7d0
Jul 27 19:43:33 fb07-iapwap2 kernel:   [<c0190706>] sys_mount+0x66/0xa0
Jul 27 19:43:33 fb07-iapwap2 kernel:   [<c01040de>] syscall_call+0x7/0xb
Jul 27 19:43:33 fb07-iapwap2 kernel:   [<ffffffff>] 0xffffffff
Jul 27 19:43:33 fb07-iapwap2 kernel: irq event stamp: 4916
Jul 27 19:43:33 fb07-iapwap2 kernel: hardirqs last  enabled at (4916): 
[<c017518a>] kmem_cache_free+0x8a/0xc0
Jul 27 19:43:33 fb07-iapwap2 kernel: hardirqs last disabled at (4915): 
[<c0175152>] kmem_cache_free+0x52/0xc0
Jul 27 19:43:33 fb07-iapwap2 kernel: softirqs last  enabled at (4888): 
[<dcec9ff1>] rpc_wake_up_next+0x81/0x180 [sunrpc]
Jul 27 19:43:33 fb07-iapwap2 kernel: softirqs last disabled at (4905): 
[<c01241e7>] do_softirq+0x47/0x50
Jul 27 19:43:33 fb07-iapwap2 kernel:
Jul 27 19:43:33 fb07-iapwap2 kernel: other info that might help us debug this:
Jul 27 19:43:33 fb07-iapwap2 kernel: 1 lock held by kdm/5909:
Jul 27 19:43:33 fb07-iapwap2 kernel:  #0:  (&inode->i_mutex/1){--..}, at: 
[<c0182efd>] do_unlinkat+0x7d/0x140
Jul 27 19:43:33 fb07-iapwap2 kernel:
Jul 27 19:43:33 fb07-iapwap2 kernel: stack backtrace:
Jul 27 19:43:33 fb07-iapwap2 kernel:  [<c010506a>] 
show_trace_log_lvl+0x1a/0x30
Jul 27 19:43:33 fb07-iapwap2 kernel:  [<c0105c92>] show_trace+0x12/0x20
Jul 27 19:43:33 fb07-iapwap2 kernel:  [<c0105da5>] dump_stack+0x15/0x20
Jul 27 19:43:33 fb07-iapwap2 kernel:  [<c013cc53>] print_usage_bug+0x153/0x160
Jul 27 19:43:33 fb07-iapwap2 kernel:  [<c013d849>] mark_lock+0x449/0x620
Jul 27 19:43:33 fb07-iapwap2 kernel:  [<c013e7b4>] __lock_acquire+0x604/0x1030
Jul 27 19:43:33 fb07-iapwap2 kernel:  [<c013f241>] lock_acquire+0x61/0x80
Jul 27 19:43:33 fb07-iapwap2 kernel:  [<c02db8dc>] _spin_lock+0x2c/0x40
Jul 27 19:43:33 fb07-iapwap2 kernel:  [<c01dba67>] 
_atomic_dec_and_lock+0x17/0x60
Jul 27 19:43:33 fb07-iapwap2 kernel:  [<dceca5fd>] put_rpccred+0x5d/0x100 
[sunrpc]
Jul 27 19:43:33 fb07-iapwap2 kernel:  [<dcf60fb3>] 
nfs_free_delegation_callback+0x13/0x20 [nfs]
Jul 27 19:43:33 fb07-iapwap2 kernel:  [<c012fa3a>] 
__rcu_process_callbacks+0x6a/0x1c0
Jul 27 19:43:33 fb07-iapwap2 kernel:  [<c012fba2>] 
rcu_process_callbacks+0x12/0x30
Jul 27 19:43:33 fb07-iapwap2 kernel:  [<c0124228>] tasklet_action+0x38/0x80
Jul 27 19:43:33 fb07-iapwap2 kernel:  [<c0124135>] __do_softirq+0x55/0xc0
Jul 27 19:43:33 fb07-iapwap2 kernel:  [<c01241e7>] do_softirq+0x47/0x50
Jul 27 19:43:33 fb07-iapwap2 kernel:  [<c0124615>] irq_exit+0x35/0x40
Jul 27 19:43:33 fb07-iapwap2 kernel:  [<c0112463>] 
smp_apic_timer_interrupt+0x43/0x80
Jul 27 19:43:33 fb07-iapwap2 kernel:  [<c0104b0f>] 
apic_timer_interrupt+0x33/0x38
Jul 27 19:43:33 fb07-iapwap2 kernel:  [<dcecab53>] 
rpcauth_lookup_credcache+0x83/0x1c0 [sunrpc]
Jul 27 19:43:33 fb07-iapwap2 kernel:  [<dcecb1a8>] unx_lookup_cred+0x8/0x10 
[sunrpc]
Jul 27 19:43:33 fb07-iapwap2 kernel:  [<dceca8fe>] rpcauth_bindcred+0x5e/0xe0 
[sunrpc]
Jul 27 19:43:33 fb07-iapwap2 kernel:  [<dcec3e87>] rpc_call_setup+0x57/0x60 
[sunrpc]
Jul 27 19:43:33 fb07-iapwap2 kernel:  [<dcec3edb>] rpc_do_run_task+0x4b/0xb0 
[sunrpc]
Jul 27 19:43:33 fb07-iapwap2 kernel:  [<dcec3fce>] rpc_call_sync+0x1e/0x40 
[sunrpc]
Jul 27 19:43:33 fb07-iapwap2 kernel:  [<dcf53cba>] 
_nfs4_proc_lookupfh+0x6a/0xb0 [nfs]
Jul 27 19:43:33 fb07-iapwap2 kernel:  [<dcf55b89>] 
_nfs4_proc_lookup+0x49/0x1d0 [nfs]
Jul 27 19:43:33 fb07-iapwap2 kernel:  [<dcf55d42>] nfs4_proc_lookup+0x32/0x60 
[nfs]
Jul 27 19:43:33 fb07-iapwap2 kernel:  [<dcf3fe69>] nfs_lookup+0xa9/0x260 [nfs]
Jul 27 19:43:33 fb07-iapwap2 kernel:  [<dcf4226c>] 
nfs_atomic_lookup+0x10c/0x180 [nfs]
Jul 27 19:43:33 fb07-iapwap2 kernel:  [<c0180190>] lookup_hash+0x90/0xb0
Jul 27 19:43:33 fb07-iapwap2 kernel:  [<c0182f05>] do_unlinkat+0x85/0x140
Jul 27 19:43:33 fb07-iapwap2 kernel:  [<c0182fd0>] sys_unlink+0x10/0x20
Jul 27 19:43:33 fb07-iapwap2 kernel:  [<c01040de>] syscall_call+0x7/0xb
Jul 27 19:43:33 fb07-iapwap2 kernel:  =======================

this kernel is 2.6.23-rc1-mm1 with Trond's fix from this thread and this one. 
http://marc.info/?l=linux-kernel&m=118546628202419&w=2.

Problem (hang) is easy reproduceable here by just launching kde and open 
konsole.

> [ Unrelated, but I just realized from lib/Kconfig.debug that stuff in there
> "select"s LOCKDEP, which itself "selects" frame pointers (on some archs).
> But I thought "select" didn't chain properly like that? There's whole lots
> of depending and selecting and non-user-visible options out there in
> that file -- and I'm not sure all those selects and depends are really
> even required to successfully build (?) ]
>
>
> Satyam

marc


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: NFSv4 poops itself
  2007-07-27 17:09   ` Trond Myklebust
  2007-07-27 17:16     ` Jeff Garzik
@ 2007-07-27 18:29     ` Marc Dietrich
  1 sibling, 0 replies; 15+ messages in thread
From: Marc Dietrich @ 2007-07-27 18:29 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Jeff Garzik, Dr. J. Bruce Fields, Linux Kernel Mailing List,
	Andrew Morton, Michal Piotrowski


Hi,

Am Freitag, 27. Juli 2007 19:09 schrieb Trond Myklebust:
> On Fri, 2007-07-27 at 09:33 -0400, Trond Myklebust wrote:
> (...)
> Jeff and Bruce, could you please try to reproduce the problem after
> either applying patches 001 to 004 or just the single NFS_ALL patch
> from
>
>    http://client.linux-nfs.org/Linux-2.6.x/2.6.23-rc1/

as posted earlier, 001-002 does not fix it. Then I also applied 003+004 and it 
seems to be fixed now. kde boots up and I can start some programs, which I 
couldn't before. 

The "inconsistent lock state" error still exists, but it really seems to be 
unrelated.


> Cheers
>   Trond

thanks

Marc

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: NFSv4 poops itself
  2007-07-27 17:58     ` Marc Dietrich
@ 2007-07-27 18:39       ` Trond Myklebust
  0 siblings, 0 replies; 15+ messages in thread
From: Trond Myklebust @ 2007-07-27 18:39 UTC (permalink / raw)
  To: Marc Dietrich
  Cc: Satyam Sharma, Jeff Garzik, kernel list, Andrew Morton,
	Michal Piotrowski

On Fri, 2007-07-27 at 19:58 +0200, Marc Dietrich wrote:
> Hi again,
> 
> Am Freitag, 27. Juli 2007 16:58 schrieb Satyam Sharma:
> > On 7/27/07, Marc Dietrich <Marc.Dietrich@ap.physik.uni-giessen.de> wrote:
> > > Hi,
> > >
> > > (...)
> > >
> > > me too, my server has 2.6.18-? (openSUSE 10.2). On the client
> > > (2.6.23-rc1-mm1), I also see (shortly before the hang)
> > >
> > > Jul 26 13:09:19 fb07-iapwap2 kernel: =================================
> > > Jul 26 13:09:19 fb07-iapwap2 kernel: [ INFO: inconsistent lock state ]
> > > Jul 26 13:09:19 fb07-iapwap2 kernel: 2.6.23-rc1-mm1 #1
> > > Jul 26 13:09:19 fb07-iapwap2 kernel: ---------------------------------
> > > Jul 26 13:09:19 fb07-iapwap2 kernel: inconsistent {softirq-on-W} ->
> > > (cut away ugly dump)
> >
> > Doesn't appear related ... could you rebuild with CONFIG_FRAME_POINTER
> > and test + resend the above log? The trace up there looks quite ugly.
> 
> this seems to be better ...
> 
> Jul 27 19:43:33 fb07-iapwap2 kernel: =================================
> Jul 27 19:43:33 fb07-iapwap2 kernel: [ INFO: inconsistent lock state ]
> Jul 27 19:43:33 fb07-iapwap2 kernel: 2.6.23-rc1-mm1 #1
> Jul 27 19:43:33 fb07-iapwap2 kernel: ---------------------------------
> Jul 27 19:43:33 fb07-iapwap2 kernel: inconsistent {softirq-on-W} -> 
> {in-softirq-W} usage.
> Jul 27 19:43:33 fb07-iapwap2 kernel: kdm/5909 [HC0[0]:SC1[1]:HE1:SE0] takes:
> Jul 27 19:43:33 fb07-iapwap2 kernel:  (rpc_credcache_lock){-+..}, at: 
> [<c01dba67>] _atomic_dec_and_lock+0x17/0x60
> Jul 27 19:43:33 fb07-iapwap2 kernel: {softirq-on-W} state was registered at:
> Jul 27 19:43:33 fb07-iapwap2 kernel:   [<c013e800>] 
> __lock_acquire+0x650/0x1030
> Jul 27 19:43:33 fb07-iapwap2 kernel:   [<c013f241>] lock_acquire+0x61/0x80
> Jul 27 19:43:33 fb07-iapwap2 kernel:   [<c02db8dc>] _spin_lock+0x2c/0x40
> Jul 27 19:43:33 fb07-iapwap2 kernel:   [<c01dba67>] 
> _atomic_dec_and_lock+0x17/0x60
> Jul 27 19:43:33 fb07-iapwap2 kernel:   [<dceca5fd>] put_rpccred+0x5d/0x100 
> [sunrpc]
> Jul 27 19:43:33 fb07-iapwap2 kernel:   [<dceca6c1>] 
> rpcauth_unbindcred+0x21/0x60 [sunrpc]
> Jul 27 19:43:33 fb07-iapwap2 kernel:   [<dcec8fd4>] rpc_put_task+0x44/0xa0 
> [sunrpc]
> Jul 27 19:43:33 fb07-iapwap2 kernel:   [<dcec3fe0>] rpc_call_sync+0x30/0x40 
> [sunrpc]
> Jul 27 19:43:33 fb07-iapwap2 kernel:   [<dced172b>] rpcb_register+0xdb/0x180 
> [sunrpc]
> Jul 27 19:43:33 fb07-iapwap2 kernel:   [<dcecb5b3>] svc_register+0x93/0x160 
> [sunrpc]
> Jul 27 19:43:33 fb07-iapwap2 kernel:   [<dcecbebe>] __svc_create+0x1ee/0x220 
> [sunrpc]
> Jul 27 19:43:33 fb07-iapwap2 kernel:   [<dcecc053>] svc_create+0x13/0x20 
> [sunrpc]
> Jul 27 19:43:33 fb07-iapwap2 kernel:   [<dcf62752>] nfs_callback_up+0x82/0x120 
> [nfs]
> Jul 27 19:43:33 fb07-iapwap2 kernel:   [<dcf3df36>] nfs_get_client+0x176/0x390 
> [nfs]
> Jul 27 19:43:33 fb07-iapwap2 kernel:   [<dcf3e181>] nfs4_set_client+0x31/0x190 
> [nfs]
> Jul 27 19:43:33 fb07-iapwap2 kernel:   [<dcf3e983>] 
> nfs4_create_server+0x63/0x3b0 [nfs]
> Jul 27 19:43:33 fb07-iapwap2 kernel:   [<dcf47426>] nfs4_get_sb+0x346/0x5b0 
> [nfs]
> Jul 27 19:43:33 fb07-iapwap2 kernel:   [<c017aa74>] vfs_kern_mount+0x94/0x110
> Jul 27 19:43:33 fb07-iapwap2 kernel:   [<c01900c2>] do_mount+0x1f2/0x7d0
> Jul 27 19:43:33 fb07-iapwap2 kernel:   [<c0190706>] sys_mount+0x66/0xa0
> Jul 27 19:43:33 fb07-iapwap2 kernel:   [<c01040de>] syscall_call+0x7/0xb
> Jul 27 19:43:33 fb07-iapwap2 kernel:   [<ffffffff>] 0xffffffff
> Jul 27 19:43:33 fb07-iapwap2 kernel: irq event stamp: 4916
> Jul 27 19:43:33 fb07-iapwap2 kernel: hardirqs last  enabled at (4916): 
> [<c017518a>] kmem_cache_free+0x8a/0xc0
> Jul 27 19:43:33 fb07-iapwap2 kernel: hardirqs last disabled at (4915): 
> [<c0175152>] kmem_cache_free+0x52/0xc0
> Jul 27 19:43:33 fb07-iapwap2 kernel: softirqs last  enabled at (4888): 
> [<dcec9ff1>] rpc_wake_up_next+0x81/0x180 [sunrpc]
> Jul 27 19:43:33 fb07-iapwap2 kernel: softirqs last disabled at (4905): 
> [<c01241e7>] do_softirq+0x47/0x50
> Jul 27 19:43:33 fb07-iapwap2 kernel:
> Jul 27 19:43:33 fb07-iapwap2 kernel: other info that might help us debug this:
> Jul 27 19:43:33 fb07-iapwap2 kernel: 1 lock held by kdm/5909:
> Jul 27 19:43:33 fb07-iapwap2 kernel:  #0:  (&inode->i_mutex/1){--..}, at: 
> [<c0182efd>] do_unlinkat+0x7d/0x140
> Jul 27 19:43:33 fb07-iapwap2 kernel:
> Jul 27 19:43:33 fb07-iapwap2 kernel: stack backtrace:
> Jul 27 19:43:33 fb07-iapwap2 kernel:  [<c010506a>] 
> show_trace_log_lvl+0x1a/0x30
> Jul 27 19:43:33 fb07-iapwap2 kernel:  [<c0105c92>] show_trace+0x12/0x20
> Jul 27 19:43:33 fb07-iapwap2 kernel:  [<c0105da5>] dump_stack+0x15/0x20
> Jul 27 19:43:33 fb07-iapwap2 kernel:  [<c013cc53>] print_usage_bug+0x153/0x160
> Jul 27 19:43:33 fb07-iapwap2 kernel:  [<c013d849>] mark_lock+0x449/0x620
> Jul 27 19:43:33 fb07-iapwap2 kernel:  [<c013e7b4>] __lock_acquire+0x604/0x1030
> Jul 27 19:43:33 fb07-iapwap2 kernel:  [<c013f241>] lock_acquire+0x61/0x80
> Jul 27 19:43:33 fb07-iapwap2 kernel:  [<c02db8dc>] _spin_lock+0x2c/0x40
> Jul 27 19:43:33 fb07-iapwap2 kernel:  [<c01dba67>] 
> _atomic_dec_and_lock+0x17/0x60
> Jul 27 19:43:33 fb07-iapwap2 kernel:  [<dceca5fd>] put_rpccred+0x5d/0x100 
> [sunrpc]
> Jul 27 19:43:33 fb07-iapwap2 kernel:  [<dcf60fb3>] 
> nfs_free_delegation_callback+0x13/0x20 [nfs]
> Jul 27 19:43:33 fb07-iapwap2 kernel:  [<c012fa3a>] 
> __rcu_process_callbacks+0x6a/0x1c0
> Jul 27 19:43:33 fb07-iapwap2 kernel:  [<c012fba2>] 
> rcu_process_callbacks+0x12/0x30
> Jul 27 19:43:33 fb07-iapwap2 kernel:  [<c0124228>] tasklet_action+0x38/0x80
> Jul 27 19:43:33 fb07-iapwap2 kernel:  [<c0124135>] __do_softirq+0x55/0xc0
> Jul 27 19:43:33 fb07-iapwap2 kernel:  [<c01241e7>] do_softirq+0x47/0x50
> Jul 27 19:43:33 fb07-iapwap2 kernel:  [<c0124615>] irq_exit+0x35/0x40
> Jul 27 19:43:33 fb07-iapwap2 kernel:  [<c0112463>] 
> smp_apic_timer_interrupt+0x43/0x80
> Jul 27 19:43:33 fb07-iapwap2 kernel:  [<c0104b0f>] 
> apic_timer_interrupt+0x33/0x38
> Jul 27 19:43:33 fb07-iapwap2 kernel:  [<dcecab53>] 
> rpcauth_lookup_credcache+0x83/0x1c0 [sunrpc]
> Jul 27 19:43:33 fb07-iapwap2 kernel:  [<dcecb1a8>] unx_lookup_cred+0x8/0x10 
> [sunrpc]
> Jul 27 19:43:33 fb07-iapwap2 kernel:  [<dceca8fe>] rpcauth_bindcred+0x5e/0xe0 
> [sunrpc]
> Jul 27 19:43:33 fb07-iapwap2 kernel:  [<dcec3e87>] rpc_call_setup+0x57/0x60 
> [sunrpc]
> Jul 27 19:43:33 fb07-iapwap2 kernel:  [<dcec3edb>] rpc_do_run_task+0x4b/0xb0 
> [sunrpc]
> Jul 27 19:43:33 fb07-iapwap2 kernel:  [<dcec3fce>] rpc_call_sync+0x1e/0x40 
> [sunrpc]
> Jul 27 19:43:33 fb07-iapwap2 kernel:  [<dcf53cba>] 
> _nfs4_proc_lookupfh+0x6a/0xb0 [nfs]
> Jul 27 19:43:33 fb07-iapwap2 kernel:  [<dcf55b89>] 
> _nfs4_proc_lookup+0x49/0x1d0 [nfs]
> Jul 27 19:43:33 fb07-iapwap2 kernel:  [<dcf55d42>] nfs4_proc_lookup+0x32/0x60 
> [nfs]
> Jul 27 19:43:33 fb07-iapwap2 kernel:  [<dcf3fe69>] nfs_lookup+0xa9/0x260 [nfs]
> Jul 27 19:43:33 fb07-iapwap2 kernel:  [<dcf4226c>] 
> nfs_atomic_lookup+0x10c/0x180 [nfs]
> Jul 27 19:43:33 fb07-iapwap2 kernel:  [<c0180190>] lookup_hash+0x90/0xb0
> Jul 27 19:43:33 fb07-iapwap2 kernel:  [<c0182f05>] do_unlinkat+0x85/0x140
> Jul 27 19:43:33 fb07-iapwap2 kernel:  [<c0182fd0>] sys_unlink+0x10/0x20
> Jul 27 19:43:33 fb07-iapwap2 kernel:  [<c01040de>] syscall_call+0x7/0xb
> Jul 27 19:43:33 fb07-iapwap2 kernel:  =======================
> 
> this kernel is 2.6.23-rc1-mm1 with Trond's fix from this thread and this one. 
> http://marc.info/?l=linux-kernel&m=118546628202419&w=2.
> 
> Problem (hang) is easy reproduceable here by just launching kde and open 
> konsole.

That is very odd. I suspect the memory shrinker, but am at a loss to
explain why that would be called from a softirq context.

Could you please see if you can reproduce the above problem with
2.6.23-rc1 + the 4 patches?

Cheers
  Trond

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: NFSv4 poops itself
  2007-07-27 17:29       ` Trond Myklebust
@ 2007-08-03 19:13         ` J. Bruce Fields
  2007-08-06 13:40           ` J. Bruce Fields
  0 siblings, 1 reply; 15+ messages in thread
From: J. Bruce Fields @ 2007-08-03 19:13 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Jeff Garzik, Linux Kernel Mailing List, Andrew Morton,
	Michal Piotrowski

On Fri, Jul 27, 2007 at 01:29:00PM -0400, Trond Myklebust wrote:
> On Fri, 2007-07-27 at 13:16 -0400, Jeff Garzik wrote:
> > Trond Myklebust wrote:
> > > Jeff and Bruce, could you please try to reproduce the problem after
> > > either applying patches 001 to 004 or just the single NFS_ALL patch
> > > from 
> > > 
> > >    http://client.linux-nfs.org/Linux-2.6.x/2.6.23-rc1/
> > 
> > 
> > It's not easily reproducible here.  Any hints on triggering the behavior?
> 
> I suspect that the problem may involve mixing O_RDWR with O_RDONLY
> and/or O_WRONLY file access. Bruce might have a better suggestion,
> though.

I haven't done any more work to identify when exactly the problem is
triggered, but I did confirm that I could reproduce the problem reliably
with current git, but *not* with current git plus your 4-patch nfs-all.

--b.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: NFSv4 poops itself
  2007-08-03 19:13         ` J. Bruce Fields
@ 2007-08-06 13:40           ` J. Bruce Fields
  2007-08-06 18:40             ` Trond Myklebust
  0 siblings, 1 reply; 15+ messages in thread
From: J. Bruce Fields @ 2007-08-06 13:40 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Jeff Garzik, Linux Kernel Mailing List, Andrew Morton,
	Michal Piotrowski

On Fri, Aug 03, 2007 at 03:13:36PM -0400, J. Bruce Fields wrote:
> On Fri, Jul 27, 2007 at 01:29:00PM -0400, Trond Myklebust wrote:
> > On Fri, 2007-07-27 at 13:16 -0400, Jeff Garzik wrote:
> > > Trond Myklebust wrote:
> > > > Jeff and Bruce, could you please try to reproduce the problem after
> > > > either applying patches 001 to 004 or just the single NFS_ALL patch
> > > > from 
> > > > 
> > > >    http://client.linux-nfs.org/Linux-2.6.x/2.6.23-rc1/
> > > 
> > > 
> > > It's not easily reproducible here.  Any hints on triggering the behavior?
> > 
> > I suspect that the problem may involve mixing O_RDWR with O_RDONLY
> > and/or O_WRONLY file access. Bruce might have a better suggestion,
> > though.
> 
> I haven't done any more work to identify when exactly the problem is
> triggered, but I did confirm that I could reproduce the problem reliably
> with current git, but *not* with current git plus your 4-patch nfs-all.

And that result still holds after using the patched kernel for several
more hours.

--b.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: NFSv4 poops itself
  2007-08-06 13:40           ` J. Bruce Fields
@ 2007-08-06 18:40             ` Trond Myklebust
  0 siblings, 0 replies; 15+ messages in thread
From: Trond Myklebust @ 2007-08-06 18:40 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Jeff Garzik, Linux Kernel Mailing List, Andrew Morton,
	Michal Piotrowski

On Mon, 2007-08-06 at 09:40 -0400, J. Bruce Fields wrote:
> On Fri, Aug 03, 2007 at 03:13:36PM -0400, J. Bruce Fields wrote:
> > On Fri, Jul 27, 2007 at 01:29:00PM -0400, Trond Myklebust wrote:
> > > On Fri, 2007-07-27 at 13:16 -0400, Jeff Garzik wrote:
> > > > Trond Myklebust wrote:
> > > > > Jeff and Bruce, could you please try to reproduce the problem after
> > > > > either applying patches 001 to 004 or just the single NFS_ALL patch
> > > > > from 
> > > > > 
> > > > >    http://client.linux-nfs.org/Linux-2.6.x/2.6.23-rc1/
> > > > 
> > > > 
> > > > It's not easily reproducible here.  Any hints on triggering the behavior?
> > > 
> > > I suspect that the problem may involve mixing O_RDWR with O_RDONLY
> > > and/or O_WRONLY file access. Bruce might have a better suggestion,
> > > though.
> > 
> > I haven't done any more work to identify when exactly the problem is
> > triggered, but I did confirm that I could reproduce the problem reliably
> > with current git, but *not* with current git plus your 4-patch nfs-all.
> 
> And that result still holds after using the patched kernel for several
> more hours.

Thanks Bruce!
  Trond

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2007-08-06 18:40 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-07-27 12:53 NFSv4 poops itself Jeff Garzik
2007-07-27 13:33 ` Trond Myklebust
2007-07-27 17:09   ` Trond Myklebust
2007-07-27 17:16     ` Jeff Garzik
2007-07-27 17:29       ` Trond Myklebust
2007-08-03 19:13         ` J. Bruce Fields
2007-08-06 13:40           ` J. Bruce Fields
2007-08-06 18:40             ` Trond Myklebust
2007-07-27 18:29     ` Marc Dietrich
2007-07-27 13:36 ` Marc Dietrich
2007-07-27 13:40   ` Jeff Garzik
2007-07-27 14:45     ` Trond Myklebust
2007-07-27 14:58   ` Satyam Sharma
2007-07-27 17:58     ` Marc Dietrich
2007-07-27 18:39       ` Trond Myklebust

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox