public inbox for linux-kernel@vger.kernel.org
* panic: "attempting to free lock on active lock list"
@ 2006-04-12 19:24 Joe Pranevich
From: Joe Pranevich @ 2006-04-12 19:24 UTC
  To: linux-kernel; +Cc: jpranevich

Hello,

I'm running a Red Hat/CentOS-modified 2.6.9 kernel (-34.EL) in an
extremely busy web-service-on-NFS environment. (Tons of small files,
user homepages,
and other things I try not to consider.) I know I'm not running the 
latest 2.6.16 kernel on these boxes, so if the immediate response is to 
go back and do that, I will do so. However, I am really hoping that one 
of the folks here will see this message and say "I remember that bug!" 
and be able to point me at a patch. I've done a thorough mailing list 
search and have tried some of the suggestions that I found, so please 
read below.

My systems are all SMP and I can reproduce this on 2x 800 MHz Compaq
boxes and 2x 1.3 GHz IBM boxes. I haven't tried it on any UP boxes, nor
have I tried it with a UP kernel on an SMP machine.

The boxes were hanging hard with an "Attempting to free lock on active
lock list" panic message, with no further debugging information. After
digging around the mailing lists, my best guess is a poor interaction
between NFS and the FS layer, but it could also be just collateral
damage from some other problem. Per a suggestion from Chris Wright on
LKML back in Jan 2005, I changed the panic() calls to BUG_ONs so that
more diagnostics are printed. That gives me this crash dump:

Apr 11 14:15:01 bos-tri-members36 kernel: Attempting to free lock on active lock list------------[ cut here ]------------
Apr 11 14:15:01 bos-tri-members36 kernel: kernel BUG at fs/locks.c:173!
Apr 11 14:15:01 bos-tri-members36 kernel: invalid operand: 0000 [#1]
Apr 11 14:15:01 bos-tri-members36 kernel: SMP 
Apr 11 14:15:01 bos-tri-members36 kernel: Modules linked in: iptable_filter ip_tables md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core nfs lockd sunrpc dm_mirror dm_mod button battery ac ohci_hcd tg3 floppy sg ext3 jbd mptscsih mptbase sd_mod scsi_mod
Apr 11 14:15:01 bos-tri-members36 kernel: CPU:    1
Apr 11 14:15:01 bos-tri-members36 kernel: EIP:    0060:[<c016c0ba>]    Not tainted VLI
Apr 11 14:15:01 bos-tri-members36 kernel: EFLAGS: 00010216   (2.6.9-22.0.3.EL.lycossmp) 
Apr 11 14:15:01 bos-tri-members36 kernel: EIP is at __posix_lock_file+0x56a/0x5b6
Apr 11 14:15:01 bos-tri-members36 kernel: eax: 0000002b   ebx: c02e6108   ecx: f5b19ee0   edx: c02e6108
Apr 11 14:15:01 bos-tri-members36 kernel: esi: 00000000   edi: d2535c0c   ebp: 00000000   esp: f5b19ee0
Apr 11 14:15:01 bos-tri-members36 kernel: ds: 007b   es: 007b   ss: 0068
Apr 11 14:15:01 bos-tri-members36 kernel: Process sendmail (pid: 23465, threadinfo=f5b19000 task=f4a32030)
Apr 11 14:15:01 bos-tri-members36 kernel: Stack: f7f55f80 e24cbd5c 00000000 00000000 00000000 00000000 00000000 00000000 
Apr 11 14:15:01 bos-tri-members36 kernel:        f5b80c68 00000000 00000000 d2535c0c 00000000 d2535c0c d2535c0c f5b19f78 
Apr 11 14:15:01 bos-tri-members36 kernel:        00000000 ce085c80 c016cfb1 00000000 f5b80bc0 00000007 443bf225 00000000 
Apr 11 14:15:01 bos-tri-members36 kernel: Call Trace:
Apr 11 14:15:01 bos-tri-members36 kernel:  [<c016cfb1>] fcntl_setlk+0x169/0x2b2
Apr 11 14:15:01 bos-tri-members36 kernel:  [<c0161f9b>] sys_fstat64+0x1e/0x23
Apr 11 14:15:01 bos-tri-members36 kernel:  [<c0169307>] do_fcntl+0x10c/0x155
Apr 11 14:15:01 bos-tri-members36 kernel:  [<c0169416>] sys_fcntl64+0x6c/0x7d
Apr 11 14:15:02 bos-tri-members36 kernel:  [<c02d14d7>] syscall_call+0x7/0xb
Apr 11 14:15:02 bos-tri-members36 kernel:  [<c02d007b>] _spin_lock+0x2e/0x34
Apr 11 14:15:02 bos-tri-members36 kernel: Code: 2e c0 e8 ac 61 fb ff 5f 0f 0b a8 00 ce 60 2e c0 8b 44 24 2c 8b 7c 24 2c 83 c0 04 39 47 04 74 13 68 08 61 2e c0 e8 89 61 fb ff 5b <0f> 0b ad 00 ce 60 2e c0 8b 54 24 2c 8b 42 4c 85 c0 74 18 8b 50 
Apr 11 14:15:02 bos-tri-members36 kernel:  <0>Fatal exception: panic in 5 seconds
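
For reference, the call trace above enters through sys_fcntl64, do_fcntl
and fcntl_setlk, which is the path a user-space fcntl() POSIX lock
request takes. Below is a minimal sketch of a test program that
exercises that path with several processes contending for one file. The
function names (lock_unlock_cycles, run_lock_workers), the worker and
iteration counts, and the choice of a whole-file write lock are
assumptions made for illustration. Nothing here is taken from sendmail,
and this is not a confirmed reproducer for the panic.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: take and release a whole-file POSIX write lock
 * `iterations` times. Returns the number of completed lock/unlock
 * cycles, or -1 if the file cannot be opened. */
int lock_unlock_cycles(const char *path, int iterations)
{
    assert(path != NULL && iterations >= 0);

    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return -1;

    int done = 0;
    for (int i = 0; i < iterations; i++) {
        struct flock fl = {0};
        fl.l_type = F_WRLCK;
        fl.l_whence = SEEK_SET;
        fl.l_start = 0;
        fl.l_len = 0;               /* 0 means "through end of file" */

        /* F_SETLKW blocks until the lock is granted. */
        if (fcntl(fd, F_SETLKW, &fl) < 0)
            break;

        fl.l_type = F_UNLCK;        /* release the same range */
        if (fcntl(fd, F_SETLK, &fl) < 0)
            break;
        done++;
    }
    close(fd);
    return done;
}

/* Hypothetical harness: fork `workers` children that all contend for
 * the lock on `path`. Returns how many children finished every cycle. */
int run_lock_workers(const char *path, int workers, int iterations)
{
    for (int w = 0; w < workers; w++) {
        pid_t pid = fork();
        if (pid < 0)
            break;                  /* stop spawning, still reap below */
        if (pid == 0)
            _exit(lock_unlock_cycles(path, iterations) == iterations ? 0 : 1);
    }

    int ok = 0, status;
    while (wait(&status) > 0)       /* reap children until none remain */
        if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
            ok++;
    return ok;
}
```

Calling something like run_lock_workers("/path/on/nfs/lockfile", 8,
100000), where the path is a placeholder for a file on the affected NFS
mount, would put the NFS client in the loop. On a local filesystem it
only exercises fs/locks.c.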

Later in that same thread, Trond Myklebust provided a patch to change 
posix_lock_file() to posix_lock_file_wait() and that solved that 
specific user's problems. However, that small patch is already 
back-ported to this 2.6.9 kernel. (You know how those distributions 
are...) I've searched the mailing lists for other applicable posts, but 
have come up empty.

I have a very large number of servers running this same version of
CentOS, and only the machines with the highest NFS loads appear to
trigger this problem. However, I'm not positive that it is
load-related: reducing the load by 30% (by adding more servers) didn't do
the trick and it's difficult for me to tell if the problem just happens 
less now. (I have an ad-hoc watchdog system that flips the power on the 
boxes several minutes after they stop responding to pings. Desperate 
times...) I also have many other servers that do similar workloads that 
are unaffected, so it could be a specific condition in our software 
which is causing the panic(), but I can't trace it.

Thinking it was NFS, we have tried a variety of combinations including 
using "nolock", dialing down to NFSv2, using UDP-only, etc. None of 
those change the situation. I haven't taken the machine down to one 
processor because they tend to melt under the load, but I can try. I've 
also found a mailing list post that suggested using an older
Red Hat/CentOS kernel, which worked for someone else, but that didn't work
for me, either. (The NFS server is a NetApp, but I don't have any other
units I can test against with the same load characteristics. There are 
many TB of data involved.)

Does anyone have any suggestions? I don't mind getting my hands dirty 
testing ideas.

Thanks very much for your assistance,

Joe Pranevich

