mountd randomly crash and panic the server

All of lore.kernel.org
 help / color / mirror / Atom feed

* mountd randomly crash and panic the server
@ 2007-04-11 15:11 Gabriel Barazer
  2007-04-16  9:26 ` Gabriel Barazer
  0 siblings, 1 reply; 10+ messages in thread
From: Gabriel Barazer @ 2007-04-11 15:11 UTC (permalink / raw)
  To: nfs

Hello there,

I recently got a problem probably due to rpc.mountd (from nfs-utils 
1.0.7). I have a file server running linux 2.6.20.3 and it crashed 4 
times since march, 1st under a quite heavy load with the following 
kernel errors in the syslog (which I have read after the crash, because 
kernel probably panic()ed ano network access was possible)

Mar 19 19:52:57 filer1 kernel: general protection fault: 0000 [1] SMP
Mar 19 19:52:57 filer1 kernel: CPU 3
Mar 19 19:52:57 filer1 kernel: Modules linked in:
Mar 19 19:52:57 filer1 kernel: Pid: 2238, comm: rpc.mountd Not tainted 2.6.20.3 #1
Mar 19 19:52:57 filer1 kernel: RIP: 0010:[<ffffffff8053606f>]  [<ffffffff8053606f>] cache_clean+0x11e/0x22f
Mar 19 19:52:57 filer1 kernel: RSP: 0018:ffff810068da7b68  EFLAGS: 00010206
Mar 19 19:52:57 filer1 kernel: RAX: ffffffff806a0e80 RBX: 0001e71926010009 RCX: ffffffff806a0e80
Mar 19 19:52:57 filer1 kernel: RDX: ffffffff806a0e80 RSI: 0000000000000071 RDI: ffffffff806a0e98
Mar 19 19:52:57 filer1 kernel: RBP: ffff8100090ad7c0 R08: 0000000000000000 R09: 000000000000006b
Mar 19 19:52:57 filer1 kernel: R10: 0000000000000000 R11: 0000000000000005 R12: 0000000000000000
Mar 19 19:52:57 filer1 kernel: R13: ffff810068da7e08 R14: 0000000000000006 R15: ffff81001669c000
Mar 19 19:52:57 filer1 kernel: FS:  00002ab0119c36e0(0000) GS:ffff81007ff397c0(0000) knlGS:0000000000000000
Mar 19 19:52:57 filer1 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Mar 19 19:52:57 filer1 kernel: CR2: 00002ab011474000 CR3: 000000007e228000 CR4: 00000000000006e0
Mar 19 19:52:57 filer1 kernel: Process rpc.mountd (pid: 2238, threadinfo ffff810068da6000, task ffff81006b5a6400)
Mar 19 19:52:57 filer1 kernel: Stack:  ffff810050a9ae40 0000000000000000 ffff810068da7da8 ffffffff8053618d
Mar 19 19:52:57 filer1 kernel:  ffff810050a9ae40 ffffffff803489f8 00000000fffffffa ffff810027312840
Mar 19 19:52:57 filer1 kernel:  ffff810025bb2400 ffff810025bb2400 ffff810068da7c69 ffff810068da7caa
Mar 19 19:52:57 filer1 kernel: Call Trace:
Mar 19 19:52:57 filer1 kernel:  [<ffffffff8053618d>] cache_flush+0xd/0x23
Mar 19 19:52:57 filer1 kernel:  [<ffffffff803489f8>] svc_export_parse+0x5c2/0x650
Mar 19 19:52:57 filer1 kernel:  [<ffffffff8020d338>] current_fs_time+0x4d/0x52
Mar 19 19:52:57 filer1 kernel:  [<ffffffff803dc1a5>] elv_next_request+0x40/0x14f
Mar 19 19:52:57 filer1 kernel:  [<ffffffff80224ccd>] sync_page+0x0/0x41
Mar 19 19:52:57 filer1 kernel:  [<ffffffff805367fe>] cache_write+0x90/0xac
Mar 19 19:52:57 filer1 kernel:  [<ffffffff80213f40>] vfs_write+0xaf/0x151
Mar 19 19:52:57 filer1 kernel:  [<ffffffff802148d0>] sys_write+0x45/0x6e
Mar 19 19:52:57 filer1 kernel:  [<ffffffff8025411e>] system_call+0x7e/0x83
Mar 19 19:52:57 filer1 kernel:
Mar 19 19:52:57 filer1 kernel:
Mar 19 19:52:57 filer1 kernel: Code: 48 8b 43 08 48 39 82 80 00 00 00 7e 0a 48 ff c0 48 89 82 80
Mar 19 19:52:57 filer1 kernel: RIP  [<ffffffff8053606f>] cache_clean+0x11e/0x22f
Mar 19 19:52:57 filer1 kernel:  RSP <ffff810068da7b68>

The three other times , there was this error :

Apr  2 18:30:00 filer1 kernel: general protection fault: 0000 [1] SMP
Apr  2 18:30:00 filer1 kernel: CPU 2
Apr  2 18:30:00 filer1 kernel: Modules linked in:
Apr  2 18:30:00 filer1 kernel: Pid: 891, comm: rpc.mountd Not tainted 2.6.20.3 #1
Apr  2 18:30:00 filer1 kernel: RIP: 0010:[<ffffffff8053606f>]  [<ffffffff8053606f>] cache_clean+0x11e/0x22f
Apr  2 18:30:00 filer1 kernel: RSP: 0018:ffff8100778ffe18  EFLAGS: 00010202
Apr  2 18:30:00 filer1 kernel: RAX: ffffffff806a0e80 RBX: 2d305f742d315f61 RCX: ffffffff806a0e80
Apr  2 18:30:00 filer1 kernel: RDX: ffffffff806a0e80 RSI: 0000000000000070 RDI: ffffffff806a0e98
Apr  2 18:30:00 filer1 kernel: RBP: ffff810008664c40 R08: ffffffff807783a0 R09: 00000000000000fc
Apr  2 18:30:00 filer1 kernel: R10: 000000000000000b R11: 0000000000000027 R12: 0000000000000000
Apr  2 18:30:00 filer1 kernel: R13: ffff8100778ffe78 R14: 0000000000000001 R15: 000000000040c62b
Apr  2 18:30:00 filer1 kernel: FS:  00002abf4f2526e0(0000) GS:ffff81007ff39e40(0000) knlGS:0000000000000000
Apr  2 18:30:00 filer1 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Apr  2 18:30:00 filer1 kernel: CR2: 00002b3251f83000 CR3: 000000003266e000 CR4: 00000000000006e0
Apr  2 18:30:00 filer1 kernel: Process rpc.mountd (pid: 891, threadinfo ffff8100778fe000, task ffff810077fd03c0)
Apr  2 18:30:00 filer1 kernel: Stack:  ffff810079cca840 0000000000000000 000000004611368f ffffffff8053618d
Apr  2 18:30:00 filer1 kernel:  0000000000000000 ffffffff80533a01 ffff8100778ffe77 0000000002cd00e3
Apr  2 18:30:00 filer1 kernel:  ffff8100778ffe92 0000000a000081a4 0000000000000000 000000000000002f
Apr  2 18:30:00 filer1 kernel: Call Trace:
Apr  2 18:30:00 filer1 kernel:  [<ffffffff8053618d>] cache_flush+0xd/0x23
Apr  2 18:30:00 filer1 kernel:  [<ffffffff80533a01>] ip_map_parse+0x17c/0x18e
Apr  2 18:30:00 filer1 kernel:  [<ffffffff805367fe>] cache_write+0x90/0xac
Apr  2 18:30:00 filer1 kernel:  [<ffffffff80213f40>] vfs_write+0xaf/0x151
Apr  2 18:30:00 filer1 kernel:  [<ffffffff802148d0>] sys_write+0x45/0x6e
Apr  2 18:30:00 filer1 kernel:  [<ffffffff8025411e>] system_call+0x7e/0x83
Apr  2 18:30:00 filer1 kernel:
Apr  2 18:30:00 filer1 kernel:
Apr  2 18:30:00 filer1 kernel: Code: 48 8b 43 08 48 39 82 80 00 00 00 7e 0a 48 ff c0 48 89 82 80
Apr  2 18:30:00 filer1 kernel: RIP  [<ffffffff8053606f>] cache_clean+0x11e/0x22f
Apr  2 18:30:00 filer1 kernel:  RSP <ffff8100778ffe18>

And this one (only call trace included):

Mar 17 17:08:34 filer1 kernel: general protection fault: 0000 [1] SMP
Mar 17 17:08:34 filer1 kernel: CPU 3
Mar 17 17:08:34 filer1 kernel: Pid: 13, comm: events/3 Not tainted 2.6.20 #5
Mar 17 17:08:34 filer1 kernel: RIP: 0010:[<ffffffff8051f81b>]  [<ffffffff8051f81b>] cache_clean+0x11e/0x22f
Mar 17 17:08:34 filer1 kernel: RSP: 0018:ffff81007fe11e40  EFLAGS: 00010202
Mar 17 17:08:34 filer1 kernel: RAX: ffffffff80660080 RBX: 2d305f742d315f61 RCX: ffffffff80660080
Mar 17 17:08:34 filer1 kernel: RDX: ffffffff80660080 RSI: 0000000000000077 RDI: ffffffff80660098
Mar 17 17:08:34 filer1 kernel: RBP: ffff81001e9b5e40 R08: ffff81007fe10000 R09: 0000000000000286
Mar 17 17:08:34 filer1 kernel: R10: 0000000000000286 R11: ffffffff80240992 R12: 0000000000000000
Mar 17 17:08:34 filer1 kernel: R13: 0000000000000206 R14: ffffffff8051fab1 R15: 0000000000000000
Mar 17 17:08:34 filer1 kernel: FS:  0000000000000000(0000) GS:ffff810002f817c0(0000) knlGS:0000000000000000
Mar 17 17:08:34 filer1 kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Mar 17 17:08:34 filer1 kernel: CR2: 00002b6b14d8bbc0 CR3: 000000007b57c000 CR4: 00000000000006e0
Mar 17 17:08:34 filer1 kernel: Process events/3 (pid: 13, threadinfo ffff81007fe10000, task ffff81007fe3ef00)
Mar 17 17:08:34 filer1 kernel: Stack:  0000000000000005 ffff810002f815c0 ffffffff806602c0 ffffffff8051fabc
Mar 17 17:08:34 filer1 kernel:  ffffffff806602c8 ffffffff80244799 ffff810002f815c0 ffffffff802415b3
Mar 17 17:08:34 filer1 kernel:  ffff81007ff23d60 00000000fffffffc 0000000000000000 ffffffff802416d7
Mar 17 17:08:34 filer1 kernel: Call Trace:
Mar 17 17:08:34 filer1 kernel:  [<ffffffff8051fabc>] do_cache_clean+0xb/0x38
Mar 17 17:08:34 filer1 kernel:  [<ffffffff80244799>] run_workqueue+0x9f/0x13b
Mar 17 17:08:34 filer1 kernel:  [<ffffffff802415b3>] worker_thread+0x0/0x15a
Mar 17 17:08:34 filer1 kernel:  [<ffffffff802416d7>] worker_thread+0x124/0x15a
Mar 17 17:08:34 filer1 kernel:  [<ffffffff802731fd>] default_wake_function+0x0/0xe
Mar 17 17:08:34 filer1 kernel:  [<ffffffff802731fd>] default_wake_function+0x0/0xe
Mar 17 17:08:35 filer1 kernel:  [<ffffffff8022d1c6>] kthread+0xc8/0xf1
Mar 17 17:08:35 filer1 kernel:  [<ffffffff80252ec8>] child_rip+0xa/0x12
Mar 17 17:08:35 filer1 kernel:  [<ffffffff802838e8>] kthread_create+0x6a/0x147
Mar 17 17:08:35 filer1 kernel:  [<ffffffff8022d0fe>] kthread+0x0/0xf1
Mar 17 17:08:35 filer1 kernel:  [<ffffffff80252ebe>] child_rip+0x0/0x12
Mar 17 17:08:35 filer1 kernel:
Mar 17 17:08:35 filer1 kernel:
Mar 17 17:08:35 filer1 kernel: Code: 48 8b 43 08 48 39 82 80 00 00 00 7e 0a 48 ff c0 48 89 82 80
Mar 17 17:08:35 filer1 kernel: RIP  [<ffffffff8051f81b>] cache_clean+0x11e/0x22f
Mar 17 17:08:35 filer1 kernel:  RSP <ffff81007fe11e40>

Notice the same end related to the "cache_clean" function, maybe this means something for anyone ?

Additionally, I noticed that today the rpc.mountd process (from nfs-utils 1.1.0-rc1) is 350M big as the "ps" command shows :
root      2091  0.0 17.2 364164 354664 ?     Ss   Apr03   0:27 rpc.mountd --no-nfs-version 2

And it seems to very slowly grow up in memory. /proc/<pid of rpc.mountd>/smaps shows that 350M are taken by the "heap" :
00514000-15eea000 rw-p 00514000 00:00 0                                  [heap]
Size:            354136 kB
Rss:             353756 kB
Shared_Clean:         0 kB
Shared_Dirty:         0 kB
Private_Clean:        0 kB
Private_Dirty:   353756 kB

I don't know if these 2 cases are related, but as I got a crash with rpc.mountd (see crash from apr. 2nd), I include it too.

Any thoughts ?

Regards,

Gabriel


-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* mountd randomly crash and panic the server
@ 2007-04-11 15:12 Gabriel Barazer
  0 siblings, 0 replies; 10+ messages in thread
From: Gabriel Barazer @ 2007-04-11 15:12 UTC (permalink / raw)
  To: nfs

Hello there,

I recently got a problem probably due to rpc.mountd (from nfs-utils
1.0.7). I have a file server running linux 2.6.20.3 and it crashed 4
times since march, 1st under a quite heavy load with the following
kernel errors in the syslog (which I have read after the crash, because
kernel probably panic()ed ano network access was possible)

Mar 19 19:52:57 filer1 kernel: general protection fault: 0000 [1] SMP
Mar 19 19:52:57 filer1 kernel: CPU 3
Mar 19 19:52:57 filer1 kernel: Modules linked in:
Mar 19 19:52:57 filer1 kernel: Pid: 2238, comm: rpc.mountd Not tainted 
2.6.20.3 #1
Mar 19 19:52:57 filer1 kernel: RIP: 0010:[<ffffffff8053606f>] 
[<ffffffff8053606f>] cache_clean+0x11e/0x22f
Mar 19 19:52:57 filer1 kernel: RSP: 0018:ffff810068da7b68  EFLAGS: 00010206
Mar 19 19:52:57 filer1 kernel: RAX: ffffffff806a0e80 RBX: 
0001e71926010009 RCX: ffffffff806a0e80
Mar 19 19:52:57 filer1 kernel: RDX: ffffffff806a0e80 RSI: 
0000000000000071 RDI: ffffffff806a0e98
Mar 19 19:52:57 filer1 kernel: RBP: ffff8100090ad7c0 R08: 
0000000000000000 R09: 000000000000006b
Mar 19 19:52:57 filer1 kernel: R10: 0000000000000000 R11: 
0000000000000005 R12: 0000000000000000
Mar 19 19:52:57 filer1 kernel: R13: ffff810068da7e08 R14: 
0000000000000006 R15: ffff81001669c000
Mar 19 19:52:57 filer1 kernel: FS:  00002ab0119c36e0(0000) 
GS:ffff81007ff397c0(0000) knlGS:0000000000000000
Mar 19 19:52:57 filer1 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 
000000008005003b
Mar 19 19:52:57 filer1 kernel: CR2: 00002ab011474000 CR3: 
000000007e228000 CR4: 00000000000006e0
Mar 19 19:52:57 filer1 kernel: Process rpc.mountd (pid: 2238, threadinfo 
ffff810068da6000, task ffff81006b5a6400)
Mar 19 19:52:57 filer1 kernel: Stack:  ffff810050a9ae40 0000000000000000 
ffff810068da7da8 ffffffff8053618d
Mar 19 19:52:57 filer1 kernel:  ffff810050a9ae40 ffffffff803489f8 
00000000fffffffa ffff810027312840
Mar 19 19:52:57 filer1 kernel:  ffff810025bb2400 ffff810025bb2400 
ffff810068da7c69 ffff810068da7caa
Mar 19 19:52:57 filer1 kernel: Call Trace:
Mar 19 19:52:57 filer1 kernel:  [<ffffffff8053618d>] cache_flush+0xd/0x23
Mar 19 19:52:57 filer1 kernel:  [<ffffffff803489f8>] 
svc_export_parse+0x5c2/0x650
Mar 19 19:52:57 filer1 kernel:  [<ffffffff8020d338>] 
current_fs_time+0x4d/0x52
Mar 19 19:52:57 filer1 kernel:  [<ffffffff803dc1a5>] 
elv_next_request+0x40/0x14f
Mar 19 19:52:57 filer1 kernel:  [<ffffffff80224ccd>] sync_page+0x0/0x41
Mar 19 19:52:57 filer1 kernel:  [<ffffffff805367fe>] cache_write+0x90/0xac
Mar 19 19:52:57 filer1 kernel:  [<ffffffff80213f40>] vfs_write+0xaf/0x151
Mar 19 19:52:57 filer1 kernel:  [<ffffffff802148d0>] sys_write+0x45/0x6e
Mar 19 19:52:57 filer1 kernel:  [<ffffffff8025411e>] system_call+0x7e/0x83
Mar 19 19:52:57 filer1 kernel:
Mar 19 19:52:57 filer1 kernel:
Mar 19 19:52:57 filer1 kernel: Code: 48 8b 43 08 48 39 82 80 00 00 00 7e 
0a 48 ff c0 48 89 82 80
Mar 19 19:52:57 filer1 kernel: RIP  [<ffffffff8053606f>] 
cache_clean+0x11e/0x22f
Mar 19 19:52:57 filer1 kernel:  RSP <ffff810068da7b68>

The three other times , there was this error :

Apr  2 18:30:00 filer1 kernel: general protection fault: 0000 [1] SMP
Apr  2 18:30:00 filer1 kernel: CPU 2
Apr  2 18:30:00 filer1 kernel: Modules linked in:
Apr  2 18:30:00 filer1 kernel: Pid: 891, comm: rpc.mountd Not tainted 
2.6.20.3 #1
Apr  2 18:30:00 filer1 kernel: RIP: 0010:[<ffffffff8053606f>] 
[<ffffffff8053606f>] cache_clean+0x11e/0x22f
Apr  2 18:30:00 filer1 kernel: RSP: 0018:ffff8100778ffe18  EFLAGS: 00010202
Apr  2 18:30:00 filer1 kernel: RAX: ffffffff806a0e80 RBX: 
2d305f742d315f61 RCX: ffffffff806a0e80
Apr  2 18:30:00 filer1 kernel: RDX: ffffffff806a0e80 RSI: 
0000000000000070 RDI: ffffffff806a0e98
Apr  2 18:30:00 filer1 kernel: RBP: ffff810008664c40 R08: 
ffffffff807783a0 R09: 00000000000000fc
Apr  2 18:30:00 filer1 kernel: R10: 000000000000000b R11: 
0000000000000027 R12: 0000000000000000
Apr  2 18:30:00 filer1 kernel: R13: ffff8100778ffe78 R14: 
0000000000000001 R15: 000000000040c62b
Apr  2 18:30:00 filer1 kernel: FS:  00002abf4f2526e0(0000) 
GS:ffff81007ff39e40(0000) knlGS:0000000000000000
Apr  2 18:30:00 filer1 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 
000000008005003b
Apr  2 18:30:00 filer1 kernel: CR2: 00002b3251f83000 CR3: 
000000003266e000 CR4: 00000000000006e0
Apr  2 18:30:00 filer1 kernel: Process rpc.mountd (pid: 891, threadinfo 
ffff8100778fe000, task ffff810077fd03c0)
Apr  2 18:30:00 filer1 kernel: Stack:  ffff810079cca840 0000000000000000 
000000004611368f ffffffff8053618d
Apr  2 18:30:00 filer1 kernel:  0000000000000000 ffffffff80533a01 
ffff8100778ffe77 0000000002cd00e3
Apr  2 18:30:00 filer1 kernel:  ffff8100778ffe92 0000000a000081a4 
0000000000000000 000000000000002f
Apr  2 18:30:00 filer1 kernel: Call Trace:
Apr  2 18:30:00 filer1 kernel:  [<ffffffff8053618d>] cache_flush+0xd/0x23
Apr  2 18:30:00 filer1 kernel:  [<ffffffff80533a01>] 
ip_map_parse+0x17c/0x18e
Apr  2 18:30:00 filer1 kernel:  [<ffffffff805367fe>] cache_write+0x90/0xac
Apr  2 18:30:00 filer1 kernel:  [<ffffffff80213f40>] vfs_write+0xaf/0x151
Apr  2 18:30:00 filer1 kernel:  [<ffffffff802148d0>] sys_write+0x45/0x6e
Apr  2 18:30:00 filer1 kernel:  [<ffffffff8025411e>] system_call+0x7e/0x83
Apr  2 18:30:00 filer1 kernel:
Apr  2 18:30:00 filer1 kernel:
Apr  2 18:30:00 filer1 kernel: Code: 48 8b 43 08 48 39 82 80 00 00 00 7e 
0a 48 ff c0 48 89 82 80
Apr  2 18:30:00 filer1 kernel: RIP  [<ffffffff8053606f>] 
cache_clean+0x11e/0x22f
Apr  2 18:30:00 filer1 kernel:  RSP <ffff8100778ffe18>

And this one (only call trace included):

Mar 17 17:08:34 filer1 kernel: general protection fault: 0000 [1] SMP
Mar 17 17:08:34 filer1 kernel: CPU 3
Mar 17 17:08:34 filer1 kernel: Pid: 13, comm: events/3 Not tainted 2.6.20 #5
Mar 17 17:08:34 filer1 kernel: RIP: 0010:[<ffffffff8051f81b>] 
[<ffffffff8051f81b>] cache_clean+0x11e/0x22f
Mar 17 17:08:34 filer1 kernel: RSP: 0018:ffff81007fe11e40  EFLAGS: 00010202
Mar 17 17:08:34 filer1 kernel: RAX: ffffffff80660080 RBX: 
2d305f742d315f61 RCX: ffffffff80660080
Mar 17 17:08:34 filer1 kernel: RDX: ffffffff80660080 RSI: 
0000000000000077 RDI: ffffffff80660098
Mar 17 17:08:34 filer1 kernel: RBP: ffff81001e9b5e40 R08: 
ffff81007fe10000 R09: 0000000000000286
Mar 17 17:08:34 filer1 kernel: R10: 0000000000000286 R11: 
ffffffff80240992 R12: 0000000000000000
Mar 17 17:08:34 filer1 kernel: R13: 0000000000000206 R14: 
ffffffff8051fab1 R15: 0000000000000000
Mar 17 17:08:34 filer1 kernel: FS:  0000000000000000(0000) 
GS:ffff810002f817c0(0000) knlGS:0000000000000000
Mar 17 17:08:34 filer1 kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 
000000008005003b
Mar 17 17:08:34 filer1 kernel: CR2: 00002b6b14d8bbc0 CR3: 
000000007b57c000 CR4: 00000000000006e0
Mar 17 17:08:34 filer1 kernel: Process events/3 (pid: 13, threadinfo 
ffff81007fe10000, task ffff81007fe3ef00)
Mar 17 17:08:34 filer1 kernel: Stack:  0000000000000005 ffff810002f815c0 
ffffffff806602c0 ffffffff8051fabc
Mar 17 17:08:34 filer1 kernel:  ffffffff806602c8 ffffffff80244799 
ffff810002f815c0 ffffffff802415b3
Mar 17 17:08:34 filer1 kernel:  ffff81007ff23d60 00000000fffffffc 
0000000000000000 ffffffff802416d7
Mar 17 17:08:34 filer1 kernel: Call Trace:
Mar 17 17:08:34 filer1 kernel:  [<ffffffff8051fabc>] do_cache_clean+0xb/0x38
Mar 17 17:08:34 filer1 kernel:  [<ffffffff80244799>] 
run_workqueue+0x9f/0x13b
Mar 17 17:08:34 filer1 kernel:  [<ffffffff802415b3>] worker_thread+0x0/0x15a
Mar 17 17:08:34 filer1 kernel:  [<ffffffff802416d7>] 
worker_thread+0x124/0x15a
Mar 17 17:08:34 filer1 kernel:  [<ffffffff802731fd>] 
default_wake_function+0x0/0xe
Mar 17 17:08:34 filer1 kernel:  [<ffffffff802731fd>] 
default_wake_function+0x0/0xe
Mar 17 17:08:35 filer1 kernel:  [<ffffffff8022d1c6>] kthread+0xc8/0xf1
Mar 17 17:08:35 filer1 kernel:  [<ffffffff80252ec8>] child_rip+0xa/0x12
Mar 17 17:08:35 filer1 kernel:  [<ffffffff802838e8>] 
kthread_create+0x6a/0x147
Mar 17 17:08:35 filer1 kernel:  [<ffffffff8022d0fe>] kthread+0x0/0xf1
Mar 17 17:08:35 filer1 kernel:  [<ffffffff80252ebe>] child_rip+0x0/0x12
Mar 17 17:08:35 filer1 kernel:
Mar 17 17:08:35 filer1 kernel:
Mar 17 17:08:35 filer1 kernel: Code: 48 8b 43 08 48 39 82 80 00 00 00 7e 
0a 48 ff c0 48 89 82 80
Mar 17 17:08:35 filer1 kernel: RIP  [<ffffffff8051f81b>] 
cache_clean+0x11e/0x22f
Mar 17 17:08:35 filer1 kernel:  RSP <ffff81007fe11e40>

Notice the same end related to the "cache_clean" function, maybe this 
means something for anyone ?

Additionally, I noticed that today the rpc.mountd process (from 
nfs-utils 1.1.0-rc1) is 350M big as the "ps" command shows :
root      2091  0.0 17.2 364164 354664 ?     Ss   Apr03   0:27 
rpc.mountd --no-nfs-version 2

And it seems to very slowly grow up in memory. /proc/<pid of 
rpc.mountd>/smaps shows that 350M are taken by the "heap" :
00514000-15eea000 rw-p 00514000 00:00 0 
  [heap]
Size:            354136 kB
Rss:             353756 kB
Shared_Clean:         0 kB
Shared_Dirty:         0 kB
Private_Clean:        0 kB
Private_Dirty:   353756 kB

I don't know if these 2 cases are related, but as I got a crash with 
rpc.mountd (see crash from apr. 2nd), I include it too.

Any thoughts ?

Regards,

Gabriel



-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: mountd randomly crash and panic the server
  2007-04-11 15:11 Gabriel Barazer
@ 2007-04-16  9:26 ` Gabriel Barazer
  2007-04-16 10:47   ` Neil Brown
  0 siblings, 1 reply; 10+ messages in thread
From: Gabriel Barazer @ 2007-04-16  9:26 UTC (permalink / raw)
  To: nfs

Hello,

I once again got this crash this night. The call trace always end with 
the same function (cache_clean), which makes me think that there is 
maybe a race condition in it (happens randomly, particularly on quite 
heavy load).

Apr 16 00:29:45 filer1 kernel:  [<ffffffff8053ae60>]
cache_flush+0xd/0x23
Apr 16 00:29:45 filer1 kernel:  [<ffffffff805386e0>] 
ip_map_parse+0x17c/0x18e
Apr 16 00:29:45 filer1 kernel:  [<ffffffff8053b4d1>] cache_write+0x90/0xac
Apr 16 00:29:45 filer1 kernel:  [<ffffffff80213f77>] vfs_write+0xaf/0x151
Apr 16 00:29:45 filer1 kernel:  [<ffffffff80214907>] sys_write+0x45/0x6e
Apr 16 00:29:45 filer1 kernel:  [<ffffffff8025311e>] system_call+0x7e/0x83
Apr 16 00:29:45 filer1 kernel:
Apr 16 00:29:45 filer1 kernel:
Apr 16 00:29:45 filer1 kernel: Code: 48 8b 43 08 48 39 82 80 00 00 00 7e 
0a 48 ff c0 48 89 82 80
Apr 16 00:29:45 filer1 kernel: RIP  [<ffffffff8053ad42>] 
cache_clean+0x11e/0x22f

Is this related to the kernel cache_clean function in net/sunrpc/cache.c ?

Gabriel

On 04/11/2007 17:11:58 +0200, Gabriel Barazer <gabriel@barazer.net> wrote:

> Hello there,
> 
> I recently got a problem probably due to rpc.mountd (from nfs-utils 
> 1.0.7). I have a file server running linux 2.6.20.3 and it crashed 4 
> times since march, 1st under a quite heavy load with the following 
> kernel errors in the syslog (which I have read after the crash, because 
> kernel probably panic()ed ano network access was possible)
> 
> Mar 19 19:52:57 filer1 kernel: general protection fault: 0000 [1] SMP
> Mar 19 19:52:57 filer1 kernel: CPU 3
> Mar 19 19:52:57 filer1 kernel: Modules linked in:
> Mar 19 19:52:57 filer1 kernel: Pid: 2238, comm: rpc.mountd Not tainted 2.6.20.3 #1
> Mar 19 19:52:57 filer1 kernel: RIP: 0010:[<ffffffff8053606f>]  [<ffffffff8053606f>] cache_clean+0x11e/0x22f
> Mar 19 19:52:57 filer1 kernel: RSP: 0018:ffff810068da7b68  EFLAGS: 00010206
> Mar 19 19:52:57 filer1 kernel: RAX: ffffffff806a0e80 RBX: 0001e71926010009 RCX: ffffffff806a0e80
> Mar 19 19:52:57 filer1 kernel: RDX: ffffffff806a0e80 RSI: 0000000000000071 RDI: ffffffff806a0e98
> Mar 19 19:52:57 filer1 kernel: RBP: ffff8100090ad7c0 R08: 0000000000000000 R09: 000000000000006b
> Mar 19 19:52:57 filer1 kernel: R10: 0000000000000000 R11: 0000000000000005 R12: 0000000000000000
> Mar 19 19:52:57 filer1 kernel: R13: ffff810068da7e08 R14: 0000000000000006 R15: ffff81001669c000
> Mar 19 19:52:57 filer1 kernel: FS:  00002ab0119c36e0(0000) GS:ffff81007ff397c0(0000) knlGS:0000000000000000
> Mar 19 19:52:57 filer1 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> Mar 19 19:52:57 filer1 kernel: CR2: 00002ab011474000 CR3: 000000007e228000 CR4: 00000000000006e0
> Mar 19 19:52:57 filer1 kernel: Process rpc.mountd (pid: 2238, threadinfo ffff810068da6000, task ffff81006b5a6400)
> Mar 19 19:52:57 filer1 kernel: Stack:  ffff810050a9ae40 0000000000000000 ffff810068da7da8 ffffffff8053618d
> Mar 19 19:52:57 filer1 kernel:  ffff810050a9ae40 ffffffff803489f8 00000000fffffffa ffff810027312840
> Mar 19 19:52:57 filer1 kernel:  ffff810025bb2400 ffff810025bb2400 ffff810068da7c69 ffff810068da7caa
> Mar 19 19:52:57 filer1 kernel: Call Trace:
> Mar 19 19:52:57 filer1 kernel:  [<ffffffff8053618d>] cache_flush+0xd/0x23
> Mar 19 19:52:57 filer1 kernel:  [<ffffffff803489f8>] svc_export_parse+0x5c2/0x650
> Mar 19 19:52:57 filer1 kernel:  [<ffffffff8020d338>] current_fs_time+0x4d/0x52
> Mar 19 19:52:57 filer1 kernel:  [<ffffffff803dc1a5>] elv_next_request+0x40/0x14f
> Mar 19 19:52:57 filer1 kernel:  [<ffffffff80224ccd>] sync_page+0x0/0x41
> Mar 19 19:52:57 filer1 kernel:  [<ffffffff805367fe>] cache_write+0x90/0xac
> Mar 19 19:52:57 filer1 kernel:  [<ffffffff80213f40>] vfs_write+0xaf/0x151
> Mar 19 19:52:57 filer1 kernel:  [<ffffffff802148d0>] sys_write+0x45/0x6e
> Mar 19 19:52:57 filer1 kernel:  [<ffffffff8025411e>] system_call+0x7e/0x83
> Mar 19 19:52:57 filer1 kernel:
> Mar 19 19:52:57 filer1 kernel:
> Mar 19 19:52:57 filer1 kernel: Code: 48 8b 43 08 48 39 82 80 00 00 00 7e 0a 48 ff c0 48 89 82 80
> Mar 19 19:52:57 filer1 kernel: RIP  [<ffffffff8053606f>] cache_clean+0x11e/0x22f
> Mar 19 19:52:57 filer1 kernel:  RSP <ffff810068da7b68>
> 
> The three other times , there was this error :
> 
> Apr  2 18:30:00 filer1 kernel: general protection fault: 0000 [1] SMP
> Apr  2 18:30:00 filer1 kernel: CPU 2
> Apr  2 18:30:00 filer1 kernel: Modules linked in:
> Apr  2 18:30:00 filer1 kernel: Pid: 891, comm: rpc.mountd Not tainted 2.6.20.3 #1
> Apr  2 18:30:00 filer1 kernel: RIP: 0010:[<ffffffff8053606f>]  [<ffffffff8053606f>] cache_clean+0x11e/0x22f
> Apr  2 18:30:00 filer1 kernel: RSP: 0018:ffff8100778ffe18  EFLAGS: 00010202
> Apr  2 18:30:00 filer1 kernel: RAX: ffffffff806a0e80 RBX: 2d305f742d315f61 RCX: ffffffff806a0e80
> Apr  2 18:30:00 filer1 kernel: RDX: ffffffff806a0e80 RSI: 0000000000000070 RDI: ffffffff806a0e98
> Apr  2 18:30:00 filer1 kernel: RBP: ffff810008664c40 R08: ffffffff807783a0 R09: 00000000000000fc
> Apr  2 18:30:00 filer1 kernel: R10: 000000000000000b R11: 0000000000000027 R12: 0000000000000000
> Apr  2 18:30:00 filer1 kernel: R13: ffff8100778ffe78 R14: 0000000000000001 R15: 000000000040c62b
> Apr  2 18:30:00 filer1 kernel: FS:  00002abf4f2526e0(0000) GS:ffff81007ff39e40(0000) knlGS:0000000000000000
> Apr  2 18:30:00 filer1 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> Apr  2 18:30:00 filer1 kernel: CR2: 00002b3251f83000 CR3: 000000003266e000 CR4: 00000000000006e0
> Apr  2 18:30:00 filer1 kernel: Process rpc.mountd (pid: 891, threadinfo ffff8100778fe000, task ffff810077fd03c0)
> Apr  2 18:30:00 filer1 kernel: Stack:  ffff810079cca840 0000000000000000 000000004611368f ffffffff8053618d
> Apr  2 18:30:00 filer1 kernel:  0000000000000000 ffffffff80533a01 ffff8100778ffe77 0000000002cd00e3
> Apr  2 18:30:00 filer1 kernel:  ffff8100778ffe92 0000000a000081a4 0000000000000000 000000000000002f
> Apr  2 18:30:00 filer1 kernel: Call Trace:
> Apr  2 18:30:00 filer1 kernel:  [<ffffffff8053618d>] cache_flush+0xd/0x23
> Apr  2 18:30:00 filer1 kernel:  [<ffffffff80533a01>] ip_map_parse+0x17c/0x18e
> Apr  2 18:30:00 filer1 kernel:  [<ffffffff805367fe>] cache_write+0x90/0xac
> Apr  2 18:30:00 filer1 kernel:  [<ffffffff80213f40>] vfs_write+0xaf/0x151
> Apr  2 18:30:00 filer1 kernel:  [<ffffffff802148d0>] sys_write+0x45/0x6e
> Apr  2 18:30:00 filer1 kernel:  [<ffffffff8025411e>] system_call+0x7e/0x83
> Apr  2 18:30:00 filer1 kernel:
> Apr  2 18:30:00 filer1 kernel:
> Apr  2 18:30:00 filer1 kernel: Code: 48 8b 43 08 48 39 82 80 00 00 00 7e 0a 48 ff c0 48 89 82 80
> Apr  2 18:30:00 filer1 kernel: RIP  [<ffffffff8053606f>] cache_clean+0x11e/0x22f
> Apr  2 18:30:00 filer1 kernel:  RSP <ffff8100778ffe18>
> 
> And this one (only call trace included):
> 
> Mar 17 17:08:34 filer1 kernel: general protection fault: 0000 [1] SMP
> Mar 17 17:08:34 filer1 kernel: CPU 3
> Mar 17 17:08:34 filer1 kernel: Pid: 13, comm: events/3 Not tainted 2.6.20 #5
> Mar 17 17:08:34 filer1 kernel: RIP: 0010:[<ffffffff8051f81b>]  [<ffffffff8051f81b>] cache_clean+0x11e/0x22f
> Mar 17 17:08:34 filer1 kernel: RSP: 0018:ffff81007fe11e40  EFLAGS: 00010202
> Mar 17 17:08:34 filer1 kernel: RAX: ffffffff80660080 RBX: 2d305f742d315f61 RCX: ffffffff80660080
> Mar 17 17:08:34 filer1 kernel: RDX: ffffffff80660080 RSI: 0000000000000077 RDI: ffffffff80660098
> Mar 17 17:08:34 filer1 kernel: RBP: ffff81001e9b5e40 R08: ffff81007fe10000 R09: 0000000000000286
> Mar 17 17:08:34 filer1 kernel: R10: 0000000000000286 R11: ffffffff80240992 R12: 0000000000000000
> Mar 17 17:08:34 filer1 kernel: R13: 0000000000000206 R14: ffffffff8051fab1 R15: 0000000000000000
> Mar 17 17:08:34 filer1 kernel: FS:  0000000000000000(0000) GS:ffff810002f817c0(0000) knlGS:0000000000000000
> Mar 17 17:08:34 filer1 kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> Mar 17 17:08:34 filer1 kernel: CR2: 00002b6b14d8bbc0 CR3: 000000007b57c000 CR4: 00000000000006e0
> Mar 17 17:08:34 filer1 kernel: Process events/3 (pid: 13, threadinfo ffff81007fe10000, task ffff81007fe3ef00)
> Mar 17 17:08:34 filer1 kernel: Stack:  0000000000000005 ffff810002f815c0 ffffffff806602c0 ffffffff8051fabc
> Mar 17 17:08:34 filer1 kernel:  ffffffff806602c8 ffffffff80244799 ffff810002f815c0 ffffffff802415b3
> Mar 17 17:08:34 filer1 kernel:  ffff81007ff23d60 00000000fffffffc 0000000000000000 ffffffff802416d7
> Mar 17 17:08:34 filer1 kernel: Call Trace:
> Mar 17 17:08:34 filer1 kernel:  [<ffffffff8051fabc>] do_cache_clean+0xb/0x38
> Mar 17 17:08:34 filer1 kernel:  [<ffffffff80244799>] run_workqueue+0x9f/0x13b
> Mar 17 17:08:34 filer1 kernel:  [<ffffffff802415b3>] worker_thread+0x0/0x15a
> Mar 17 17:08:34 filer1 kernel:  [<ffffffff802416d7>] worker_thread+0x124/0x15a
> Mar 17 17:08:34 filer1 kernel:  [<ffffffff802731fd>] default_wake_function+0x0/0xe
> Mar 17 17:08:34 filer1 kernel:  [<ffffffff802731fd>] default_wake_function+0x0/0xe
> Mar 17 17:08:35 filer1 kernel:  [<ffffffff8022d1c6>] kthread+0xc8/0xf1
> Mar 17 17:08:35 filer1 kernel:  [<ffffffff80252ec8>] child_rip+0xa/0x12
> Mar 17 17:08:35 filer1 kernel:  [<ffffffff802838e8>] kthread_create+0x6a/0x147
> Mar 17 17:08:35 filer1 kernel:  [<ffffffff8022d0fe>] kthread+0x0/0xf1
> Mar 17 17:08:35 filer1 kernel:  [<ffffffff80252ebe>] child_rip+0x0/0x12
> Mar 17 17:08:35 filer1 kernel:
> Mar 17 17:08:35 filer1 kernel:
> Mar 17 17:08:35 filer1 kernel: Code: 48 8b 43 08 48 39 82 80 00 00 00 7e 0a 48 ff c0 48 89 82 80
> Mar 17 17:08:35 filer1 kernel: RIP  [<ffffffff8051f81b>] cache_clean+0x11e/0x22f
> Mar 17 17:08:35 filer1 kernel:  RSP <ffff81007fe11e40>
> 
> Notice the same end related to the "cache_clean" function, maybe this means something for anyone ?
> 
> Additionally, I noticed that today the rpc.mountd process (from nfs-utils 1.1.0-rc1) is 350M big as the "ps" command shows :
> root      2091  0.0 17.2 364164 354664 ?     Ss   Apr03   0:27 rpc.mountd --no-nfs-version 2
> 
> And it seems to very slowly grow up in memory. /proc/<pid of rpc.mountd>/smaps shows that 350M are taken by the "heap" :
> 00514000-15eea000 rw-p 00514000 00:00 0                                  [heap]
> Size:            354136 kB
> Rss:             353756 kB
> Shared_Clean:         0 kB
> Shared_Dirty:         0 kB
> Private_Clean:        0 kB
> Private_Dirty:   353756 kB
> 
> I don't know if these 2 cases are related, but as I got a crash with rpc.mountd (see crash from apr. 2nd), I include it too.
> 
> Any thoughts ?
> 
> Regards,
> 
> Gabriel
> 
> 
> -------------------------------------------------------------------------
> Take Surveys. Earn Cash. Influence the Future of IT
> Join SourceForge.net's Techsay panel and you'll get the chance to share your
> opinions on IT & business topics through brief surveys-and earn cash
> http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
> _______________________________________________
> NFS maillist  -  NFS@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nfs

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: mountd randomly crash and panic the server
  2007-04-16  9:26 ` Gabriel Barazer
@ 2007-04-16 10:47   ` Neil Brown
  2007-04-16 14:07     ` Gabriel Barazer
  0 siblings, 1 reply; 10+ messages in thread
From: Neil Brown @ 2007-04-16 10:47 UTC (permalink / raw)
  To: Gabriel Barazer; +Cc: nfs

On Monday April 16, gabriel@oxeva.fr wrote:
> Hello,
> 
> I once again got this crash this night. The call trace always end with 
> the same function (cache_clean), which makes me think that there is 
> maybe a race condition in it (happens randomly, particularly on quite 
> heavy load).
> 
> Apr 16 00:29:45 filer1 kernel:  [<ffffffff8053ae60>]
> cache_flush+0xd/0x23
> Apr 16 00:29:45 filer1 kernel:  [<ffffffff805386e0>] 
> ip_map_parse+0x17c/0x18e
> Apr 16 00:29:45 filer1 kernel:  [<ffffffff8053b4d1>] cache_write+0x90/0xac
> Apr 16 00:29:45 filer1 kernel:  [<ffffffff80213f77>] vfs_write+0xaf/0x151
> Apr 16 00:29:45 filer1 kernel:  [<ffffffff80214907>] sys_write+0x45/0x6e
> Apr 16 00:29:45 filer1 kernel:  [<ffffffff8025311e>] system_call+0x7e/0x83
> Apr 16 00:29:45 filer1 kernel:
> Apr 16 00:29:45 filer1 kernel:
> Apr 16 00:29:45 filer1 kernel: Code: 48 8b 43 08 48 39 82 80 00 00 00 7e 
> 0a 48 ff c0 48 89 82 80
> Apr 16 00:29:45 filer1 kernel: RIP  [<ffffffff8053ad42>] 
> cache_clean+0x11e/0x22f
> 
> Is this related to the kernel cache_clean function in net/sunrpc/cache.c ?

Yes.  It is crashing at:
		for (; ch; cp= & ch->next, ch= *cp) {
			if (current_detail->nextcheck > ch->expiry_time)
                                                        ^^^^^^
				current_detail->nextcheck = ch->expiry_time+1;
			if (ch->expiry_time >= get_seconds()

ch has a garbage value (0001e71926010009), presumable because a
previous cp had been corrupted, most likely by being freed while still
in use.

Bother.

I'll see if I can figure out what is happening.

Thanks for the report.

NeilBrown

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: mountd randomly crash and panic the server
  2007-04-16 10:47   ` Neil Brown
@ 2007-04-16 14:07     ` Gabriel Barazer
  2007-04-17  1:23       ` Neil Brown
  0 siblings, 1 reply; 10+ messages in thread
From: Gabriel Barazer @ 2007-04-16 14:07 UTC (permalink / raw)
  To: Neil Brown; +Cc: nfs

On 04/16/2007 12:47:32 +0200, Neil Brown <neilb@suse.de> wrote:

> On Monday April 16, gabriel@oxeva.fr wrote:

>>
>> Is this related to the kernel cache_clean function in net/sunrpc/cache.c ?
> 
> Yes.  It is crashing at:
> 		for (; ch; cp= & ch->next, ch= *cp) {
> 			if (current_detail->nextcheck > ch->expiry_time)
>                                                         ^^^^^^
> 				current_detail->nextcheck = ch->expiry_time+1;
> 			if (ch->expiry_time >= get_seconds()
> 
> ch has a garbage value (0001e71926010009), presumable because a
> previous cp had been corrupted, most likely by being freed while still
> in use.
> 

Maybe this can helps : I never encountered the problem with kernel 
2.6.18, maybe this can restrict the scope for where is this corruption 
happening ?

> Bother.
> 
> I'll see if I can figure out what is happening.
> 
> Thanks for the report.
> 
> NeilBrown

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: mountd randomly crash and panic the server
  2007-04-16 14:07     ` Gabriel Barazer
@ 2007-04-17  1:23       ` Neil Brown
  2007-04-30 23:46         ` Gabriel Barazer
  0 siblings, 1 reply; 10+ messages in thread
From: Neil Brown @ 2007-04-17  1:23 UTC (permalink / raw)
  To: Gabriel Barazer; +Cc: nfs

On Monday April 16, gabriel@oxeva.fr wrote:
> On 04/16/2007 12:47:32 +0200, Neil Brown <neilb@suse.de> wrote:
> 
> > On Monday April 16, gabriel@oxeva.fr wrote:
> 
> >>
> >> Is this related to the kernel cache_clean function in net/sunrpc/cache.c ?
> > 
> > Yes.  It is crashing at:
> > 		for (; ch; cp= & ch->next, ch= *cp) {
> > 			if (current_detail->nextcheck > ch->expiry_time)
> >                                                         ^^^^^^
> > 				current_detail->nextcheck = ch->expiry_time+1;
> > 			if (ch->expiry_time >= get_seconds()
> > 
> > ch has a garbage value (0001e71926010009), presumable because a
> > previous cp had been corrupted, most likely by being freed while still
> > in use.
> > 
> 
> Maybe this can helps : I never encountered the problem with kernel 
> 2.6.18, maybe this can restrict the scope for where is this corruption 
> happening ?

I did end up just hunting through the patches between 2.6.18 and
2.6.20 and I think I have found it (I had already looked at the code
where the bug is, but didn't see it the first time :-().

Could you try this patch and report the results please?
I'm very confident that the patch is correct and required, but I would
like to also know that it fixes your problem.

Thanks
NeilBrown

---------------------------------------------
Use a spinlock to protect sk_info_authunix

sk_info_authunix is not being protected properly so the object that
it points to can be cache_put twice, leading to corruption.

We borrow svsk->sk_defer_lock to provide the protection.  We should probably
rename that lock to have a more generic name.

Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./net/sunrpc/svcauth_unix.c |   21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

diff .prev/net/sunrpc/svcauth_unix.c ./net/sunrpc/svcauth_unix.c
--- .prev/net/sunrpc/svcauth_unix.c	2007-04-17 11:18:44.000000000 +1000
+++ ./net/sunrpc/svcauth_unix.c	2007-04-17 11:19:18.000000000 +1000
@@ -383,7 +383,10 @@ void svcauth_unix_purge(void)
 static inline struct ip_map *
 ip_map_cached_get(struct svc_rqst *rqstp)
 {
-	struct ip_map *ipm = rqstp->rq_sock->sk_info_authunix;
+	struct ip_map *ipm;
+	struct svc_sock *svsk = rqstp->rq_sock;
+	spin_lock(&svsk->sk_defer_lock);
+	ipm = svsk->sk_info_authunix;
 	if (ipm != NULL) {
 		if (!cache_valid(&ipm->h)) {
 			/*
@@ -391,12 +394,14 @@ ip_map_cached_get(struct svc_rqst *rqstp
 			 * remembered, e.g. by a second mount from the
 			 * same IP address.
 			 */
-			rqstp->rq_sock->sk_info_authunix = NULL;
+			svsk->sk_info_authunix = NULL;
+			spin_unlock(&svsk->sk_defer_lock);
 			cache_put(&ipm->h, &ip_map_cache);
 			return NULL;
 		}
 		cache_get(&ipm->h);
 	}
+	spin_unlock(&svsk->sk_defer_lock);
 	return ipm;
 }
 
@@ -405,9 +410,15 @@ ip_map_cached_put(struct svc_rqst *rqstp
 {
 	struct svc_sock *svsk = rqstp->rq_sock;
 
-	if (svsk->sk_sock->type == SOCK_STREAM && svsk->sk_info_authunix == NULL)
-		svsk->sk_info_authunix = ipm;	/* newly cached, keep the reference */
-	else
+	spin_lock(&svsk->sk_defer_lock);
+	if (svsk->sk_sock->type == SOCK_STREAM &&
+	    svsk->sk_info_authunix == NULL) {
+		/* newly cached, keep the reference */
+		svsk->sk_info_authunix = ipm;
+		ipm = NULL;
+	}
+	spin_unlock(&svsk->sk_defer_lock);
+	if (ipm)
 		cache_put(&ipm->h, &ip_map_cache);
 }
 

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: mountd randomly crash and panic the server
  2007-04-17  1:23       ` Neil Brown
@ 2007-04-30 23:46         ` Gabriel Barazer
  2007-04-30 23:59           ` Neil Brown
  0 siblings, 1 reply; 10+ messages in thread
From: Gabriel Barazer @ 2007-04-30 23:46 UTC (permalink / raw)
  To: Neil Brown; +Cc: nfs

On 04/17/2007 3:23:40 +0200, Neil Brown <neilb@suse.de> wrote:

> 
> I did end up just hunting through the patches between 2.6.18 and
> 2.6.20 and I think I have found it (I had already looked at the code
> where the bug is, but didn't see it the first time :-().
> 
> Could you try this patch and report the results please?
> I'm very confident that the patch is correct and required, but I would
> like to also know that it fixes your problem.
> 

By the way, I patched the 2.6.20.7 kernel since you provided the patch, 
and haven't encountered any stability problem, except some CFQ scheduler 
  related kernel panics, which are, I suppose, not linked to the NFS 
problem (I still mention it because unrelated but linked problems like 
this sometimes happens)

Thanks for the patch !

Gabriel Barazer

> Thanks
> NeilBrown
> 
> ---------------------------------------------
> Use a spinlock to protect sk_info_authunix
> 
> sk_info_authunix is not being protected properly so the object that
> it points to can be cache_put twice, leading to corruption.
> 
> We borrow svsk->sk_defer_lock to provide the protection.  We should probably
> rename that lock to have a more generic name.
> 
> Signed-off-by: Neil Brown <neilb@suse.de>
> 
> ### Diffstat output
>  ./net/sunrpc/svcauth_unix.c |   21 ++++++++++++++++-----
>  1 file changed, 16 insertions(+), 5 deletions(-)
> 
> diff .prev/net/sunrpc/svcauth_unix.c ./net/sunrpc/svcauth_unix.c
> --- .prev/net/sunrpc/svcauth_unix.c	2007-04-17 11:18:44.000000000 +1000
> +++ ./net/sunrpc/svcauth_unix.c	2007-04-17 11:19:18.000000000 +1000
> @@ -383,7 +383,10 @@ void svcauth_unix_purge(void)
>  static inline struct ip_map *
>  ip_map_cached_get(struct svc_rqst *rqstp)
>  {
> -	struct ip_map *ipm = rqstp->rq_sock->sk_info_authunix;
> +	struct ip_map *ipm;
> +	struct svc_sock *svsk = rqstp->rq_sock;
> +	spin_lock(&svsk->sk_defer_lock);
> +	ipm = svsk->sk_info_authunix;
>  	if (ipm != NULL) {
>  		if (!cache_valid(&ipm->h)) {
>  			/*
> @@ -391,12 +394,14 @@ ip_map_cached_get(struct svc_rqst *rqstp
>  			 * remembered, e.g. by a second mount from the
>  			 * same IP address.
>  			 */
> -			rqstp->rq_sock->sk_info_authunix = NULL;
> +			svsk->sk_info_authunix = NULL;
> +			spin_unlock(&svsk->sk_defer_lock);
>  			cache_put(&ipm->h, &ip_map_cache);
>  			return NULL;
>  		}
>  		cache_get(&ipm->h);
>  	}
> +	spin_unlock(&svsk->sk_defer_lock);
>  	return ipm;
>  }
>  
> @@ -405,9 +410,15 @@ ip_map_cached_put(struct svc_rqst *rqstp
>  {
>  	struct svc_sock *svsk = rqstp->rq_sock;
>  
> -	if (svsk->sk_sock->type == SOCK_STREAM && svsk->sk_info_authunix == NULL)
> -		svsk->sk_info_authunix = ipm;	/* newly cached, keep the reference */
> -	else
> +	spin_lock(&svsk->sk_defer_lock);
> +	if (svsk->sk_sock->type == SOCK_STREAM &&
> +	    svsk->sk_info_authunix == NULL) {
> +		/* newly cached, keep the reference */
> +		svsk->sk_info_authunix = ipm;
> +		ipm = NULL;
> +	}
> +	spin_unlock(&svsk->sk_defer_lock);
> +	if (ipm)
>  		cache_put(&ipm->h, &ip_map_cache);
>  }
>  

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: mountd randomly crash and panic the server
  2007-04-30 23:46         ` Gabriel Barazer
@ 2007-04-30 23:59           ` Neil Brown
  2007-05-02  9:03             ` Gabriel Barazer
  0 siblings, 1 reply; 10+ messages in thread
From: Neil Brown @ 2007-04-30 23:59 UTC (permalink / raw)
  To: Gabriel Barazer; +Cc: nfs

On Tuesday May 1, gabriel@oxeva.fr wrote:
> On 04/17/2007 3:23:40 +0200, Neil Brown <neilb@suse.de> wrote:
> 
> > 
> > I did end up just hunting through the patches between 2.6.18 and
> > 2.6.20 and I think I have found it (I had already looked at the code
> > where the bug is, but didn't see it the first time :-().
> > 
> > Could you try this patch and report the results please?
> > I'm very confident that the patch is correct and required, but I would
> > like to also know that it fixes your problem.
> > 
> 
> By the way, I patched the 2.6.20.7 kernel since you provided the patch, 
> and haven't encountered any stability problem, except some CFQ scheduler 

Cool, thanks for the report.

>   related kernel panics, which are, I suppose, not linked to the NFS 
> problem (I still mention it because unrelated but linked problems like 
> this sometimes happens)

CFQ panics ?  In cfq_dispatch_insert by any chance?

This patch from Jens Axboe might be what you need.

NeilBrown

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 9e37971..f965be7 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -532,6 +532,11 @@ static void cfq_add_rq_rb(struct request *rq)
 
 	if (!cfq_cfqq_on_rr(cfqq))
 		cfq_add_cfqq_rr(cfqd, cfqq);
+
+	/*
+	 * check if this request is a better next-serve candidate
+	 */
+	cfqq->next_rq = cfq_choose_req(cfqd, cfqq->next_rq, rq);
 }
 
 static inline void
@@ -1639,12 +1644,6 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 		cfqq->meta_pending++;
 
 	/*
-	 * check if this request is a better next-serve candidate)) {
-	 */
-	cfqq->next_rq = cfq_choose_req(cfqd, cfqq->next_rq, rq);
-	BUG_ON(!cfqq->next_rq);
-
-	/*
 	 * we never wait for an async request and we don't allow preemption
 	 * of an async request. so just return early
 	 */


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: mountd randomly crash and panic the server
  2007-04-30 23:59           ` Neil Brown
@ 2007-05-02  9:03             ` Gabriel Barazer
  2007-05-02 11:29               ` Neil Brown
  0 siblings, 1 reply; 10+ messages in thread
From: Gabriel Barazer @ 2007-05-02  9:03 UTC (permalink / raw)
  To: Neil Brown; +Cc: nfs

On 05/01/2007 1:59:46 +0200, Neil Brown <neilb@suse.de> wrote:
> 
>>   related kernel panics, which are, I suppose, not linked to the NFS 
>> problem (I still mention it because unrelated but linked problems like 
>> this sometimes happens)
> 
> CFQ panics ?  In cfq_dispatch_insert by any chance?
> 
> This patch from Jens Axboe might be what you need.
> 

The problem is exactly related to this function : Is this bug related to 
the NFS subsystem, or am I lucky and it appears you work on the CFQ 
scheduler as well ? :) I can test this patch, but since this bug panic() 
the kernel, I can't afford to hang the file server very often.

For now I have changed the scheduler to deadline with the 
/sys/block/<device>/queue/scheduler file. Is this patch already sent to 
the upstream kernel sources ?

Thanks,

Gabriel

> NeilBrown
> 
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index 9e37971..f965be7 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -532,6 +532,11 @@ static void cfq_add_rq_rb(struct request *rq)
>  
>  	if (!cfq_cfqq_on_rr(cfqq))
>  		cfq_add_cfqq_rr(cfqd, cfqq);
> +
> +	/*
> +	 * check if this request is a better next-serve candidate
> +	 */
> +	cfqq->next_rq = cfq_choose_req(cfqd, cfqq->next_rq, rq);
>  }
>  
>  static inline void
> @@ -1639,12 +1644,6 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
>  		cfqq->meta_pending++;
>  
>  	/*
> -	 * check if this request is a better next-serve candidate)) {
> -	 */
> -	cfqq->next_rq = cfq_choose_req(cfqd, cfqq->next_rq, rq);
> -	BUG_ON(!cfqq->next_rq);
> -
> -	/*
>  	 * we never wait for an async request and we don't allow preemption
>  	 * of an async request. so just return early
>  	 */
> 

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: mountd randomly crash and panic the server
  2007-05-02  9:03             ` Gabriel Barazer
@ 2007-05-02 11:29               ` Neil Brown
  0 siblings, 0 replies; 10+ messages in thread
From: Neil Brown @ 2007-05-02 11:29 UTC (permalink / raw)
  To: Gabriel Barazer; +Cc: nfs

On Wednesday May 2, gabriel@oxeva.fr wrote:
> On 05/01/2007 1:59:46 +0200, Neil Brown <neilb@suse.de> wrote:
> > 
> >>   related kernel panics, which are, I suppose, not linked to the NFS 
> >> problem (I still mention it because unrelated but linked problems like 
> >> this sometimes happens)
> > 
> > CFQ panics ?  In cfq_dispatch_insert by any chance?
> > 
> > This patch from Jens Axboe might be what you need.
> > 
> 
> The problem is exactly related to this function : Is this bug related to 
> the NFS subsystem, or am I lucky and it appears you work on the CFQ 
> scheduler as well ? :) I can test this patch, but since this bug panic() 
> the kernel, I can't afford to hang the file server very often.

I don't normally work on the CFQ scheduler, but I do work on md/raid,
and the people who had reported this bug were all using md/raid, so I
looked into it.

The patch definitely fixes the problem.

> 
> For now I have changed the scheduler to deadline with the 
> /sys/block/<device>/queue/scheduler file. Is this patch already sent to 
> the upstream kernel sources ?

Yes, it is in 2.6.21.

NeilBrown

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2007-05-02 11:29 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-04-11 15:12 mountd randomly crash and panic the server Gabriel Barazer
  -- strict thread matches above, loose matches on Subject: below --
2007-04-11 15:11 Gabriel Barazer
2007-04-16  9:26 ` Gabriel Barazer
2007-04-16 10:47   ` Neil Brown
2007-04-16 14:07     ` Gabriel Barazer
2007-04-17  1:23       ` Neil Brown
2007-04-30 23:46         ` Gabriel Barazer
2007-04-30 23:59           ` Neil Brown
2007-05-02  9:03             ` Gabriel Barazer
2007-05-02 11:29               ` Neil Brown

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.