From: Gabriel Barazer <gabriel@oxeva.fr>
To: nfs@lists.sourceforge.net
Subject: mountd randomly crash and panic the server
Date: Wed, 11 Apr 2007 17:12:26 +0200
Message-ID: <461CFADA.1080305@oxeva.fr>

Hello there,

I recently ran into a problem that is probably caused by rpc.mountd (from
nfs-utils 1.0.7). I have a file server running Linux 2.6.20.3 that has
crashed 4 times since March 1st under fairly heavy load. Each time the
syslog showed the kernel errors below. I only read them after the crash,
because the kernel had probably panic()ed and no network access was possible:

Mar 19 19:52:57 filer1 kernel: general protection fault: 0000 [1] SMP
Mar 19 19:52:57 filer1 kernel: CPU 3
Mar 19 19:52:57 filer1 kernel: Modules linked in:
Mar 19 19:52:57 filer1 kernel: Pid: 2238, comm: rpc.mountd Not tainted 2.6.20.3 #1
Mar 19 19:52:57 filer1 kernel: RIP: 0010:[<ffffffff8053606f>] [<ffffffff8053606f>] cache_clean+0x11e/0x22f
Mar 19 19:52:57 filer1 kernel: RSP: 0018:ffff810068da7b68  EFLAGS: 00010206
Mar 19 19:52:57 filer1 kernel: RAX: ffffffff806a0e80 RBX: 0001e71926010009 RCX: ffffffff806a0e80
Mar 19 19:52:57 filer1 kernel: RDX: ffffffff806a0e80 RSI: 0000000000000071 RDI: ffffffff806a0e98
Mar 19 19:52:57 filer1 kernel: RBP: ffff8100090ad7c0 R08: 0000000000000000 R09: 000000000000006b
Mar 19 19:52:57 filer1 kernel: R10: 0000000000000000 R11: 0000000000000005 R12: 0000000000000000
Mar 19 19:52:57 filer1 kernel: R13: ffff810068da7e08 R14: 0000000000000006 R15: ffff81001669c000
Mar 19 19:52:57 filer1 kernel: FS:  00002ab0119c36e0(0000) GS:ffff81007ff397c0(0000) knlGS:0000000000000000
Mar 19 19:52:57 filer1 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Mar 19 19:52:57 filer1 kernel: CR2: 00002ab011474000 CR3: 000000007e228000 CR4: 00000000000006e0
Mar 19 19:52:57 filer1 kernel: Process rpc.mountd (pid: 2238, threadinfo ffff810068da6000, task ffff81006b5a6400)
Mar 19 19:52:57 filer1 kernel: Stack:  ffff810050a9ae40 0000000000000000 ffff810068da7da8 ffffffff8053618d
Mar 19 19:52:57 filer1 kernel:  ffff810050a9ae40 ffffffff803489f8 00000000fffffffa ffff810027312840
Mar 19 19:52:57 filer1 kernel:  ffff810025bb2400 ffff810025bb2400 ffff810068da7c69 ffff810068da7caa
Mar 19 19:52:57 filer1 kernel: Call Trace:
Mar 19 19:52:57 filer1 kernel:  [<ffffffff8053618d>] cache_flush+0xd/0x23
Mar 19 19:52:57 filer1 kernel:  [<ffffffff803489f8>] svc_export_parse+0x5c2/0x650
Mar 19 19:52:57 filer1 kernel:  [<ffffffff8020d338>] current_fs_time+0x4d/0x52
Mar 19 19:52:57 filer1 kernel:  [<ffffffff803dc1a5>] elv_next_request+0x40/0x14f
Mar 19 19:52:57 filer1 kernel:  [<ffffffff80224ccd>] sync_page+0x0/0x41
Mar 19 19:52:57 filer1 kernel:  [<ffffffff805367fe>] cache_write+0x90/0xac
Mar 19 19:52:57 filer1 kernel:  [<ffffffff80213f40>] vfs_write+0xaf/0x151
Mar 19 19:52:57 filer1 kernel:  [<ffffffff802148d0>] sys_write+0x45/0x6e
Mar 19 19:52:57 filer1 kernel:  [<ffffffff8025411e>] system_call+0x7e/0x83
Mar 19 19:52:57 filer1 kernel:
Mar 19 19:52:57 filer1 kernel:
Mar 19 19:52:57 filer1 kernel: Code: 48 8b 43 08 48 39 82 80 00 00 00 7e 0a 48 ff c0 48 89 82 80
Mar 19 19:52:57 filer1 kernel: RIP  [<ffffffff8053606f>] cache_clean+0x11e/0x22f
Mar 19 19:52:57 filer1 kernel:  RSP <ffff810068da7b68>

The other three times, there was this error:

Apr  2 18:30:00 filer1 kernel: general protection fault: 0000 [1] SMP
Apr  2 18:30:00 filer1 kernel: CPU 2
Apr  2 18:30:00 filer1 kernel: Modules linked in:
Apr  2 18:30:00 filer1 kernel: Pid: 891, comm: rpc.mountd Not tainted 2.6.20.3 #1
Apr  2 18:30:00 filer1 kernel: RIP: 0010:[<ffffffff8053606f>] [<ffffffff8053606f>] cache_clean+0x11e/0x22f
Apr  2 18:30:00 filer1 kernel: RSP: 0018:ffff8100778ffe18  EFLAGS: 00010202
Apr  2 18:30:00 filer1 kernel: RAX: ffffffff806a0e80 RBX: 2d305f742d315f61 RCX: ffffffff806a0e80
Apr  2 18:30:00 filer1 kernel: RDX: ffffffff806a0e80 RSI: 0000000000000070 RDI: ffffffff806a0e98
Apr  2 18:30:00 filer1 kernel: RBP: ffff810008664c40 R08: ffffffff807783a0 R09: 00000000000000fc
Apr  2 18:30:00 filer1 kernel: R10: 000000000000000b R11: 0000000000000027 R12: 0000000000000000
Apr  2 18:30:00 filer1 kernel: R13: ffff8100778ffe78 R14: 0000000000000001 R15: 000000000040c62b
Apr  2 18:30:00 filer1 kernel: FS:  00002abf4f2526e0(0000) GS:ffff81007ff39e40(0000) knlGS:0000000000000000
Apr  2 18:30:00 filer1 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Apr  2 18:30:00 filer1 kernel: CR2: 00002b3251f83000 CR3: 000000003266e000 CR4: 00000000000006e0
Apr  2 18:30:00 filer1 kernel: Process rpc.mountd (pid: 891, threadinfo ffff8100778fe000, task ffff810077fd03c0)
Apr  2 18:30:00 filer1 kernel: Stack:  ffff810079cca840 0000000000000000 000000004611368f ffffffff8053618d
Apr  2 18:30:00 filer1 kernel:  0000000000000000 ffffffff80533a01 ffff8100778ffe77 0000000002cd00e3
Apr  2 18:30:00 filer1 kernel:  ffff8100778ffe92 0000000a000081a4 0000000000000000 000000000000002f
Apr  2 18:30:00 filer1 kernel: Call Trace:
Apr  2 18:30:00 filer1 kernel:  [<ffffffff8053618d>] cache_flush+0xd/0x23
Apr  2 18:30:00 filer1 kernel:  [<ffffffff80533a01>] ip_map_parse+0x17c/0x18e
Apr  2 18:30:00 filer1 kernel:  [<ffffffff805367fe>] cache_write+0x90/0xac
Apr  2 18:30:00 filer1 kernel:  [<ffffffff80213f40>] vfs_write+0xaf/0x151
Apr  2 18:30:00 filer1 kernel:  [<ffffffff802148d0>] sys_write+0x45/0x6e
Apr  2 18:30:00 filer1 kernel:  [<ffffffff8025411e>] system_call+0x7e/0x83
Apr  2 18:30:00 filer1 kernel:
Apr  2 18:30:00 filer1 kernel:
Apr  2 18:30:00 filer1 kernel: Code: 48 8b 43 08 48 39 82 80 00 00 00 7e 0a 48 ff c0 48 89 82 80
Apr  2 18:30:00 filer1 kernel: RIP  [<ffffffff8053606f>] cache_clean+0x11e/0x22f
Apr  2 18:30:00 filer1 kernel:  RSP <ffff8100778ffe18>

And this one (only call trace included):

Mar 17 17:08:34 filer1 kernel: general protection fault: 0000 [1] SMP
Mar 17 17:08:34 filer1 kernel: CPU 3
Mar 17 17:08:34 filer1 kernel: Pid: 13, comm: events/3 Not tainted 2.6.20 #5
Mar 17 17:08:34 filer1 kernel: RIP: 0010:[<ffffffff8051f81b>] [<ffffffff8051f81b>] cache_clean+0x11e/0x22f
Mar 17 17:08:34 filer1 kernel: RSP: 0018:ffff81007fe11e40  EFLAGS: 00010202
Mar 17 17:08:34 filer1 kernel: RAX: ffffffff80660080 RBX: 2d305f742d315f61 RCX: ffffffff80660080
Mar 17 17:08:34 filer1 kernel: RDX: ffffffff80660080 RSI: 0000000000000077 RDI: ffffffff80660098
Mar 17 17:08:34 filer1 kernel: RBP: ffff81001e9b5e40 R08: ffff81007fe10000 R09: 0000000000000286
Mar 17 17:08:34 filer1 kernel: R10: 0000000000000286 R11: ffffffff80240992 R12: 0000000000000000
Mar 17 17:08:34 filer1 kernel: R13: 0000000000000206 R14: ffffffff8051fab1 R15: 0000000000000000
Mar 17 17:08:34 filer1 kernel: FS:  0000000000000000(0000) GS:ffff810002f817c0(0000) knlGS:0000000000000000
Mar 17 17:08:34 filer1 kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Mar 17 17:08:34 filer1 kernel: CR2: 00002b6b14d8bbc0 CR3: 000000007b57c000 CR4: 00000000000006e0
Mar 17 17:08:34 filer1 kernel: Process events/3 (pid: 13, threadinfo ffff81007fe10000, task ffff81007fe3ef00)
Mar 17 17:08:34 filer1 kernel: Stack:  0000000000000005 ffff810002f815c0 ffffffff806602c0 ffffffff8051fabc
Mar 17 17:08:34 filer1 kernel:  ffffffff806602c8 ffffffff80244799 ffff810002f815c0 ffffffff802415b3
Mar 17 17:08:34 filer1 kernel:  ffff81007ff23d60 00000000fffffffc 0000000000000000 ffffffff802416d7
Mar 17 17:08:34 filer1 kernel: Call Trace:
Mar 17 17:08:34 filer1 kernel:  [<ffffffff8051fabc>] do_cache_clean+0xb/0x38
Mar 17 17:08:34 filer1 kernel:  [<ffffffff80244799>] run_workqueue+0x9f/0x13b
Mar 17 17:08:34 filer1 kernel:  [<ffffffff802415b3>] worker_thread+0x0/0x15a
Mar 17 17:08:34 filer1 kernel:  [<ffffffff802416d7>] worker_thread+0x124/0x15a
Mar 17 17:08:34 filer1 kernel:  [<ffffffff802731fd>] default_wake_function+0x0/0xe
Mar 17 17:08:34 filer1 kernel:  [<ffffffff802731fd>] default_wake_function+0x0/0xe
Mar 17 17:08:35 filer1 kernel:  [<ffffffff8022d1c6>] kthread+0xc8/0xf1
Mar 17 17:08:35 filer1 kernel:  [<ffffffff80252ec8>] child_rip+0xa/0x12
Mar 17 17:08:35 filer1 kernel:  [<ffffffff802838e8>] kthread_create+0x6a/0x147
Mar 17 17:08:35 filer1 kernel:  [<ffffffff8022d0fe>] kthread+0x0/0xf1
Mar 17 17:08:35 filer1 kernel:  [<ffffffff80252ebe>] child_rip+0x0/0x12
Mar 17 17:08:35 filer1 kernel:
Mar 17 17:08:35 filer1 kernel:
Mar 17 17:08:35 filer1 kernel: Code: 48 8b 43 08 48 39 82 80 00 00 00 7e 0a 48 ff c0 48 89 82 80
Mar 17 17:08:35 filer1 kernel: RIP  [<ffffffff8051f81b>] cache_clean+0x11e/0x22f
Mar 17 17:08:35 filer1 kernel:  RSP <ffff81007fe11e40>

Notice that all three crashes end at the same place,
cache_clean+0x11e/0x22f. Maybe this means something to someone?
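Here is one quick way to look at the faulting state. It is a sketch, not a diagnosis. All three oopses show identical Code: bytes. Assuming the dump starts at the faulting instruction, the first one, 48 8b 43 08, decodes as mov 0x8(%rbx),%rax, which is a load through RBX. The Python below checks each logged RBX value in two ways. First, it tests whether the value is a canonical x86-64 address, assuming 48-bit virtual addressing. Second, it tests whether the value's bytes happen to be printable ASCII. The helper names as_ascii and is_canonical are hypothetical.

```python
def as_ascii(value):
    """Return the 8 bytes of a 64-bit value, in little-endian memory order,
    as text if every byte is printable ASCII; otherwise None."""
    raw = value.to_bytes(8, "little")
    if all(0x20 <= b < 0x7F for b in raw):
        return raw.decode("ascii")
    return None


def is_canonical(addr):
    """True if addr is canonical for x86-64 with 48-bit virtual addresses,
    i.e. bits 63..47 are all zeros or all ones (assumption: 48-bit VA)."""
    top = addr >> 47
    return top == 0 or top == 0x1FFFF


# RBX values copied from the oopses above (Mar 17 and Apr 2 share a value).
rbx_values = {
    "Mar 19": 0x0001E71926010009,
    "Mar 17 / Apr 2": 0x2D305F742D315F61,
}
for when, rbx in rbx_values.items():
    print(when, "RBX=%016x" % rbx,
          "canonical" if is_canonical(rbx) else "non-canonical",
          "ascii=%r" % as_ascii(rbx))
```

Both values turn out to be non-canonical addresses. On x86-64, dereferencing a non-canonical address raises a general protection fault rather than a page fault, which matches the logs. The value 2d305f742d315f61 reads as the text "a_1-t_0-" in memory order. That may hint that the pointer was overwritten with string data, but it is only an observation, not a confirmed cause.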

Additionally, I noticed today that the rpc.mountd process (from
nfs-utils 1.1.0-rc1) is about 350 MB in size, as the ps command shows:
root      2091  0.0 17.2 364164 354664 ?     Ss   Apr03   0:27 rpc.mountd --no-nfs-version 2

Its memory usage also seems to grow very slowly over time.
/proc/<pid of rpc.mountd>/smaps shows that these 350 MB are taken by
the heap:
00514000-15eea000 rw-p 00514000 00:00 0 [heap]
Size:            354136 kB
Rss:             353756 kB
Shared_Clean:         0 kB
Shared_Dirty:         0 kB
Private_Clean:        0 kB
Private_Dirty:   353756 kB
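One way to measure the slow growth is to sample the [heap] entry of /proc/<pid>/smaps at regular intervals. The sketch below is a hypothetical helper and is not part of nfs-utils. The names heap_usage_kb and watch, the 60-second interval and the sample count are all assumptions. It relies only on the Size: and Rss: lines shown above. It also expects the unwrapped smaps format, where [heap] sits at the end of the mapping's header line.

```python
import sys
import time


def heap_usage_kb(smaps_text):
    """Return {"Size": kB, "Rss": kB} for the [heap] mapping, or None."""
    usage = {}
    in_heap = False
    for line in smaps_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # Mapping header lines start with an address range, e.g. "00514000-15eea000".
        if "-" in fields[0] and not fields[0].endswith(":"):
            if in_heap:
                break  # reached the mapping after [heap]; done
            in_heap = fields[-1] == "[heap]"
            continue
        if in_heap and fields[0] in ("Size:", "Rss:"):
            usage[fields[0].rstrip(":")] = int(fields[1])
    return usage or None


def watch(pid, interval=60.0, samples=10):
    """Print the heap Size/Rss of process `pid` every `interval` seconds."""
    path = "/proc/%d/smaps" % pid
    for i in range(samples):
        with open(path) as f:
            print(time.strftime("%Y-%m-%d %H:%M:%S"), heap_usage_kb(f.read()))
        if i + 1 < samples:
            time.sleep(interval)


if __name__ == "__main__" and len(sys.argv) > 1:
    watch(int(sys.argv[1]))
```

Run it with the rpc.mountd PID as the only argument. A steadily rising Rss across samples would put a number on the slow growth described above.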

I don't know whether these two issues are related, but since I got a
crash in rpc.mountd (see the crash from April 2nd), I am including this
as well.

Any thoughts?

Regards,

Gabriel



_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs
