[NFS] Server-side locking issue

* [NFS] Server-side locking issue
@ 2008-05-08 22:18 Christian Robottom Reis
       [not found] ` <20080508221815.GB4583-Zkq4WM0RTTBfJ/NunPodnw@public.gmane.org>
  0 siblings, 1 reply; 6+ messages in thread
From: Christian Robottom Reis @ 2008-05-08 22:18 UTC (permalink / raw)
  To: NFS

Today, apparently at random, we had a locking problem on our LAN. Client
applications hung, restarting them led to hangs, and the client dmesgs
showed a familiar:

    [443619.682118] lockd: server anthem not responding, still trying

So the server lockd apparently stopped responding to clients, and
restarting clients got us nowhere. Eventually we cycled the server and
everything's back to normal, but I'm pretty confused as to what
happened. I couldn't scrape any evidence on the server that would point
to why this happened -- no OOPS, error or even warning output.

I was reading through the thread at
http://groups.google.com.br/group/fa.linux.kernel/browse_thread/thread/6c7b5e49a46aef75/91adbb9f298db509?lnk=st&q=nfs+locking+server#91adbb9f298db509
and figured that it might be a similar problem I'm facing, but I'm not
entirely sure as it's hard to say if somebody interrupted a client
program or not (it's a large diskless network).

Clients run 2.6.24-16-generic (stock Ubuntu Hardy) and server is
2.6.22-14-generic (stock Ubuntu Gutsy).

If the problem happens again, what can I do on server and client to
further debug the problem? And is there a utility that clears locks that
we could use to avoid having to restart the server (acking the risks in
cleared locks)?
-- 
Christian Robottom Reis | http://async.com.br/~kiko/ | [+55 16] 3376 0125

_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs
_______________________________________________
Please note that nfs@lists.sourceforge.net is being discontinued.
Please subscribe to linux-nfs@vger.kernel.org instead.
    http://vger.kernel.org/vger-lists.html#linux-nfs

* Re: [NFS] Server-side locking issue
       [not found] ` <20080508221815.GB4583-Zkq4WM0RTTBfJ/NunPodnw@public.gmane.org>
@ 2008-05-09 15:43   ` J. Bruce Fields
  2008-06-12 21:43     ` Christian Robottom Reis
  0 siblings, 1 reply; 6+ messages in thread
From: J. Bruce Fields @ 2008-05-09 15:43 UTC (permalink / raw)
  To: Christian Robottom Reis; +Cc: NFS

On Thu, May 08, 2008 at 07:18:16PM -0300, Christian Robottom Reis wrote:
> Today, apparently at random, we had a locking problem on our LAN. Client
> applications hung, restarting them led to hangs, and the client dmesgs
> showed a familiar:
> 
>     [443619.682118] lockd: server anthem not responding, still trying
> 
> So the server lockd apparently stopped responding to clients, and
> restarting clients got us nowhere. Eventually we cycled the server and
> everything's back to normal, but I'm pretty confused as to what
> happened. I couldn't scrape any evidence on the server that would point
> to why this happened -- no OOPS, error or even warning output.
> 
> I was reading through the thread at
> http://groups.google.com.br/group/fa.linux.kernel/browse_thread/thread/6c7b5e49a46aef75/91adbb9f298db509?lnk=st&q=nfs+locking+server#91adbb9f298db509
> and figured that it might be a similar problem I'm facing, but I'm not
> entirely sure as it's hard to say if somebody interrupted a client
> program or not (it's a large diskless network).

I don't think the server stopped responding to clients in the case
Miklos described.

Perhaps a sysrq-T dump of lockd would show where (and whether) it's
blocked?  (So once lockd stops responding, log into the server, run
"echo t >/proc/sysrq-trigger", and collect the output from the logs,
especially the stacktrace for the lockd process).
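
A rough sketch of pulling just the lockd stanza out of a saved sysrq-t dump.
The function name extract_task_trace is made up for illustration, and the
parsing rests on an assumption: each task begins with a header line of the
form "<comm> <state> ..." where <state> is a single letter such as R, S or D.
The exact layout of sysrq-t output differs between kernel versions, so treat
this as a starting point and not as a parser for every kernel.

```shell
# Print the sysrq-t stanza for one task name from a dump read on stdin.
# ASSUMPTION: a task header looks like "<comm> <state> ...", optionally
# preceded by a dmesg timestamp such as "[443619.682118]".
extract_task_trace() {
    awk -v task="$1" '
        {
            line = $0
            # Drop a leading dmesg timestamp, if there is one.
            sub(/^\[ *[0-9]+\.[0-9]+\] */, "", line)
            n = split(line, f, " ")
            # Heuristic: second field is a one-letter task state.
            header = (n >= 3 && f[2] ~ /^[RSDTZXW]$/)
            if (header) printing = (f[1] == task)
            if (printing) print
        }
    '
}

# Usage (hypothetical file name):
#   extract_task_trace lockd < /tmp/sysrq-t.out
```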

> 
> Clients run 2.6.24-16-generic (stock Ubuntu Hardy) and server is
> 2.6.22-14-generic (stock Ubuntu Gutsy).
> 
> If the problem happens again, what can I do on server and client to
> further debug the problem? And is there a utility that clears locks that
> we could use to avoid having to restart the server (acking the risks in
> cleared locks)?

If the server lockd has completely stopped responding to lockd requests,
then the problem isn't just a stray file lock.

--b.

* Re: [NFS] Server-side locking issue
  2008-05-09 15:43   ` J. Bruce Fields
@ 2008-06-12 21:43     ` Christian Robottom Reis
       [not found]       ` <20080612214340.GA17293-Zkq4WM0RTTBfJ/NunPodnw@public.gmane.org>
  0 siblings, 1 reply; 6+ messages in thread
From: Christian Robottom Reis @ 2008-06-12 21:43 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: NFS, Ronaldo Maia

On Fri, May 09, 2008 at 11:43:05AM -0400, J. Bruce Fields wrote:
> I don't think the server stopped responding to clients in the case
> Miklos described.

Okay. Well, one month later, it happened again to me.

> Perhaps a sysrq-T dump of lockd would show where (and whether) it's
> blocked?  (So once lockd stops responding, log into the server, run
> "echo t >/proc/sysrq-trigger", and collect the output from the logs,
> especially the stacktrace for the lockd process).

This time I did a ps auxww looking for the lockd process. And guess
what?

root      6323  0.0  0.0      0     0 ?        D    Jun01   0:50 [lockd]

I wonder why it's in the D state. I also wonder if there's a way to get
it back once it's in this state -- without reloading the kernel module
or rebooting, I guess.
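
A small sketch for listing every task in the D (uninterruptible sleep) state
together with the kernel function it is waiting in, which can hint at why a
thread like lockd is stuck. The helper name filter_uninterruptible is
hypothetical; it only assumes that the process state is the second column of
the ps output it is given.

```shell
# Keep the ps header line plus every row whose state (column 2) starts with D.
# ASSUMPTION: input comes from a ps invocation with the state in column 2.
filter_uninterruptible() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Usage: show PID, state, wait channel and command name for D-state tasks.
#   ps -eo pid,stat,wchan:32,comm | filter_uninterruptible
```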

I've collected a trace, at any rate, but lockd isn't even listed in it --
I can send it in if it makes sense.

What sort of debugging can I do to figure out what's wrong here?

(This is a dual-Xeon running:

    Linux anthem 2.6.22-14-generic #1 SMP Tue Feb 12 07:42:25 UTC 2008 i686 GNU/Linux)
-- 
Christian Robottom Reis | http://async.com.br/~kiko/ | [+55 16] 3376 0125

* Re: [NFS] Server-side locking issue
       [not found]       ` <20080612214340.GA17293-Zkq4WM0RTTBfJ/NunPodnw@public.gmane.org>
@ 2008-06-12 22:17         ` Wendy Cheng
  2008-06-12 22:42         ` Wendy Cheng
  2008-06-12 23:50         ` Jeff Layton
  2 siblings, 0 replies; 6+ messages in thread
From: Wendy Cheng @ 2008-06-12 22:17 UTC (permalink / raw)
  To: Christian Robottom Reis; +Cc: Linux NFS Mailing List

Christian Robottom Reis wrote:
>
> This time I did a ps auxww looking for the lockd process. And guess
> what?
>
> root      6323  0.0  0.0      0     0 ?        D    Jun01   0:50 [lockd]
>
> I wonder why it's in the D state. I also wonder if there's a way to get
> it back once it's in this state -- without reloading the kernel module
> or rebooting, I guess.
>
> I've collected a trace, at any rate, but lockd isn't even listed in it --
> I can send it in if it makes sense.
>   

What kind of "trace" data have you collected? As a rule of thumb, when a
process is stuck inside the kernel, the best approach is:

shell> cd /proc
shell> echo w > sysrq-trigger   # do this a couple of times
shell> echo t > sysrq-trigger

The "w" forces the kernel to print backtraces of the threads that are
currently on the active CPUs. The "t" prints backtraces for all the
threads on the machine (but sometimes skips the ones spinning on the
CPUs). These traces give people a much better idea of what was going on
in the kernel at that particular time. All the backtraces should show up
in the /var/log/messages file and/or on the system console.

*Warning* ... the "t" will pause the system for a noticeable amount of
time (a few seconds to a few minutes, depending on the thread count)
since it has to walk through every thread's stack on the running system.
If you have a cluster configured, it could make the node miss its
heartbeat processing (so you need to increase the heartbeat interval
before doing this).
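
The sequence above could be wrapped up roughly like this. The function name
collect_sysrq_traces and its parameters (trigger path, number of "w" dumps,
pause in seconds) are invented for this sketch; the trigger path is a
parameter only so the function can be tried against a scratch file first.
Run against the real /proc/sysrq-trigger it needs root, and the warning about
the "t" pause applies.

```shell
# Send several sysrq "w" requests, then one "t", to the given trigger file.
# ASSUMPTION: defaults of 3 repeats and a 1 second pause are arbitrary.
collect_sysrq_traces() {
    trigger=${1:-/proc/sysrq-trigger}
    repeats=${2:-3}
    pause=${3:-1}
    i=0
    while [ "$i" -lt "$repeats" ]; do
        echo w > "$trigger" || return 1
        sleep "$pause"
        i=$((i + 1))
    done
    # The full task dump goes last because it is the expensive one.
    echo t > "$trigger"
}

# Usage (as root):
#   collect_sysrq_traces /proc/sysrq-trigger 3 1
```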

> What sort of debugging can I do to figure out what's wrong here?
>
> (This is a dual-Xeon running:
>
>     Linux anthem 2.6.22-14-generic #1 SMP Tue Feb 12 07:42:25 UTC 2008 i686 GNU/Linux)
>   
Another approach is to build a debug kernel and run "crash" to poke at
the live kernel. Dave Anderson from Red Hat has an excellent tutorial on
his people page: http://people.redhat.com/anderson . It is very helpful.

-- Wendy

* Re: [NFS] Server-side locking issue
       [not found]       ` <20080612214340.GA17293-Zkq4WM0RTTBfJ/NunPodnw@public.gmane.org>
  2008-06-12 22:17         ` Wendy Cheng
@ 2008-06-12 22:42         ` Wendy Cheng
  2008-06-12 23:50         ` Jeff Layton
  2 siblings, 0 replies; 6+ messages in thread
From: Wendy Cheng @ 2008-06-12 22:42 UTC (permalink / raw)
  To: Christian Robottom Reis; +Cc: J. Bruce Fields, NFS, Ronaldo Maia

Christian Robottom Reis wrote:
> On Fri, May 09, 2008 at 11:43:05AM -0400, J. Bruce Fields wrote:
>
>   
>> Perhaps a sysrq-T dump of lockd would show where (and whether) it's
>> blocked?  (So once lockd stops responding, log into the server, run
>> "echo t >/proc/sysrq-trigger", and collect the output from the logs,
>> especially the stacktrace for the lockd process).
>>     
Sorry, I was debugging some other stuff on my screen while mindlessly
reading the mailing list and somehow missed Bruce's original "sysrq-t"
paragraph. However, it looks like you've missed the "w" part. My guess is
that lockd was spinning on the CPU when you hit the sysrq-t key, or
your sysrq-t didn't run to completion at all.

-- Wendy

* Re: [NFS] Server-side locking issue
       [not found]       ` <20080612214340.GA17293-Zkq4WM0RTTBfJ/NunPodnw@public.gmane.org>
  2008-06-12 22:17         ` Wendy Cheng
  2008-06-12 22:42         ` Wendy Cheng
@ 2008-06-12 23:50         ` Jeff Layton
  2 siblings, 0 replies; 6+ messages in thread
From: Jeff Layton @ 2008-06-12 23:50 UTC (permalink / raw)
  To: Christian Robottom Reis; +Cc: J. Bruce Fields, NFS, Ronaldo Maia

On Thu, 12 Jun 2008 18:43:40 -0300
Christian Robottom Reis <kiko@canonical.com> wrote:

> On Fri, May 09, 2008 at 11:43:05AM -0400, J. Bruce Fields wrote:
> > I don't think the server stopped responding to clients in the case
> > Miklos described.
> 
> Okay. Well, one month later, it happened again to me.
> 
> > Perhaps a sysrq-T dump of lockd would show where (and whether) it's
> > blocked?  (So once lockd stops responding, log into the server, run
> > "echo t >/proc/sysrq-trigger", and collect the output from the logs,
> > especially the stacktrace for the lockd process).
> 
> This time I did a ps auxww looking for the lockd process. And guess
> what?
> 
> root      6323  0.0  0.0      0     0 ?        D    Jun01   0:50 [lockd]
> 
> I wonder why it's in the D state. I also wonder if there's a way to get
> it back once it's in this state -- without reloading the kernel module
> or rebooting, I guess.
> 
> I've collected a trace, at any rate, but lockd isn't even listed in it --
> I can send it in if it makes sense.
> 

That's not atypical at all. syslog uses an unreliable transport, so when
you send it a flood of data (say, with a sysrq-t) some of it can be lost.
After a sysrq-t I usually recommend dumping the data straight out of the
ring buffer:

    # dmesg > /tmp/sysrq-t.out

...or something. You might still lose stuff that got pushed out of the
ring buffer, but the stuff that is there will at least be complete.
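
A quick sanity check on a saved dump might look like the sketch below. The
function name check_sysrq_dump is hypothetical. It assumes the kernel prints
a "SysRq : Show State" banner (or a similarly worded one) before the task
list, so a missing banner suggests the start of the output was pushed out of
the ring buffer; the banner wording can vary between kernel versions.

```shell
# Report whether a saved dump has the sysrq-t banner and mentions a task.
# ASSUMPTION: the banner matches "SysRq : Show State", case-insensitively.
check_sysrq_dump() {
    dump=$1
    task=$2
    status=0
    if ! grep -Eiq 'sysrq ?: ?show state' "$dump"; then
        echo "no sysrq-t banner found: the start of the dump may have been overwritten"
        status=1
    fi
    if ! grep -qw -- "$task" "$dump"; then
        echo "no mention of $task: its trace may have been lost"
        status=1
    fi
    return "$status"
}

# Usage (as root, right after the sysrq-t):
#   dmesg > /tmp/sysrq-t.out && check_sysrq_dump /tmp/sysrq-t.out lockd
```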

> What sort of debugging can I do to figure out what's wrong here?
> 

You'll really need that sysrq-t info...or a core dump, or to run a
debugger on the running kernel (like Wendy recommended).

> (This is a dual-Xeon running:
> 
>     Linux anthem 2.6.22-14-generic #1 SMP Tue Feb 12 07:42:25 UTC 2008 i686 GNU/Linux)

There were some patches that went into 2.6.25 (I think) that fix
problems that could cause lockd to hang in some cases. This patch,
in particular, may be of interest:

Subject: [PATCH 1/4] NLM: set RPC_CLNT_CREATE_NOPING for NLM RPC clients
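
To see at a glance whether a machine runs a kernel older than a given
release, something like the sketch below could be used. The function name
kernel_older_than is made up, and the 2.6.25 threshold in the usage line is
only the tentative version mentioned above. Distribution kernels often carry
backported fixes, so a plain version comparison is a rough hint at best.

```shell
# Succeed if kernel release $1 is older than upstream version $2.
# ASSUMPTION: anything after the first "-" (e.g. "-14-generic") is a
# distribution suffix and is ignored for the comparison.
kernel_older_than() {
    printf '%s %s\n' "${1%%-*}" "$2" | awk '{
        n = split($1, a, ".")
        m = split($2, b, ".")
        # Compare component by component, treating missing parts as 0.
        for (i = 1; i <= (n > m ? n : m); i++) {
            x = (i <= n) ? a[i] + 0 : 0
            y = (i <= m) ? b[i] + 0 : 0
            if (x < y) exit 0
            if (x > y) exit 1
        }
        exit 1   # equal versions are not "older"
    }'
}

# Usage:
#   kernel_older_than "$(uname -r)" 2.6.25 && echo "kernel predates 2.6.25"
```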

Cheers,
-- 
Jeff Layton <jlayton@redhat.com>
