* v4recovery client id lockup
@ 2012-02-23 1:06 Louie
2012-02-23 16:52 ` Jeff Layton
0 siblings, 1 reply; 5+ messages in thread
From: Louie @ 2012-02-23 1:06 UTC (permalink / raw)
To: linux-nfs
We have a weird, intermittent issue with NFS that I've been trying to
track down for the past 6 months. This is NFSv4, mounted over an SSH
tunnel, with CentOS 6.2 as both client and server.
Periodically, when running a client-side command that reads a large
number of files (e.g. converting 2000 small picture files to another
format over NFS), our server completely locks up for a period of time.
atop shows 50-90% IO activity on the sda drive (the root filesystem,
not the shared NFS area where the files are actually located).
I've finally tracked the activity down to the
/var/lib/nfs/v4recovery directory. One of the client ID directories
gets created and deleted over and over again (same name each time),
enough to completely lock up the system. If I sit in that directory
while this is happening and run "ls" over and over, I can see it
disappear and reappear ("ls -i" shows new inode numbers each time).
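(For anyone trying to reproduce this: a rough sketch of the loop I
ran, nothing fancier than repeated "ls" calls; the path is the stock
CentOS location and the one-second interval is arbitrary.)

    # watch the client ID directory churn; "ls -i" shows a new inode
    # number each time the directory is deleted and recreated
    while true; do
        ls -i /var/lib/nfs/v4recovery/
        sleep 1
    done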
The strange thing is that this is periodic, and if I simply kill the
client process and restart it, everything often works smoothly. The
actual server IO activity seems to be coming from the journal (that's
what appears in iostat), but it's only writing and rewriting the empty
client ID directories (iostat shows the activity as 0.0 kB/s).
I've searched everywhere for information on this directory, and on how
to debug this sort of thing in general, and have come up empty; sorry
if this has been covered before.
Appreciate ANY help, this has been driving me completely crazy.
* Re: v4recovery client id lockup
  2012-02-23  1:06 v4recovery client id lockup Louie
@ 2012-02-23 16:52 ` Jeff Layton
  2012-02-24 21:08   ` Louie
  0 siblings, 1 reply; 5+ messages in thread
From: Jeff Layton @ 2012-02-23 16:52 UTC (permalink / raw)
  To: Louie; +Cc: linux-nfs

On Wed, 22 Feb 2012 17:06:49 -0800 Louie <snikrep@gmail.com> wrote:

> We have a weird, intermittent issue with NFS that I've been trying to
> track down for the past 6 months. This is NFSv4, mounted over an SSH
> tunnel, with CentOS 6.2 as both client and server.
[...]
> Appreciate ANY help, this has been driving me completely crazy.

Those directories are how the server tells which clients are allowed
to reclaim locks and which are not. There are some problems that can
occur when a server reboot coincides with a network partition between
server and client; see section 8.6.3 of RFC 3530 if you're interested
in the gory details...

In any case, nfsd tracks some info in that directory in order to deal
with those cases. It's certainly possible there is a bug in that code,
though. I fixed a few subtle bugs in it recently with this patchset,
which I've proposed for 3.4:

    [PATCH v6 0/5] nfsd: overhaul the client name tracking code

...but none of them sound similar to what you're seeing. Still, you
may want to try that patchset and see whether it helps this case at
all. You won't need the userspace pieces if you're still using the
legacy client tracking code.

--
Jeff Layton <jlayton@redhat.com>
* Re: v4recovery client id lockup
  2012-02-23 16:52 ` Jeff Layton
@ 2012-02-24 21:08   ` Louie
  [not found]         ` <CAHHaOuasyyQY7p+HCRwyYuJDT0mmmXUpUCivfp9D8nNRUQ9qDg@mail.gmail.com>
  2012-02-28 19:45    ` J. Bruce Fields
  0 siblings, 2 replies; 5+ messages in thread
From: Louie @ 2012-02-24 21:08 UTC (permalink / raw)
  To: Jeff Layton; +Cc: linux-nfs

Thanks for the help. I think I've tracked this down, in case anybody
else ever runs into the same issue.

We have multiple clients connecting via SSH tunnels, so all NFS
traffic is routed through localhost (127.0.0.1) on these open ports.

The problem appears to be that the NFS server only partially
distinguishes between these clients coming through the local tunnels.
On each alternating connection, the
/var/lib/nfs/rpc_pipefs/nfsd4_cb/clntID directory is replaced with a
new one (it shows the same IP address, 127.0.0.1, but a new port), and
the client's hash directory under v4recovery is removed and recreated
with the exact same hash. When multiple clients hit the box at the
same time, this causes the lockup.

I'm guessing there is no solution and our setup just isn't supported.
I'm leaning towards ditching the SSH tunnels and going with
unencrypted traffic for now, as it's not strictly necessary. But if
anybody has a tip on how to fix this, I'd love to hear it.

Thanks for the help!

On Thu, Feb 23, 2012 at 8:52 AM, Jeff Layton <jlayton@redhat.com> wrote:
> Those directories are how the server tells which clients are allowed
> to reclaim locks and which are not.
[...]
> Still, you may want to try that patchset and see whether it helps
> this case at all. You won't need the userspace pieces if you're still
> using the legacy client tracking code.
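(For concreteness, the tunnel setup described above amounts to
something like the following on each client; the host name, export
path, and local port are illustrative, not the actual values from this
setup.)

    # forward a local port to the server's NFS port (2049) over SSH;
    # sshd on the server connects to 127.0.0.1:2049 locally, so the
    # server sees every client arriving from 127.0.0.1
    ssh -f -N -L 3049:localhost:2049 user@nfs-server.example.com

    # mount through the tunnel; since each client mounts from
    # 127.0.0.1, the clients become indistinguishable to the server
    mount -t nfs4 -o port=3049 127.0.0.1:/export /mnt/nfs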
[parent not found: <CAHHaOuasyyQY7p+HCRwyYuJDT0mmmXUpUCivfp9D8nNRUQ9qDg@mail.gmail.com>]
* Re: v4recovery client id lockup
  [not found] ` <CAHHaOuasyyQY7p+HCRwyYuJDT0mmmXUpUCivfp9D8nNRUQ9qDg@mail.gmail.com>
@ 2012-02-25  0:32 ` Louie
  0 siblings, 0 replies; 5+ messages in thread
From: Louie @ 2012-02-25 0:32 UTC (permalink / raw)
  To: David Brodbeck; +Cc: linux-nfs

Good idea. I went ahead and implemented this, and everything appears
to be working. Sure glad to solve this; it was driving us absolutely
nuts (it would freeze the NFS connections to our lab computers, which
would crash everything).

Many thanks!!
-Louie

On Fri, Feb 24, 2012 at 2:29 PM, David Brodbeck <brodbd@uw.edu> wrote:
> On Fri, Feb 24, 2012 at 1:08 PM, Louie <snikrep@gmail.com> wrote:
>> I'm guessing there is no solution and our setup just isn't supported.
>> I'm leaning towards ditching the SSH tunnels and going with
>> unencrypted traffic for now, as it's not strictly necessary. But if
>> anybody has a tip on how to fix this, I'd love to hear it.
>
> You could always switch to a VPN solution of some kind, such as
> OpenVPN. This would let your clients have different IPs while still
> preserving the security advantages of an SSH tunnel.
>
> --
> David Brodbeck
> System Administrator, Linguistics
> University of Washington
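(To make the suggestion above concrete: with a routed VPN, each client
keeps its own distinct tunnel address, so the collisions described
earlier in the thread go away. A minimal sketch of the client side
once the VPN is up; the addresses and export path are examples only.)

    # 10.8.0.1 stands in for the server's VPN address; each client
    # connects from its own distinct 10.8.0.x address, so the server
    # can tell the clients apart again
    mount -t nfs4 10.8.0.1:/export /mnt/nfs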
* Re: v4recovery client id lockup
  2012-02-24 21:08 ` Louie
  [not found]       ` <CAHHaOuasyyQY7p+HCRwyYuJDT0mmmXUpUCivfp9D8nNRUQ9qDg@mail.gmail.com>
@ 2012-02-28 19:45 ` J. Bruce Fields
  1 sibling, 0 replies; 5+ messages in thread
From: J. Bruce Fields @ 2012-02-28 19:45 UTC (permalink / raw)
  To: Louie; +Cc: Jeff Layton, linux-nfs

On Fri, Feb 24, 2012 at 01:08:54PM -0800, Louie wrote:
> We have multiple clients connecting via SSH tunnels, so all NFS
> traffic is routed through localhost (127.0.0.1) on these open ports.
[...]
> I'm guessing there is no solution and our setup just isn't supported.

That's very strange: those directory names are created as a hash of
the clientid string that the client sends in SETCLIENTID.

Hm, but the Linux client generates that string using its idea of its
own IP address and the server's IP, and maybe those both end up being
the same for all of your clients. Giving each client's mount command a
distinct clientaddr= option might help?

I also wonder whether the server is doing the right thing here. It
could be that it should be returning a clientid-in-use error to most
of the clients, instead of whatever it's currently doing (probably
assuming it has one client that's rebooting continually), but it may
be hard for it to tell the difference.

--b.
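(A sketch of the clientaddr= idea above: clientaddr= is a standard
nfs4 mount option, but the address, port, and export path here are
illustrative.)

    # give each client a distinct clientaddr so the clientid strings
    # the clients construct no longer collide; use each client's real,
    # non-tunnel address, different on every machine
    mount -t nfs4 -o port=3049,clientaddr=192.168.1.101 127.0.0.1:/export /mnt/nfs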
end of thread

Thread overview: 5+ messages
2012-02-23 1:06 v4recovery client id lockup Louie
2012-02-23 16:52 ` Jeff Layton
2012-02-24 21:08 ` Louie
[not found] ` <CAHHaOuasyyQY7p+HCRwyYuJDT0mmmXUpUCivfp9D8nNRUQ9qDg@mail.gmail.com>
2012-02-25 0:32 ` Louie
2012-02-28 19:45 ` J. Bruce Fields