Re: nfs client: Now you see it, now you don't (aka spurious ESTALE errors)

From: Larry Keegan <lk@pfw.demon.co.uk>
To: Jeff Layton <jlayton@redhat.com>
Cc: <linux-nfs@vger.kernel.org>
Subject: Re: nfs client: Now you see it, now you don't (aka spurious ESTALE errors)
Date: Thu, 25 Jul 2013 17:05:26 +0000
Message-ID: <20130725170526.6e54c7db@cs3.al.itld>
In-Reply-To: <20130725101143.6a22cb81@corrin.poochiereds.net>

On Thu, 25 Jul 2013 10:11:43 -0400
Jeff Layton <jlayton@redhat.com> wrote:
> On Thu, 25 Jul 2013 13:45:15 +0000
> Larry Keegan <lk@pfw.demon.co.uk> wrote:
> 
> > Dear Chaps,
> > 
> > I am experiencing some inexplicable NFS behaviour which I would
> > like to run past you.
> > 
> > I have a linux NFS server running kernel 3.10.2 and some clients
> > running the same. The server is actually a pair of identical
> > machines serving up a small number of ext4 filesystems atop drbd.
> > They don't do much apart from serve home directories and deliver
> > mail into them. These have worked just fine for aeons.
> > 
> > The problem I am seeing is that for the past month or so, on and
> > off, one NFS client starts reporting stale NFS file handles on some
> > part of the directory tree exported by the NFS server. During the
> > outage the other parts of the same export remain unaffected. Then,
> > some ten minutes to an hour later they're back to normal. Access to
> > the affected sub-directories remains possible from the server (both
> > directly and via nfs) and from other clients. There do not appear
> > to be any errors on the underlying ext4 filesystems.
> > 
> > Each NFS client seems to get the heebie-jeebies over some directory
> > or other pretty much independently. The problem affects all of the
> > filesystems exported by the NFS server, but clearly I notice it
> > first in home directories, and in particular in my dot
> > subdirectories for things like my mail client and browser. I'd say
> > something's up the spout about 20% of the time.
> > 
> > The server and clients are using nfs4, although for a while I tried
> > nfs3 without any appreciable difference. I do not have
> > CONFIG_FSCACHE set.
> > 
> > I wonder if anyone could tell me if they have ever come across this
> > before, or what debugging settings might help me diagnose the
> > problem?
> > 
> > Yours,
> > 
> > Larry
> 
> 
> Were these machines running older kernels before this started
> happening? What kernel did you upgrade from if so?
> 

Dear Jeff,

The full story is this:

I had a pair of boxes running kernel 3.4.3 with the aforementioned drbd
pacemaker malarkey and some clients running the same.

Then I upgraded the machines by moving from plain old dos partitions to
gpt. This necessitated a complete reload of everything, but there were
no software changes. I can be sure that nothing else was changed
because I build my entire operating system in one ginormous makefile.

Rapidly afterwards I switched the motherboards for ones with more PCI
slots. There were no software changes except those relating to MAC
addresses.

Next I moved from 100Mbit to gigabit hubs. Then the problems started.

The symptoms were much as I've described but I didn't see them that
way. Instead I assumed the entire filesystem had gone to pot and tried
to unmount it from the client. Fatal mistake. umount hung. I was left
with an entry in /proc/mounts showing the affected mountpoints as
"/home/larry\040(deleted)" for example. It was impossible to get rid of
this and I had to reboot the box. Unfortunately the problem
snowballed and affected all my NFS clients and the file servers, so
they had to be bounced too.
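
For anyone wanting to spot the "(deleted)" mount entries described above before resorting to a reboot, here is a minimal sketch that scans mount-table text for them. It relies on /proc/mounts escaping a space in a mountpoint as \040; the function names and the report() helper are my own, hypothetical ones, not part of any NFS tooling.

```python
import re

# /proc/mounts writes awkward characters as three-digit octal escapes,
# e.g. a space becomes \040.
_OCTAL_ESCAPE = re.compile(r"\\([0-7]{3})")


def decode_mount_field(field):
    """Turn octal escapes such as \\040 back into the characters they stand for."""
    return _OCTAL_ESCAPE.sub(lambda m: chr(int(m.group(1), 8)), field)


def deleted_mountpoints(mounts_text):
    """Return (device, mountpoint, fstype) for entries whose mountpoint
    ends in ' (deleted)', as seen in the hung-umount case."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # not a well-formed mount entry
        mountpoint = decode_mount_field(fields[1])
        if mountpoint.endswith(" (deleted)"):
            found.append((fields[0], mountpoint, fields[2]))
    return found


def report(path="/proc/mounts"):
    """Print any '(deleted)' mount entries found in the given mount table."""
    with open(path) as f:
        for device, mountpoint, fstype in deleted_mountpoints(f.read()):
            print(f"{fstype}\t{device}\t{mountpoint}")
```

Calling report() on a client prints one line per affected mount; it only reads the table and does not attempt to unmount anything.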

Anyway, to cut a long story short, this problem seemed to me to be a
file server problem so I replaced network cards, swapped hubs,
checked filesystems, you name it, but I never experienced any actual
network connectivity problems, only NFS problems. As I had a kernel 3.4.4
upgrade scheduled anyway, I upgraded all the hosts. No change.

Then I upgraded everything to kernel 3.4.51. No change.

Then I tried mounting using NFS version 3. It could be argued the
frequency of gyp reduced, but the substance remained.

Then I bit the bullet and tried kernel 3.10. No change. I noticed that
NFS_V4_1 was on so I turned it off and re-tested. No change. Then
I tried 3.10.1 and 3.10.2. No change.

I've played with the kernel options to remove FSCACHE, not that I was
using it, and that's about it.

Are there any (client or server) kernel options which I should know
about?

> What might be helpful is to do some network captures when the problem
> occurs. What we want to know is whether the ESTALE errors are coming
> from the server, or if the client is generating them. That'll narrow
> down where we need to look for problems.

As it was giving me gyp while I was typing this, I tried to capture some
NFS traffic. Unfortunately claws-mail started a mail box check in the
middle of this and the problem disappeared! Normally it's claws which
sets the problem off. It'll come along again soon enough and I'll send a
trace.
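
Since the problem comes and goes, one way to know when to start a capture is to poll the affected directories and log the moment a stat() starts or stops failing with ESTALE. The following is only a rough sketch under that assumption: the watch() and is_stale() names, the polling interval and the example paths are all hypothetical, and a stat() answered from the client's attribute cache may not notice the problem straight away.

```python
import errno
import os
import time
from datetime import datetime, timezone


def is_stale(path):
    """Return True if stat() on path fails with ESTALE, False otherwise."""
    try:
        os.stat(path)
    except OSError as exc:
        return exc.errno == errno.ESTALE
    return False


def watch(paths, interval=5.0, rounds=None, probe=is_stale, log=print):
    """Poll paths and log each transition into or out of the stale state.

    rounds=None polls forever; probe and log can be swapped out for testing.
    Returns the last observed state of every path.
    """
    state = {p: False for p in paths}
    done = 0
    while rounds is None or done < rounds:
        for p in paths:
            stale = probe(p)
            if stale != state[p]:
                stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
                log(f"{stamp} {p} {'ESTALE' if stale else 'ok'}")
                state[p] = stale
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)
    return state
```

For example, watch(["/home/larry/.claws-mail"]) (a made-up path) would print a UTC timestamp each time that directory goes stale or recovers, which can then be lined up against the network trace.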

Thank you for your help.

Yours,

Larry.
