From: "J. Bruce Fields" <bfields@fieldses.org>
To: Mike Grant <mggr@pml.ac.uk>
Cc: linux-nfs@vger.kernel.org
Subject: Re: NFS4 client loop (10025 / BAD_STATEID)
Date: Sun, 8 Apr 2012 17:34:21 -0400
Message-ID: <20120408213421.GA854@fieldses.org>
In-Reply-To: <4F7DD5BF.7070003@pml.ac.uk>

On Thu, Apr 05, 2012 at 06:26:23PM +0100, Mike Grant wrote:
> Hi,
> 
> We've recently had some issues with NFS clients hammering servers to a
> crawl due to a loop condition with NFS4 BAD_STATEID.  After trawling the
> archives, I found something similar:
>  http://www.spinics.net/lists/linux-nfs/msg25012.html
>   ("RE: NFS4ERR_STALE_CLIENTID loop" Oct 2011)
> 
> I believe the outcome was that this was probably a Solaris server bug,
> but the archive search makes it tricky to be sure.
> 
> Our issue is similar albeit with BAD_STATEID.  A couple of tcpdumps can
> be found at http://rsg.pml.ac.uk/staff/mggr/linux-nfs/  The clients are
> a bit outdated (Fedora 14, running 2.6.35.14-106.fc14.x86_64).
> 
> This is also against a Solaris server and, while not reproducible on
> demand, happens about once every 2 days.  There are three machines in
> this loop as I write ;)  Anyway, I'm assuming that's Oracle's (and our)
> problem..
> 
> However, we have seen the same situation against a Linux server (RHEL 6,
> 2.6.32-71.el6.x86_64) about two weeks ago.  It occurred when the server
> was rebooted and 2 workstations (out of 40) that were active at the time
> of the reboot went into the same sort of loop when the server
> reappeared.  Unfortunately the workstations were quickly rebooted
> without gathering info and it's not yet reoccurred.
> 
> We're likely to do another reboot sometime after Easter, so I have my
> fingers crossed we'll get a repeat of the issue.  If so, what info and
> conditions would you ideally want us to try and get, bearing in mind
> this is a core operational fileserver?  (i.e. we'd rather not run
> development kernels on it)

Probably most helpful would be to capture the client/server wire
traffic.

Chances are it's very repetitive, so if we can get a long enough snippet
just to see what's going on, that should suffice.

So something like "tcpdump -s 0 -w tmp.pcap", run for a second or so once
the problem is happening, then send us tmp.pcap.  (Note that tcpdump's
text output is unlikely to be detailed enough.)
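
As a rough sketch of that short capture, the wrapper below builds the
tcpdump command line and stops it after a fixed time.  The function names,
the example hostname, and the optional "host ... and port 2049" filter
(2049 being the standard NFS port) are assumptions added for illustration,
not part of the suggestion above.  Actually running the capture normally
needs root.

```python
import subprocess


def build_capture_command(outfile="tmp.pcap", interface=None, server=None):
    """Return a tcpdump argv that writes full packets to a pcap file."""
    # -s 0 captures whole packets; -w writes raw packets instead of text.
    cmd = ["tcpdump", "-s", "0", "-w", outfile]
    if interface:
        cmd += ["-i", interface]
    if server:
        # Assumption: narrow the capture to NFS traffic with one server.
        cmd += ["host", server, "and", "port", "2049"]
    return cmd


def capture(seconds=1.0, **kwargs):
    """Run tcpdump for roughly `seconds`, then stop it cleanly."""
    cmd = build_capture_command(**kwargs)
    proc = subprocess.Popen(cmd)
    try:
        proc.wait(timeout=seconds)
    except subprocess.TimeoutExpired:
        # SIGTERM lets tcpdump close the pcap file before exiting.
        proc.terminate()
        proc.wait()
    return cmd


if __name__ == "__main__":
    # Only print the command here; call capture() to actually run it.
    print(" ".join(build_capture_command(server="nfs-server.example")))
```

Calling capture(seconds=1.0) on an affected client while the loop is in
progress should leave a tmp.pcap small enough to send.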

Or if you know when you expect the problem to happen, start the capture
before you do the reboot and keep it running until you're sure you've
hit the problem.
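
For a capture that has to span the reboot, disk usage may become a
concern.  One possible approach, sketched below, is to have tcpdump
rotate through a bounded set of files with its -C (file size, in millions
of bytes) and -W (file count) options.  The helper names, file name, and
sizes are assumptions for illustration.  Because the oldest file is
overwritten once the limit is reached, the capture would need to be
stopped soon after the loop is seen.

```python
def build_rotating_capture(prefix="reboot.pcap", file_mb=100, max_files=20):
    """Return a tcpdump argv that rotates through a bounded set of files."""
    if file_mb <= 0 or max_files <= 0:
        raise ValueError("file_mb and max_files must be positive")
    return [
        "tcpdump", "-s", "0", "-w", prefix,
        "-C", str(file_mb),    # start a new file after about file_mb MB
        "-W", str(max_files),  # keep at most max_files files, then reuse
    ]


def max_disk_usage_mb(file_mb=100, max_files=20):
    """Approximate upper bound on disk space used by the capture, in MB."""
    return file_mb * max_files


if __name__ == "__main__":
    print(" ".join(build_rotating_capture()))
    print("approximate disk bound (MB):", max_disk_usage_mb())
```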

--b.
