From: Joschi Brauchle <joschi.brauchle@tum.de>
To: "linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>
Subject: NFS Client looping with STALE_STATE_ID error - 100% load on one CPU core
Date: Thu, 23 Apr 2015 12:16:10 +0200 [thread overview]
Message-ID: <5538C66A.1020302@tum.de> (raw)
[-- Attachment #1: Type: text/plain, Size: 2497 bytes --]
Hello everyone,
I am posting to this list as suggested by Neil F Brown in this OpenSUSE
bugzilla: https://bugzilla.novell.com/show_bug.cgi?id=909634
We are mounting user homes and other shares using 'sec=krb' and NFSv4.
With kernel 3.16.7+ (includes latest NFS patches, please see BZ link for
details) we are seeing 100% CPU core usage by a kernel process on some
machine connected to a 'NetApp FAS 3140 with ONTAP 8.1.4P1 7-mode'
NFSv3/4 Server. It looks like the problem is triggered when the NFS
server is unavailable for a short period of time or was rebooted.
Neil F Brown has come to the following conclusion so far:
------------- QUOTE -------------
What I see happening here is that a WRITE or LOCK request gets
NFS4ERR_STALE_STATEID, and then the client sends a "RENEW" request.
If the stateid is really stale, that should fail, but it doesn't. So the
client assumes everything is OK and keeps sending requests .... which
fail with NFS4ERR_STALE_STATEID.
STALE_STATEID means:
A stateid generated by an earlier server instance was used.
which means that the server must have restarted since that stateid was
issued.
So there seem to be two possible explanations:
1/ The server is lying. The stateid isn't really 'stale'. Maybe it is
'bad'.
2/ The client has previously handled a STALE_STATEID error and created a
new "client" handle, but hasn't refreshed all of the stateids that it
was holding. So it is trying to use old stateid even though it has
established a new client.
The client code looks OK, though it is sufficiently complex that I
cannot be certain. There are no patches in more recent versions.
------------- END -------------
We have some RPC/NFS debug logs available here:
https://bugzilla.novell.com/attachment.cgi?id=631962
And a wireshark trace here:
https://bugzilla.novell.com/attachment.cgi?id=632104
What I cannot do:
- Trigger or reproduce the problem reliably. It comes up every once in a
while. Around 2-3 times a month on a handful of ~50 machines.
- Provide logs from with the latest kernel (because I cannot trigger the
problem by any means. Updating all ~50 machines is not an option)
What I can do:
- Provide logs from or possibly even access to the machine in the
'broken' state for the next couple of days. I will have to reboot it at
some time to take the load of the NFS server.
I would be grateful for any help or comments.
-J Brauchle
[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4917 bytes --]
next reply other threads:[~2015-04-23 10:22 UTC|newest]
Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-04-23 10:16 Joschi Brauchle [this message]
2015-05-05 17:24 ` NFS Client looping with STALE_STATE_ID error - 100% load on one CPU core Benjamin Coddington
2015-05-05 18:43 ` Joschi Brauchle
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=5538C66A.1020302@tum.de \
--to=joschi.brauchle@tum.de \
--cc=linux-nfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.