* 3.14.27 client hang on specific file
From: Brian De Wolf @ 2014-12-19 3:24 UTC
To: linux-nfs
Hello,
After updating our kernel from 3.4.x to 3.14.27 (along with nfs-utils
1.2.9), we've had a strange issue with our sec=krb5p NFSv4 mounts.
My initial light testing went fine, but sometimes, on any given host, a
specific file will no longer be accessible. Any attempt to access it
causes the process to go into uninterruptible sleep.
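For anyone trying to confirm the same symptom, here is a minimal Python sketch that lists processes currently in uninterruptible sleep (state "D") by reading /proc/<pid>/stat. The function names parse_state and uninterruptible_pids are made up for this illustration, and the proc_root parameter exists only so the scan can be pointed at a test directory. It does not diagnose the NFS problem; it only shows which processes are stuck.

```python
import os
from typing import List


def parse_state(stat_line: str) -> str:
    """Return the one-letter process state from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so take everything after the LAST ')'.
    rest = stat_line[stat_line.rindex(")") + 1:]
    return rest.split()[0]


def uninterruptible_pids(proc_root: str = "/proc") -> List[int]:
    """Return the sorted PIDs whose state is 'D' (uninterruptible sleep)."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        stat_path = os.path.join(proc_root, entry, "stat")
        try:
            with open(stat_path, errors="replace") as fh:
                if parse_state(fh.read()) == "D":
                    pids.append(int(entry))
        except OSError:
            continue  # process exited while we were scanning
    return sorted(pids)
```

Calling uninterruptible_pids() on an affected client should include the PID of the hung dd or stat. Processes doing ordinary disk I/O also pass through state "D" briefly, so a PID only matters if it shows up across repeated scans.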
Reproducing this on another host was fairly quick by dd'ing 30MB
from /dev/zero into a file repeatedly. Eventually the dd hangs instead
of completing. Once broken, the simplest test I could think of was
"stat testfile". tcpdump shows no traffic when I run it. Turning
rpcdebug all the way up produces:
    kernel: RPC: looking up Generic cred
    kernel: NFS: permission(0:28/3), mask=0x1, res=0
    kernel: NFS: nfs_lookup_revalidate(/testfile) is valid
Around the time that it breaks, it also prints
    kernel: nfs: server servername not responding, still trying
several times, but I couldn't find a way to get it to print it again.
It doesn't look like it follows up with more timeouts or an "OK", so
that seems pretty odd.
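The reproduction described above (writing 30MB of zeros to the same file over and over until a write never finishes) can be sketched in Python roughly as follows. This is not the exact dd invocation that was used: the 1 MiB block size, the fsync after each pass, the iteration count, and the names write_zeros and repeat_writes are all assumptions made for illustration.

```python
import os
import time

CHUNK = 1024 * 1024  # assumed block size of 1 MiB


def write_zeros(path, size_mb=30):
    """Write size_mb MiB of zeros to path and return the resulting file size."""
    block = bytes(CHUNK)  # a buffer of zero bytes, standing in for /dev/zero
    with open(path, "wb") as fh:
        for _ in range(size_mb):
            fh.write(block)
        fh.flush()
        # Assumption: force the data out to the server on every pass.
        # Plain dd does not do this unless asked to.
        os.fsync(fh.fileno())
    return os.path.getsize(path)


def repeat_writes(path, iterations=100, size_mb=30):
    """Rewrite the file repeatedly and return the duration of each pass."""
    durations = []
    for i in range(iterations):
        start = time.monotonic()
        write_zeros(path, size_mb)
        elapsed = time.monotonic() - start
        durations.append(elapsed)
        # Flush so the last completed pass is visible if the next one hangs.
        print(f"pass {i}: {elapsed:.2f}s", flush=True)
    return durations
```

Point repeat_writes at a file on the affected krb5p mount. If the hang occurs, the call simply never returns, and the last "pass" line printed shows how many writes completed before the stall.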
Anyone have any ideas? I'm happy to provide more debug info.
Thanks,
Brian
* Re: 3.14.27 client hang on specific file
From: Brian De Wolf @ 2014-12-23 23:28 UTC
To: linux-nfs@vger.kernel.org
On Thu, 18 Dec 2014 19:24:21 -0800
Brian De Wolf <bldewolf@cpp.edu> wrote:
> After updating our kernel from 3.4.x to 3.14.27 (along with nfs-utils
> 1.2.9), we've had a strange issue with our sec=krb5p NFSv4 mounts.
> My initial light testing went fine, but sometimes, on any given host,
> a specific file will no longer be accessible. Any attempt to access
> it causes the process to go into uninterruptible sleep.
In case anyone else sees a similar issue, this is what I found.
3.4, 3.12 and 3.14 see stalls when accessing a Solaris 10 server. 3.4
and 3.12 recover by reconnecting after a 60 second timeout, but 3.14
hangs forever. It seems like the 3.14 timeout is broken, which is
pretty painful. It can be recovered by resetting the TCP connection
(yay iptables), but will eventually stall again.
3.16 doesn't stall, so I assume something was fixed between 3.14 and
3.16 to handle whatever problems occur with the Solaris 10 server.
Neither 3.12 nor 3.14 stalls when accessing an OmniOS server, which
makes me think the bug is on the Solaris side but triggers poor error
handling on the Linux side (which appears to be fixed as of 3.16).
So the end result is pretty obvious: it's time to upgrade.