public inbox for linux-nfs@vger.kernel.org
* NFS stops responding
@ 2010-04-14 21:06 Michael O'Donnell
  2010-04-15 18:04 ` J. Bruce Fields
  2010-04-17  0:17 ` Dennis Nezic
  0 siblings, 2 replies; 12+ messages in thread
From: Michael O'Donnell @ 2010-04-14 21:06 UTC
  To: linux-nfs

I've run out of clues (EBRAINTOOSMALL) trying to solve an NFS puzzle
at a remote customer site and I'm hoping to get unstuck.

Symptoms: after approx 1 hour of apparently normal behavior, operations
like 'df -k' or 'ls -l' hang for minutes at a time and then fail with
I/O errors on any of three machines when such operations refer to NFS
mounted directories.

The 3 machines have NFS relationships thus:

   A mounts approx 6             directories from B (A->B)
   B mounts approx 6 (different) directories from A (B->A)
   C mounts approx 6 directories from A (C->A) (same dirs as in B->A)
   C mounts approx 6 directories from B (C->B) (same dirs as in A->B)

Weirdly, when the failure occurs, doing this on all 3 machines:

    umount -f -l -a -t nfs

...followed by this:

    mount -a -t nfs

...on all 3 gets things unstuck for another hour.  (?!?!)

All three systems (HP xw8600 workstations) started life running
bit-for-bit identical system images (based on x86_64 CentOS5.4)
and only differ in which of our apps and configs are loaded.

Kernel is 2.6.18-92.1.17.el5.centos.plus

All 3 systems were previously running an old RHEL3 distribution on the
same hardware with no problems.

Each machine has only two interfaces defined: 'lo' and 'eth0' with the
latter being a wired gigE.

All MTUs are the standard 1500; nothing like jumbo packets in use.

Each machine has a statically assigned address - no DHCP in play.

All systems are connected via a common Dell 2608 PowerConnect switch
that's believed (but not conclusively proven) to be functioning properly.

I've tried specifying both UDP and TCP in the fstab lines.

We're using the default NFSv3.

I've disabled selinux.

The output of 'iptables -L' for all rules in all (filter,nat,mangle,raw)
chains on all machines shows as '(policy ACCEPT)'.

Each machine always shows the same 3 routes when queried via 'route -n'.

The ARP caches show nothing unexpected on any machine.

These commands:

    service nfs status ; service portmap status

...indicate nominal conditions (all expected daemons reported running)
when things are working but also when things are b0rken.

There wasn't anything very informative in /var/log/messages with the
default debug levels but messages are now accumulating there at firehose
rates because I enabled debug for everything, thus:

    for m in rpc nfs nfsd nlm; do rpcdebug -m $m -s all; done
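
For completeness, a minimal sketch of the matching cleanup: rpcdebug's -c
option clears the flags that -s set. The rpcdebug_cmds helper below is a
hypothetical name used only for illustration; it prints the commands rather
than running them, so they can be reviewed before being handed to a root shell.

```shell
# rpcdebug_cmds: hypothetical helper that prints (does not run) one rpcdebug
# command per module. Pass -s to set all debug flags or -c to clear them.
rpcdebug_cmds() {
    action=$1
    case "$action" in
        -s|-c) ;;
        *) echo "usage: rpcdebug_cmds -s|-c" >&2; return 1 ;;
    esac
    # Same module list as the loop above.
    for m in rpc nfs nfsd nlm; do
        echo "rpcdebug -m $m $action all"
    done
}
```

Something like 'rpcdebug_cmds -c | sh', run as root, should then turn the
firehose back off.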

After machine A exhibited the problem I *think* I see evidence in the
/var/log/messages that the NFS client code believes it never got a
response from the server (B) to some NFS request, so it retransmits the
request and (I think) it then concludes that the retransmitted request
also went unanswered so the operation is errored out.

I'm capturing dumps of Enet traffic on the client and server boxes at
the remote customer site thus:

    dumpcap -i eth0 -w /tmp/`hostname`.pcap

...and then copying the dumps back to HQ where I feed them to Wireshark.
I am not (yet?) rigged up so I can sniff traffic from an objective
third party.

When I display the client traffic log file with Wireshark, it (apparently)
confirms that the client did indeed wait a while and then (apparently)
retransmitted the NFS request.  The weird thing is that Wireshark analysis
of corresponding traffic on the server shows the first request coming in
and being replied to immediately, then we later see the retransmitted
request arrive and it, too, is promptly processed and the response goes
out immediately.  So, if I'm reading these tea leaves properly it's as if
that client lost the ability to recognize the reply to that request.  [?!]

But, then, how could it be that all 3 machines seem to get into this state
at more or less the same time?  And why would unmounting and remounting
all NFS filesystems then "fix" it?   Aaaiiieeee!!!

  [ Unfortunately, this problem is only occurring at the one
    customer site and can't be reproduced in-house, so unless
    I can find a way to first sanitize the logs I may not be
    permitted (lucky you!) to publish them here...       >-/  ]

A Wireshark rendering of the relevant traffic, captured while
'ls -l mountPoint' on the client hangs and then returns with 'I/O Error':

   On CLIENT A:
   #     Time       SRC DST PROT INFO
   1031  1.989127   A   B   NFS  V3   GETATTR Call, FH:0x70ab15aa
   4565  10.121595  B   A   NFS  V3   GETATTR Call, FH:0x00091508
   4567  10.124981  A   B   NFS  V3   FSSTAT  Call, FH:0x17a976a8
   4587  10.205087  A   B   NFS  V3   GETATTR Call, FH:0xf2c997c8
   29395 61.989380  A   B   NFS  V3   GETATTR Call, FH:0x70ab15aa [retransmission of #1031]
   66805 130.119722 B   A   NFS  V3   GETATTR Call, FH:0x0089db89
   66814 130.124815 A   B   NFS  V3   FSSTAT  Call, FH:0x18a979a8
   97138 181.989898 A   B   NFS  V3   GETATTR Call, FH:0x70ab15aa

   On SERVER B:
   #     Time       SRC DST PROT INFO
   677   1.342486   A   B   NFS  V3   GETATTR Call, FH:0x70ab15aa
   4045  9.474848   B   A   NFS  V3   GETATTR Call, FH:0x00091508
   4047  9.478325   A   B   NFS  V3   FSSTAT  Call, FH:0x17a976a8
   4076  9.558433   A   B   NFS  V3   GETATTR Call, FH:0xf2c997c8
   28625 61.342630  A   B   NFS  V3   GETATTR Call, FH:0x70ab15aa [retransmission of #677]
   61257 129.472779 B   A   NFS  V3   GETATTR Call, FH:0x0089db89
   61268 129.477965 A   B   NFS  V3   FSSTAT  Call, FH:0x18a979a8
   87631 181.342989 A   B   NFS  V3   GETATTR Call, FH:0x70ab15aa
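
A rough sketch of how the repeated calls in a rendering like the one above
could be picked out mechanically. The repeat_gaps name is hypothetical, and
the column layout (frame number, time, SRC, DST, PROT, version, operation,
"Call,", FH) is assumed from the tables above; a real Wireshark export may
be laid out differently.

```shell
# repeat_gaps: read a plain-text packet summary on stdin and print the gap,
# in seconds, between successive calls that share the same source,
# destination, operation and file handle.
repeat_gaps() {
    awk '
        # Only rows that look like NFS calls; headers and captions are skipped.
        $5 == "NFS" && $8 == "Call," {
            key = $3 "->" $4 " " $7 " " $9
            if (key in last)
                printf "%s repeated after %.6f s\n", key, $2 - last[key]
            last[key] = $2
        }
    '
}
```

Fed the CLIENT A rows, this should flag only the A->B GETATTR on
FH:0x70ab15aa, recurring after about 60 seconds and then again after about
120 seconds.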

I don't really trust my interpretation of what Wireshark is showing
me but, if I'm correct, the problem is not that we stop seeing return
traffic from the server, it's more that the client stops making sane
decisions in response when it arrives.  Maybe the packets aren't getting
all the way back up the stack to be processed by the client code?
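
One cheap cross-check on that theory, offered as a sketch and not something
that has been run at the customer site: the client's RPC counters in
/proc/net/rpc/nfs (the same numbers 'nfsstat -rc' reports) include a
retransmission count, so sampling it before and after a hang would show
whether the client really is resending. The rpc_retrans name is hypothetical,
and the optional argument exists only so the parser can be pointed at a saved
copy of the file.

```shell
# rpc_retrans: print the client RPC call and retransmission counters.
# The "rpc" line of /proc/net/rpc/nfs holds: calls, retransmissions,
# authentication refreshes.
rpc_retrans() {
    awk '$1 == "rpc" { printf "calls=%s retrans=%s\n", $2, $3 }' \
        "${1:-/proc/net/rpc/nfs}"
}
```

A retrans value that climbs while the server-side capture shows prompt
replies would fit the picture of replies going missing on the client.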

All other network plumbing appears to be in working order while the
problem is occurring - I can connect from one system to another at will
via SSH, rsync, HTTP, ping, etc.

I'd love to blame the switch, and I just acquired a brand new one to
use as an experimental replacement for the one currently deployed.
I'll be ecstatic if that fixes things, though I'm not optimistic.

I'm assuming this mess is somehow due either to a site-specific botch
in something like a config file or else maybe that switch.  We have
a number of other customers with identical rigs (same software on the
same workstations) that work fine, so (hoping!)  it seems unlikely that
there's an inherent flaw in the SW or HW...

Analysis is awkward because the customers in question are trying to make
what use they can of the machines even as these problems are occurring
around them, so reboots and other dramatic acts have to be scheduled
well in advance.

I know of no reasons in principle why two machines can't simultaneously
act as NFS clients and NFS servers - are there any?  AFAIK the two
subsystems are separate and have no direct dependencies or interactions;
does anybody know otherwise?  Yes, I'm aware that some systems can be
misconfigured such that cross-mounting causes problems at boot-time as
they each wait for the other's NFS server to start, but this ain't that...

Any help or clues gratefully accepted...

   --M




* NFS stops responding
@ 2004-09-30 13:39 Douglas Furlong
  2004-09-30 16:06 ` Jason Holmes
  0 siblings, 1 reply; 12+ messages in thread
From: Douglas Furlong @ 2004-09-30 13:39 UTC
  To: nfs



Good morning all.

Considering the exceedingly fast and speedy response I got yesterday
with regards to my problem accessing edirectory.co.uk I thought I would
try my luck with an NFS problem.

All our unix systems at work have their home directory mounted via NFS
to allow hot seating (not that they ever use it!).

I have just recently upgraded to Fedora Core 2, running the most recent
kernel.

All the workstations are running Fedora Core 2, with the second from
last kernel (due to CIFS/SMB problems in the latest one).

Unfortunately there are two users whose connection to the NFS server is
dropped and does not seem to want to reconnect. To date I have:

1) Replaced both of their PCs
2) Replaced switch
3) will replace network cables tomorrow
4) Tried numerous versions of the kernel, including the testing
kernel from rawhide.
5) Tried variations in the timeo=x value to see if that will help.

These lockups vary in time between 30 minutes and 5 hours. Network
connections are not affected by a lockup; I am able to ssh on to the
box (that's how I collected the tcpdump data).

I also have two Windows PCs on this switch and things appear to be
fine.

I have 7 or 8 other systems running linux on the network and NFS
communication is not affected.

I have increased the number of nfsd server threads on the NFS server from
8 to 16. I did this by editing /etc/init.d/nfs (don't think this is of
any help).
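
A small sketch for confirming that such a change took effect, assuming the
server exposes the usual /proc/net/rpc/nfsd statistics file, whose "th" line
starts with the number of nfsd threads. The nfsd_threads name is hypothetical,
and the optional argument only lets the parser read a saved copy. On Red Hat
style systems the init script typically also honours an RPCNFSDCOUNT setting
in /etc/sysconfig/nfs, which may avoid editing the script itself; that too is
an assumption worth checking against the script on the machine in question.

```shell
# nfsd_threads: print the nfsd thread count, taken from the first value
# on the "th" line of the nfsd statistics file.
nfsd_threads() {
    awk '$1 == "th" { print $2 }' "${1:-/proc/net/rpc/nfsd}"
}
```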

I took some tcpdump info on both the client and the server to try and
see if I can work out what is going on. Initially it is not providing me
with much information (but loads of data).

I have attached two files, one from the client and one from the server.
The main reason for attaching them is the length of the data. I had wanted to
attach them as plain text to simplify access, but at 100k it's a bit too
large.
I didn't want to cut them down too much just in case I removed some
pertinent information :(
-- 
Douglas Furlong
Systems Administrator
Firebox.com
T: 0870 420 4475        F: 0870 220 2178

[-- Attachment #1.2: tcpdump_output_nfs_client_server.txt.tar.gz --]
[-- Type: application/x-compressed-tar, Size: 14515 bytes --]



end of thread, other threads:[~2010-04-28 15:52 UTC | newest]

Thread overview: 12+ messages
2010-04-14 21:06 NFS stops responding Michael O'Donnell
2010-04-15 18:04 ` J. Bruce Fields
2010-04-17  0:17 ` Dennis Nezic
     [not found]   ` <20100416201700.215b0bea.dennisn-YN8wfZw00oOZ9vWoFJJngh2eb7JE58TQ@public.gmane.org>
2010-04-19 14:34     ` Michael O'Donnell
     [not found]       ` <4BCC69E4.70405-kx56TfycDUc@public.gmane.org>
2010-04-22 15:19         ` Dennis Nezic
2010-04-28 15:51           ` Dennis Nezic
  -- strict thread matches above, loose matches on Subject: below --
2004-09-30 13:39 Douglas Furlong
2004-09-30 16:06 ` Jason Holmes
2004-09-30 19:10   ` Jason Holmes
2004-10-01 15:40     ` Jason Holmes
2004-10-07 10:56       ` Douglas Furlong
2004-10-13 15:07         ` Jason Holmes
