* Help diagnosing bizarre NFS problem
@ 2005-01-14  3:16 Nathan Ollerenshaw
  2005-01-14  4:34 ` Trond Myklebust
  2005-01-24  9:00 ` Nathan Ollerenshaw
  0 siblings, 2 replies; 8+ messages in thread
From: Nathan Ollerenshaw @ 2005-01-14  3:16 UTC (permalink / raw)
  To: nfs

Hi All,

I need some help diagnosing a bizarre problem that has been affecting
our NFS-based system for the past 4 weeks or so. We've tried getting
help from our NAS vendor (EMC) and we've been trawling Google (and
these list archives), and so far have not found any real indication of
what the problem might be.

If you have any ideas of how to diagnose the problem, please let us
know.

THE SETUP:

We have an EMC NAS box, a Celerra, running DART (so they tell us; we
don't have access to it, so I have no idea what's going on on the
server side).

We have a pair of Foundry Networks 48-port 10/100 switches providing a
switched network for the machines. There are 6 mail servers and 4 web
servers (soon to be 8) that each mount a filesystem from the NAS. We
have a filesystem for mail data (stored in Maildir format) and a
filesystem for web content. The web servers see about 20 million
requests a day, load balanced across all of them with a pair of
Foundries.

Mail is stored in the format of 
/data/mail/xx/xx/domain/user@domain/Maildir/. Web is stored in the 
format of /data/web/xx/xx/domain/ with directories under here for the 
www docroot, cgi-bin, etc. Customers can upload their stuff with FTP, 
and they can put cgis into the cgi-bin if they want. We have a wrapper 
that runs the CGIs as the user's UID/GID in their cgi-bin.

We have a box that runs a custom 'administration UI' that makes all the
changes to DNS files, Apache configs, filesystem etc. to provision
customers' websites, email etc. There is a box that is currently doing a
backup over NFS (because the snapshots were misconfigured by me on
the NAS, ha ha). It takes about 12 hours to read all the data and tar
it up.

All the client machines are recently patched Fedora Core 2 machines
currently running 2.6.9 (we will probably try 2.6.10 in the near
future).

THE PROBLEM:

Regularly, about once a day, at no specific time, each of the web
servers' NFS mounts will 'lock up'. This seems to manifest itself in one
of the deeper directories first, then works its way up to the
actual mount point, at which time the machine is basically unable to
serve any traffic.

When we log into the machine, we see:

Jan  6 09:17:00 www4 kernel: nfs: server nfs not responding, still trying
Jan  6 09:17:00 www4 kernel: nfs: server nfs not responding, still trying
Jan  6 09:20:47 www4 kernel: nfs: server nfs not responding, still trying

Sometimes we will see a message like this:

Dec 27 10:41:51 www2 kernel: nfs_statfs: statfs error = 512

Messages such as this are also common:

nfs_proc_symlink: lock/DGMDNP-042.txt_lock_lock already exists??

Doing a tethereal at the time, we see stuff like this:

  62.303877  10.128.1.11 -> 10.128.2.33  NFS V3 WRITE Reply (Call In 27) Error:ERR_STALE

Now, the "statfs error = 512" seems to be indicating that the NAS is
having a problem. But this isn't the case. I can at least check the
uptime of the EMC from the control panel UI that EMC provide (which I'm
not very happy with, but that's another saga you can ask me about in
private if you're interested). The NAS itself is not rebooting. The RPC
services it provides are not going away either; I have a script running
on another machine that checks the services on the server every second,
and they have never even flinched. So I don't think it's a problem with
the EMC crashing or whatnot.
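
For anyone wanting to run a comparable check, a minimal sketch of a
once-a-second probe might look like the following. This is not the script
referred to above; the function names are made up for illustration, and it
assumes the portmapper (111) and NFS (2049) are reachable over TCP. Note
that a successful TCP connect only shows that the port accepts connections,
not that the service behind it is answering RPC calls.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_services(host, ports=(111, 2049), timeout=1.0):
    """Probe each port once and return {port: reachable}.

    111 is the portmapper and 2049 is NFS; both are assumed defaults here.
    """
    return {port: port_is_open(host, port, timeout) for port in ports}
```

Calling check_services() in a loop with a one-second sleep and logging any
False result with a timestamp would give something to line up against the
"not responding" messages on the clients.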

What IS interesting is that the www servers have this problem about 
once or twice a day, each. The mailservers rarely have this problem. 
The machine that does the backup never seems to have the problem.
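
One way to put numbers on "once or twice a day" and "at no specific time"
is to tally the kernel messages per host and hour. The sketch below assumes
syslog lines in the same shape as the "not responding" samples above; the
function name and the grouping key are illustrative choices, not anything
taken from our setup.

```python
import re
from collections import Counter

# Matches e.g. "Jan  6 09:17:00 www4 kernel: nfs: server nfs not responding, ..."
PATTERN = re.compile(
    r"^(\w{3})\s+(\d+)\s+(\d\d):\d\d:\d\d\s+(\S+)\s+"
    r"kernel: nfs: server \S+ not responding"
)


def stalls_by_host_and_hour(lines):
    """Count 'not responding' messages keyed by (host, month, day, hour)."""
    counts = Counter()
    for line in lines:
        m = PATTERN.match(line)
        if m:
            month, day, hour, host = m.groups()
            counts[(host, month, int(day), int(hour))] += 1
    return counts
```

Feeding it /var/log/messages from each client and comparing the keys across
the web servers, mail servers and backup box would show whether the stalls
cluster around particular hours, for example while the backup is running.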

This issue is really doing my head in. If someone could tell me a way 
of getting more information out of the clients to enable us to see what 
is going on, that'd be awesome.

We've tried using UDP, dropping the packet size, dropping back to the 
latest vanilla 2.4.x kernel, everything we can think of. Nothing seems 
to be helping right now.

Our vendor is of course helping us; they have done tcpdumps on the
server side and done whatever diagnosis they can on their side, and right
now they are saying it's a client-side issue, but they are unable to
provide any hard evidence either way.

Vendor currently says:

> Anyway, at the present moment, we can say, we haven't finished analyzing
> network traces completely, however, we found some strange point in the
> network trace. As per customer, customer uses the file locking over NFS.
> Indeed, we can see NLM protocol in the network trace. Some of  clients keep
> sending NLM_UNLOCK for some of files without sending NLM_LOCK. Generally, if
> using NLM, the sequence is NLM_LOCK call for relevant file is executed from
> NFS client and then NLM_UNLOCK for that file is executed from NFS client.
> Thus, the file locking will be completed. We can't see any corresponds
> between LOCK and UNLOCK. From the beginning of the trace, some of client
> keep sending only NLM_UNLOCK. That is very strange.

There is nothing in the RFC that says that NLM_LOCK and NLM_UNLOCK
counts must be equal. Additionally, the extra NLM_UNLOCK messages
simply indicate that there was a failure to lock or unlock a file.
From what I understand, this technique is used in crash recovery, which
seems to indicate SOMETHING is crashing; if not the EMC, what? How can
I prove it either way?
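
As a rough way of checking the vendor's observation independently, the
LOCK/UNLOCK pairing can be tallied per client from a trace. The sketch
below assumes the NLM calls have already been extracted into
(client, file_handle, operation) tuples; that input format and the function
name are hypothetical. It ignores byte ranges, and any lock taken before
the capture started will show up as an unmatched unlock.

```python
from collections import Counter


def unmatched_unlocks(events):
    """Count, per client, UNLOCK calls with no outstanding LOCK in the trace.

    events: iterable of (client, file_handle, op) in capture order,
    where op is "LOCK" or "UNLOCK".
    """
    held = Counter()     # outstanding locks per (client, file_handle)
    orphans = Counter()  # unmatched unlocks per client
    for client, handle, op in events:
        key = (client, handle)
        if op == "LOCK":
            held[key] += 1
        elif op == "UNLOCK":
            if held[key] > 0:
                held[key] -= 1
            else:
                orphans[client] += 1
    return orphans
```

If the unmatched unlocks come only from the web servers and cluster around
the times of the stalls, that would at least tie the two symptoms together.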

If anyone on this list can suggest anything, obvious or not, it will be
appreciated :)

Regards,

Nathan.

-- 
"It is change, continuing change, inevitable change, that is
  the dominant factor in society today. No sensible decision can
  be made any longer without taking into account not only the
  world as it is, but the world as it will be." - Isaac Asimov



