Garrick Staples wrote:
> On Tue, Jan 25, 2005 at 10:06:27PM -0800, Trond Myklebust alleged:
> 
>>On Tue, 25.01.2005 at 09:39 (-0800), Garrick Staples wrote:
>>
>>>Hi all,
>>>   I have lots of storage in a large Solaris samfs environment that is NFS
>>>shared to a large number of Solaris and RHEL3 clients.  Under some conditions,
>>>Linux apps have been getting stale filehandles during the normal course of
>>>their activity.  Various file handling syscalls like read() or open() might
>>>fail.  Lots of rename and setattr calls seem to trigger the problem.  
>>>'ci' and 'cvs commit' are particularly good at this.
>>
>>ESTALE is usually a sign that someone is deleting a file on the server
>>that is in use by the client. It is a sign that you are doing something
>>that violates the caching rules of NFS.
> 
> 
> Nothing of the kind is happening here.  I've tested this a thousand times over
> the last few days trying to find a solution.  In this case, Sun's samfs
> filesystem is definitely at fault and doing the wrong thing.  Backline
> engineers at Sun confirm this and are working on a fix.  
> 
> The reason for _this_ email isn't because of the ESTALEs, it's regarding the
> handling of the ESTALEs.  Right now I need the Solaris client behaviour to
> deal with this particular buggy server.
> 
> Incidentally, 2.6.10 never has a problem.  Its behaviour never creates ESTALEs in
> the first place.
> 
>  
> 
>>>It seems that the Solaris clients never report any such errors, only the Linux
>>>clients.  However, watching 'snoop' on the Solaris NFS server, I see that it IS
>>>returning stale file handles to both OSes, but Solaris clients seem to retry
>>>the request several times, while the Linux clients immediately pass the error up
>>>to the application.
>>>
>>>Is there some condition that the 2.4 kernel is handling incorrectly?
>>
>>I do not believe that Solaris redrives ESTALE on read, but they may do
>>it on open(). Linux does not redrive either case. See the many
>>discussions in the NFS list archives for why.
> 
> 
> Did you look at the 'snoop' bits in the previous email?  During that time, the
> process on the Solaris client is hanging in a write() call.
> 
> I'd be very happy to see any patches lying around that might implement this
> behaviour.  It would get me through the short term until Sun fixes this bug in
> samfs.
> 

The easiest, cleanest thing to do is retry the operation in the sys_* 
function.  Red Hat's latest kernel does this to keep ESTALE from being 
passed up to the application.  I've attached the patch for your reference.  
Mind you, for the reasons you'll find in the list archives that Trond 
referenced, you'll not see this get into upstream kernels, so caveat emptor.
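
For anyone who just wants the shape of the idea: the following is only a
rough user-space sketch of "retry the operation when it comes back with
ESTALE", not the Red Hat kernel patch itself.  The names retry_on_estale,
estale_op_fn, flaky_state and flaky_op are all made up for illustration, and
the bounded retry count is an assumption on my part.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical callback type: returns >= 0 on success, or -1 with errno set. */
typedef long (*estale_op_fn)(void *arg);

/*
 * Hypothetical helper: call op(arg) again while it fails with ESTALE,
 * giving up after max_tries attempts.  Any other error, or success, is
 * returned to the caller immediately.  If max_tries <= 0, op is never
 * called and -1 is returned.
 */
static long retry_on_estale(estale_op_fn op, void *arg, int max_tries)
{
    long ret = -1;

    for (int attempt = 0; attempt < max_tries; attempt++) {
        errno = 0;
        ret = op(arg);
        if (ret >= 0 || errno != ESTALE)
            break;      /* success, or an error we do not retry */
    }
    return ret;
}

/* Stand-in for a real operation: fails with ESTALE a fixed number of times. */
struct flaky_state {
    int failures_left;  /* how many more calls should fail with ESTALE */
    int calls;          /* total number of calls made so far */
};

static long flaky_op(void *arg)
{
    struct flaky_state *st = arg;

    st->calls++;
    if (st->failures_left > 0) {
        st->failures_left--;
        errno = ESTALE;
        return -1;
    }
    return 0;
}
```

In real use, op would wrap something like open() or read() on the NFS mount.
Note that a blind retry only hides the error when the server starts answering
correctly again, which is part of why this approach is not accepted upstream.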

HTH
Neil


-- 
/***************************************************
  *Neil Horman
  *Software Engineer
  *Red Hat, Inc.
  *nhorman@redhat.com
  *gpg keyid: 1024D / 0x92A74FA1
  *http://pgp.mit.edu
  ***************************************************/