* Re: why do i get "Stale NFS file handle" for hours?
@ 2004-09-05 3:01 ` Trond Myklebust
0 siblings, 0 replies; 24+ messages in thread
From: Trond Myklebust @ 2004-09-05 3:01 UTC (permalink / raw)
To: Sven Köhler; +Cc: linux-kernel, nfs
On Sat, 04/09/2004 at 22:23, Sven Köhler wrote:
> Sorry? Why is my server broken? I'm using kernel 2.6.8.1 with nfs-utils
> 1.0.6 on my server, and i don't see, what should be broken.
When your server fails to work as per spec, then it is said to be
"broken" no matter what kernel/nfs-utils combination you are using.
The spec is that reboots are not supposed to clobber filehandles.
So, there are 3 possibilities:
1) You are exporting a non-supported filesystem, (e.g. FAT). See the
FAQ on http://nfs.sourceforge.org.
2) A bug in your initscripts is causing the table of exports to be
clobbered. Running "exportfs" in legacy 2.4 mode (without having the
nfsd filesystem mounted on /proc/fs/nfsd) appears to be broken for me at
least...
3) There is some other bug in knfsd that nobody else appears to be
seeing.
Cheers,
Trond
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: why do i get "Stale NFS file handle" for hours?
@ 2004-09-05 8:17 ` Tim Connors
0 siblings, 0 replies; 24+ messages in thread
From: Tim Connors @ 2004-09-05 8:17 UTC (permalink / raw)
To: Trond Myklebust; +Cc: Sven Köhler, linux-kernel, nfs
Trond Myklebust <trond.myklebust@fys.uio.no> said on Sat, 04 Sep 2004 23:01:07 -0400:
> On Sat, 04/09/2004 at 22:23, Sven Köhler wrote:
>
> > Sorry? Why is my server broken? I'm using kernel 2.6.8.1 with nfs-utils
> > 1.0.6 on my server, and i don't see, what should be broken.
>
> When your server fails to work as per spec, then it is said to be
> "broken" no matter what kernel/nfs-utils combination you are using.
> The spec is that reboots are not supposed to clobber filehandles.
>
> So, there are 3 possibilities:
>
> 1) You are exporting a non-supported filesystem, (e.g. FAT). See the
> FAQ on http://nfs.sourceforge.org.
> 2) A bug in your initscripts is causing the table of exports to be
> clobbered. Running "exportfs" in legacy 2.4 mode (without having the
> nfsd filesystem mounted on /proc/fs/nfsd) appears to be broken for me at
> least...
> 3) There is some other bug in knfsd that nobody else appears to be
> seeing.
Have I got 2 cases of 3) for you perhaps?
I can't give you more info, because I am not the admin of the boxes
concerned, but we sometimes lose filehandles for specific files
spontaneously (no server reboots, no nfsd restarts, etc.).
Background:
We have a compute cluster of machines all running SuSE's 2.4.20, or
thereabouts. The nfs servers run Linus's 2.4.26, talking to ext3, on
bigass apple Xserves.
I will update one directory with rsync from one host, and then try, a
little later on, to operate on that directory from another host. Every
now and then, from a single host only, a few files in that tree will
get stale filehandles - an ls of that directory will mostly be fine
apart from those files. They will also be fine from any other machine.
I have found that if I clobber cache with my alloclargemem program,
then those files will come back immediately.
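To pin down which entries are affected on a given client, the directory can be walked and each entry stat'ed, collecting those that fail with ESTALE. This is only a sketch: the function name and the injectable stat_func parameter are made up for illustration and are not part of any tool mentioned in this thread.

```python
import errno
import os


def find_stale_entries(directory, stat_func=os.lstat):
    """Return the names in `directory` whose stat call fails with ESTALE.

    Any other error (permissions, vanished files, ...) is re-raised so it
    is not mistaken for a stale handle.
    """
    stale = []
    for name in sorted(os.listdir(directory)):
        try:
            stat_func(os.path.join(directory, name))
        except OSError as exc:
            if exc.errno == errno.ESTALE:
                stale.append(name)
            else:
                raise
    return stale
```

Running it on the same directory from two clients would show whether the stale entries really are specific to one host.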
The other problem we see regularly is common enough that I have
explicitly coded a workaround into my scripts. When I start 120 jobs
in a short time on 120 nodes, all of which read a bunch of common
files and then write their own private files, a few of them will die
because the read-only files have gone stale. It looks as if the server
just can't cope with a hundred requests (and possibly mounts, since
they are automounted) in the space of half a minute (big files, mind
you), and starts returning bogus data.
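The workaround itself is not shown in this message. One plausible shape for it, assuming the operation re-opens the file by path on every attempt, is a small retry wrapper. The names retry_on_estale, attempts and delay are hypothetical.

```python
import errno
import time


def retry_on_estale(operation, attempts=5, delay=1.0, sleep=time.sleep):
    """Call operation() and return its result.

    If it raises OSError with errno ESTALE, wait `delay` seconds and try
    again, up to `attempts` calls in total. Any other error, or ESTALE on
    the final attempt, is propagated to the caller.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except OSError as exc:
            if exc.errno != errno.ESTALE or attempt == attempts:
                raise
            sleep(delay)


def read_file(path):
    """Example operation: open by path each time, so a retry looks the file up again."""
    with open(path, "rb") as handle:
        return handle.read()
```

A job would call it as retry_on_estale(lambda: read_file(path)), where path is one of the shared input files. This cannot help once the entire mount has gone stale, as described next.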
The entire mount (which is automounted, looks like version 3) will
then remain stale for eternity, with df returning its minus 3
bazillion GB free, until automount is restarted.
Known problems? I googled for '"stale nfs file handle" spontaneous'
with no luck. Or is it likely perhaps that SuSE fscked with the nfs
(and autofs) client side code? The sysadmins look at these failures as
being a fact of life, but perhaps no-one else is seeing this, so it's
worth reporting.
--
TimC -- http://astronomy.swin.edu.au/staff/tconnors/
PUBLIC NOTICE AS REQUIRED BY LAW: Any Use of This Product, in Any Manner
Whatsoever, Will Increase the Amount of Disorder in the Universe. Although No
Liability Is Implied Herein, the Consumer Is Warned That This Process Will
Ultimately Lead to the Heat Death of the Universe.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: why do i get "Stale NFS file handle" for hours?
2004-09-05 8:17 ` Tim Connors
(?)
@ 2004-09-05 8:59 ` Florian Weimer
2004-09-05 9:02 ` Tim Connors
-1 siblings, 1 reply; 24+ messages in thread
From: Florian Weimer @ 2004-09-05 8:59 UTC (permalink / raw)
To: Tim Connors; +Cc: linux-kernel, nfs
* Tim Connors:
> Background:
>
> We have a compute cluster of machines all running SuSE's 2.4.20, or
> thereabouts. The nfs servers run Linus's 2.4.26, talking to ext3, on
> bigass apple Xserves.
Which NFS server software are you using?
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: why do i get "Stale NFS file handle" for hours?
2004-09-05 8:17 ` Tim Connors
(?)
(?)
@ 2004-09-05 16:20 ` Mike Jagdis
2004-09-06 1:32 ` Tim Connors
-1 siblings, 1 reply; 24+ messages in thread
From: Mike Jagdis @ 2004-09-05 16:20 UTC (permalink / raw)
To: Tim Connors; +Cc: Trond Myklebust, Sven Köhler, linux-kernel, nfs
Tim Connors wrote:
> I will update one directory with rsync from one host,
You mean rsync to the server and change files directly on the fs rather
than through an NFS client?
> and then try, a
> little later on, to operate on that directory from another host. Every
> now and then, from a single host only, a few files in that tree will
> get stale filehandles - an ls of that directory will mostly be fine
> apart from those files. They will also be fine from any other machine.
Yeah, that's what happens... Clients that had the file open are liable
to get ESTALE. Stale file handles stick around until unmount. As long as
they're around automount will consider the mount busy and not expire it
(but you can unmount manually or killall -USR1 automountd).
Mike
--
Mike Jagdis Web: http://www.eris-associates.co.uk
Eris Associates Limited Tel: +44 7780 608 368
Reading, England Fax: +44 118 926 6974
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: why do i get "Stale NFS file handle" for hours?
2004-09-05 16:20 ` Mike Jagdis
@ 2004-09-06 1:32 ` Tim Connors
0 siblings, 0 replies; 24+ messages in thread
From: Tim Connors @ 2004-09-06 1:32 UTC (permalink / raw)
To: Mike Jagdis
Cc: Trond Myklebust, Sven Köhler, Linux Kernel Mailing List, nfs
On Sun, 5 Sep 2004, Mike Jagdis wrote:
> Tim Connors wrote:
> > I will update one directory with rsync from one host,
>
> You mean rsync to the server and change files directly on the fs rather
> than through an NFS client?
No - the server is behind a firewall. Just an ordinary nfs client.
> > and then try, a
> > little later on, to operate on that directory from another host. Every
> > now and then, from a single host only, a few files in that tree will
> > get stale filehandles - an ls of that directory will mostly be fine
> > apart from those files. They will also be fine from any other machine.
>
> Yeah, that's what happens... Clients that had the file open are liable
> to get ESTALE. Stale file handles stick around until unmount. As long as
> they're around automount will consider the mount busy and not expire it
> (but you can unmount manually or killall -USR1 automountd).
Yep - that has been the case normally (when the entire mount went stale),
we'd just restart the automounter.
You almost hit the nail on the head with regards to the problem - this
last happened a week ago, and I seem to remember 6 files getting ESTALE.
But only 2 of those would have likely been open on the host where they
went stale, at any time near when they went stale (if they were open at
all), if I am remembering things right. Unless an `ls -lA --color` counts
as "opening" (they weren't symlinks, just normal files, so I doubt it).
What is strange, is I was able to make them "unstale" simply by clearing
cache - allocating a large block of ram, and ensuring buffers and cached
went to something very small. I didn't need to restart the automounter at
all. Then, I could `ls` the directory fine, and could `cat` the files
fine.
I'm afraid that the intermittent nature of this problem is going to make
it hard for me to reproduce though!
I take it the files (normally) go stale because sillyrename only happens
when one host tries to delete a file while that same host has it open. The
server doesn't know that a client still has it open, and if the inode
then happens to be reallocated to something new, the server has no choice
but to say "bugger off"? I thought I had seen in the past that you could
delete a file from one host while another host was still using it, and
the sillyrename would happen and the client would continue to use the
file just fine. That was probably on a Sun, come to think of it. Does its
equivalent of sillyrename keep track of who has what open?
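For reference, sillyrename is done by the client: when a file that is still open on that client is unlinked, the Linux client renames it to a name beginning with ".nfs" and removes it on last close. A rough way to see which files in a directory are currently in that state is to list those names. The helper name below is made up, and matching on the ".nfs" prefix is a heuristic, not an official interface.

```python
import os


def list_sillyrenamed(directory):
    """Return the sorted names in `directory` that start with '.nfs'.

    These are usually files that were unlinked while still open on some
    NFS client; an ordinary file could in principle use the same prefix.
    """
    return sorted(name for name in os.listdir(directory)
                  if name.startswith(".nfs"))
```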
--
TimC -- http://astronomy.swin.edu.au/staff/tconnors/
"Meddle not in the affairs of cats, for they are subtle, and will
piss on your computer." - Jeff Wilder
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: why do i get "Stale NFS file handle" for hours?
2004-09-05 3:01 ` Trond Myklebust
(?)
(?)
@ 2004-09-05 13:18 ` Sven Köhler
2004-09-05 20:10 ` Trond Myklebust
-1 siblings, 1 reply; 24+ messages in thread
From: Sven Köhler @ 2004-09-05 13:18 UTC (permalink / raw)
To: Trond Myklebust; +Cc: linux-kernel, nfs
> So, there are 3 possibilities:
>
> 1) You are exporting a non-supported filesystem, (e.g. FAT). See the
> FAQ on http://nfs.sourceforge.org.
I'm exporting a reiserfs.
> 2) A bug in your initscripts is causing the table of exports to be
> clobbered. Running "exportfs" in legacy 2.4 mode (without having the
> nfsd filesystem mounted on /proc/fs/nfsd) appears to be broken for me at
> least...
So there should be a filesystem mounted on /proc/fs/nfsd? This isn't the
case on my machine. Should the init script do a simple "mount -t nfsd
none /proc/fs/nfsd"? Then this would be a bug in my distribution (Gentoo).
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: why do i get "Stale NFS file handle" for hours?
@ 2004-09-05 20:10 ` Trond Myklebust
0 siblings, 0 replies; 24+ messages in thread
From: Trond Myklebust @ 2004-09-05 20:10 UTC (permalink / raw)
To: Sven Köhler; +Cc: linux-kernel, nfs
On Sun, 05/09/2004 at 09:18, Sven Köhler wrote:
> So there should be a filesystem mounted on /proc/fs/nfsd? This isn't the
> case on my machine. Should the init script do a simple "mount -t nfsd
> none /proc/fs/nfsd"? Then this would be a bug in my distribution (Gentoo).
Yes... See the manpage for "exportfs".
Cheers,
Trond
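Whether the nfsd filesystem is actually mounted can be checked by looking for an entry of type "nfsd" in /proc/mounts. The sketch below takes the file contents as a string so it can be tried on sample text. The function name is made up, and accepting /proc/fs/nfs as a second mount point is an assumption based on the path reported elsewhere in this thread.

```python
def nfsd_fs_mounted(mounts_text,
                    mount_points=("/proc/fs/nfsd", "/proc/fs/nfs")):
    """Return True if `mounts_text` (the contents of /proc/mounts) has an
    entry of filesystem type 'nfsd' on one of `mount_points`.

    Each line of /proc/mounts reads: device mountpoint fstype options ...
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        if (len(fields) >= 3
                and fields[2] == "nfsd"
                and fields[1] in mount_points):
            return True
    return False
```

On a live server this would be called as nfsd_fs_mounted(open("/proc/mounts").read()). If it returns False, the mount command quoted above is the thing to add to the init script.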
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: why do i get "Stale NFS file handle" for hours?
2004-09-05 20:10 ` Trond Myklebust
(?)
@ 2004-09-06 7:47 ` Kalin KOZHUHAROV
-1 siblings, 0 replies; 24+ messages in thread
From: Kalin KOZHUHAROV @ 2004-09-06 7:47 UTC (permalink / raw)
To: linux-kernel; +Cc: nfs
Trond Myklebust wrote:
>>So there should be a filesystem mounted on /proc/fs/nfsd? This isn't the
>>case on my machine. Should the init script do a simple "mount -t nfsd
>>none /proc/fs/nfsd"? Then this would be a bug in my distribution (Gentoo).
Well, I am on Gentoo as well, and it seems that it is mounted on /proc/fs/nfs.
However `cat /proc/fs/nfs/exports` showed only one of 5 exported dirs on my server.
It has been a few weeks since last restart (and NFS restart).
`/etc/init.d/nfs restart` or `exportfs -a` fixed it.
> Yes... See the manpage for "exportfs".
Had a (first) look at it, but I still cannot understand what the difference
between the "-r" and "-a" options is...
The output on my system from both `exportfs -rv` and `exportfs -av` is the same.
Kalin.
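On the -r versus -a question: as far as the exportfs manpage goes, -a exports everything listed in /etc/exports, while -r re-exports and also drops entries that have since been removed from /etc/exports. When nothing has been removed, the two would be expected to print the same list.

The "one of five exports" symptom can be checked mechanically by comparing the paths in /etc/exports with those in the kernel's export list. The sketch below is a simplification: the function names are made up, and paths containing spaces or quotes are not handled.

```python
def export_paths(text):
    """Return the set of exported paths named in an exports-style file.

    Handles '#' comments and backslash line continuations; the path is
    taken to be the first whitespace-separated field of each entry.
    """
    paths = set()
    joined = text.replace("\\\n", " ")  # fold continuation lines
    for line in joined.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            paths.add(line.split()[0])
    return paths


def missing_from_kernel(etc_exports_text, kernel_exports_text):
    """Return the sorted paths present in /etc/exports but absent from the
    kernel export list (e.g. the contents of /proc/fs/nfs/exports)."""
    return sorted(export_paths(etc_exports_text)
                  - export_paths(kernel_exports_text))
```

A non-empty result after the server has been up for a while would match what is described above.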
--
|| ~~~~~~~~~~~~~~~~~~~~~~ ||
( ) http://ThinRope.net/ ( )
|| ______________________ ||
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: why do i get "Stale NFS file handle" for hours?
2004-09-05 3:01 ` Trond Myklebust
` (2 preceding siblings ...)
(?)
@ 2004-09-06 9:57 ` David Woodhouse
2004-09-06 15:59 ` Trond Myklebust
-1 siblings, 1 reply; 24+ messages in thread
From: David Woodhouse @ 2004-09-06 9:57 UTC (permalink / raw)
To: Trond Myklebust; +Cc: Sven Köhler, linux-kernel, nfs
On Sat, 2004-09-04 at 23:01 -0400, Trond Myklebust wrote:
> 2) A bug in your initscripts is causing the table of exports to be
> clobbered. Running "exportfs" in legacy 2.4 mode (without having the
> nfsd filesystem mounted on /proc/fs/nfsd) appears to be broken for me at
> least...
> 3) There is some other bug in knfsd that nobody else appears to be
> seeing.
The fact that we require a persistent table of exports at all, and can't
call back to mountd to authenticate 'new' clients instead of just
telling them to sod off if the kernel doesn't already know about them,
is considered by some to be a bug in knfsd.
--
dwmw2
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: why do i get "Stale NFS file handle" for hours?
@ 2004-09-06 15:59 ` Trond Myklebust
0 siblings, 0 replies; 24+ messages in thread
From: Trond Myklebust @ 2004-09-06 15:59 UTC (permalink / raw)
To: David Woodhouse; +Cc: Sven Köhler, linux-kernel, nfs
On Mon, 06/09/2004 at 05:57, David Woodhouse wrote:
> The fact that we require a persistent table of exports at all, and can't
> call back to mountd to authenticate 'new' clients instead of just
> telling them to sod off if the kernel doesn't already know about them,
> is considered by some to be a bug in knfsd.
That should have been fixed in 2.6.x. If you do mount /proc/fs/nfsd, and
use a recent enough version of mountd, then knfsd can and will work
without any extra help from exportfs.
The one problem I have found with this implementation is that it relies
very heavily on reverse-DNS lookups, so it may give unexpected results
if you have more than one name for your client. I can't see why that
shouldn't be fixable, though...
Cheers,
Trond
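The reverse-DNS pitfall can be illustrated with a small check: look up the name(s) that a client address resolves back to, and see whether any of them is a name used in the export list. How mountd really does its matching is not described in this thread, so this only shows the kind of mismatch that can occur. The function name and the injectable reverse_lookup parameter are made up for illustration.

```python
import socket


def reverse_name_matches(ip, exported_names,
                         reverse_lookup=socket.gethostbyaddr):
    """Return True if the reverse lookup of `ip` yields a name (primary
    or alias) that appears in `exported_names`, compared without regard
    to case.

    `reverse_lookup` must return (hostname, aliaslist, ipaddrlist), as
    socket.gethostbyaddr does.
    """
    hostname, aliases, _addresses = reverse_lookup(ip)
    found = {hostname.lower()} | {alias.lower() for alias in aliases}
    wanted = {name.lower() for name in exported_names}
    return bool(found & wanted)
```

A client that is exported under one name but reverse-resolves to another would return False here, which is the "more than one name" situation described above.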
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [NFS] Re: why do i get "Stale NFS file handle" for hours?
2004-09-05 3:01 ` Trond Myklebust
` (3 preceding siblings ...)
(?)
@ 2004-09-07 0:55 ` Greg Banks
-1 siblings, 0 replies; 24+ messages in thread
From: Greg Banks @ 2004-09-07 0:55 UTC (permalink / raw)
To: Trond Myklebust; +Cc: Sven Köhler, linux-kernel, Linux NFS Mailing List
On Sun, 2004-09-05 at 13:01, Trond Myklebust wrote:
> When your server fails to work as per spec, then it is said to be
> "broken" no matter what kernel/nfs-utils combination you are using.
> The spec is that reboots are not supposed to clobber filehandles.
>
> So, there are 3 possibilities:
>
> 1) You are exporting a non-supported filesystem, (e.g. FAT). See the
> FAQ on http://nfs.sourceforge.org.
> 2) A bug in your initscripts is causing the table of exports to be
> clobbered. Running "exportfs" in legacy 2.4 mode (without having the
> nfsd filesystem mounted on /proc/fs/nfsd) appears to be broken for me at
> least...
> 3) There is some other bug in knfsd that nobody else appears to be
> seeing.
>
4) You're exporting a filesystem mounted on a block device whose
device minor number is dynamic and has changed at the last reboot,
e.g. loopback mounts or SCSI.
5) The mapping of minor numbers is stable but you physically re-arranged
the disks or SCSI cards and changed /etc/fstab correspondingly.
Before you say any more, yes this is broken and fixing it properly is
Hard. This is why we have the fsid export option.
Greg.
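Cases 4) and 5) can be made concrete with a sketch that prints the device numbers behind an exported path and builds an /etc/exports line that pins the filesystem identity with fsid=. The helper names, the example path and the client pattern are all hypothetical; only the fsid=<number> option itself comes from this message.

```python
import os


def device_numbers(path):
    """Return (major, minor) of the device holding `path`.

    If this pair differs between boots, filehandles derived from it
    would be expected to go stale.
    """
    st = os.stat(path)
    return os.major(st.st_dev), os.minor(st.st_dev)


def export_line(path, client, fsid, options=("rw", "sync")):
    """Build one /etc/exports entry with an explicit fsid= option."""
    opts = ",".join(list(options) + ["fsid=%d" % fsid])
    return "%s %s(%s)" % (path, client, opts)


if __name__ == "__main__":
    print(device_numbers("/"))
    print(export_line("/export/data", "*.example.com", 1))
```

Recording device_numbers() for each export before and after a reboot would show whether case 4) or 5) applies. Each exported filesystem would need its own distinct fsid value.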
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.
^ permalink raw reply [flat|nested] 24+ messages in thread