Stale mounts - very nasty

Linux NFS development

* Stale mounts - very nasty
@ 2003-05-14 14:04 Heflin, Roger A.
  2003-05-14 15:29 ` Trond Myklebust
From: Heflin, Roger A. @ 2003-05-14 14:04 UTC
  To: nfs

Guys,

I have one of these, where it went into this state in the middle of a
copy, i.e. part of a file got copied and then the NFS partition went into
this broken state, all with the server doing nothing bad.

And I have now confirmed that it does it against Sun Solaris 8 servers,
which makes it look like a client problem, and not a server problem.

Any ideas on what is going on?  I cannot see any configuration mistake
that should cause this sort of behaviour under any conditions.

                              Roger

> "Heflin, Roger A." wrote:
> >
> > Basic problem:
> > stale NFS file handles.
> >
> > Conclusion:
> >
> > It looks like when the automounter umounts and the server does not
> > register a "rpc.mountd: authenticated unmount request from", we get
> > into this situation, at least on unused file systems.  I am not exactly
> > sure what is happening on the used filesystems.  This is a high-traffic
> > setup with lots of mounts and umounts and many, many nodes, so given the
> > high volume of mounts/umounts I would expect some requests to be dropped.
> >
> > It looks like when a umount is being done and the server is down or
> > does not confirm the umount, the client does not retry the umount and
> > this situation occurs; the situation is explained below.
> >
> > Does the above seem plausible?
> >
> > More information:
> >
> > Basic information: client is 2.4.21pre4 NFSALL (and 2.4.19 NFSALL),
> > nfsutils 1.0.1-1.
> >
> > When doing a df command we get this message in the messages file:
> >
> > nfs_statfs: statfs error = 116
> >
> > And the df looks like:
> >
> > hostname:/usr/applinux    0    1    0   0% /tmpmnt/usr/applinux
> >
> > Doing a umount /tmpmnt/usr/applinux fixes the problem (automounter
> > remounts it correctly).  I have had the problem happen with both
> > automounter and fstab mounted file systems, and I have had it happen
> > with a Solaris 8 machine as the server, so that argues to me that this
> > is a client problem and not a server problem.  I have had it happen on
> > both 2.4.19 NFSALL and 2.4.21pre4 NFSALL clients.
> >
> > The problem seems to happen without the server or obvious network
> > issues going on, though the problem also happens if the server reboots.
> > The server in this case would be 2.4.19 NFSALL, and the mount entry is:
> >
> > hostname:/usr/applinux /tmpmnt/usr/applinux nfs rw,v3,rsize=8192,wsize=8192,hard,intr,udp,lock,addr=hostname 0 0
> >
> > It seems to happen quite a lot if the server reboots (a few out of a
> > lot of nodes have the issue), with a umount being required to fix it.
> > It does not happen on all nodes (that we can tell, but it may happen on
> > all nodes that try to umount the down filesystem), just on some of the
> > nodes.  It also will happen with the client and server both up and ok,
> > without any warning and without the server rebooting or having anything
> > funny done on it, and it will only affect some nodes (usually 1).  I have
> > had it do this while the filesystem is being actively used (processes
> > have the fs open); in this case the processes have to be killed and
> > then I umount the filesystem.
> >
> > It looks like a client problem of some sort; the network should be
> > relatively clean.
> >
> >                                                         Roger
>
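
For reference, the "statfs error = 116" in the quoted log corresponds to
ESTALE on Linux. A minimal sketch of how a node could be checked for the
broken state without parsing df output follows. The function name and the
injectable statvfs argument are illustrative assumptions, not part of any
existing tool, and on a hard mount the call may block if the server is
unreachable.

```python
import errno
import os


def is_stale_mount(path, statvfs=os.statvfs):
    """Return True if statvfs() on `path` fails with ESTALE.

    `statvfs` is injectable only so the check can be exercised without
    a real stale NFS mount; it defaults to os.statvfs.
    """
    try:
        statvfs(path)
    except OSError as exc:
        if exc.errno == errno.ESTALE:
            return True
        raise  # any other error is not the stale-handle case
    return False
```

A monitoring script might call is_stale_mount("/tmpmnt/usr/applinux") on
each node and report the ones that return True.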


-------------------------------------------------------
Enterprise Linux Forum Conference & Expo, June 4-6, 2003, Santa Clara
The only event dedicated to issues related to Linux enterprise solutions
www.enterpriselinuxforum.com

_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs


* Re: Stale mounts - very nasty
  2003-05-14 14:04 Stale mounts - very nasty Heflin, Roger A.
@ 2003-05-14 15:29 ` Trond Myklebust
From: Trond Myklebust @ 2003-05-14 15:29 UTC
  To: Heflin, Roger A.; +Cc: nfs

>>>>> " " == Roger A Heflin <Heflin> writes:

     > Guys, I have one of these, where it went into this state in the
     > middle of a copy, i.e. part of a file got copied and then the
     > NFS partition went into this broken state, all with the server
     > doing nothing bad.

     > And I have now confirmed that it does it against Sun Solaris 8
     > servers, which makes it look like a client problem, and not a
     > server problem.

Mind showing us a tcpdump of a copy that fails? I don't buy that
'server doing nothing bad' theory without proof.

Under a reboot, the server may indeed use ESTALE to deny access to
files. Normally you ensure that this doesn't happen by killing nfsd
before you unexport the mounts, and by re-exporting before you start
nfsd (see the NFS list archives about the 'exportfs' ordering bugs in
older versions of nfs-utils).

If the server does return ESTALE, then the client is perfectly correct
in assuming that is a fatal error.

Cheers,
  Trond
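
The ordering rule above can be stated as an invariant: nfsd should never
be running while the exports are absent. The following is a small sketch
that checks a sequence of init-script steps against that invariant. The
step names are hypothetical labels, not real commands, and the model
ignores everything else an rc script does.

```python
def ordering_is_safe(steps, nfsd_running, exported):
    """Return True if nfsd is never running while the exports are absent.

    `steps` is a list of hypothetical step labels; `nfsd_running` and
    `exported` describe the state before the first step.
    """
    for step in steps:
        if step == "start_nfsd":
            nfsd_running = True
        elif step == "stop_nfsd":
            nfsd_running = False
        elif step == "export":
            exported = True
        elif step == "unexport":
            exported = False
        else:
            raise ValueError(f"unknown step: {step!r}")
        # A running nfsd with nothing exported can answer with ESTALE.
        if nfsd_running and not exported:
            return False
    return True
```

Under this model, startup as ["export", "start_nfsd"] and shutdown as
["stop_nfsd", "unexport"] are safe, while either reversed order opens a
window in which clients can be handed ESTALE.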


* RE: Stale mounts - very nasty
@ 2003-05-14 16:27 Heflin, Roger A.
  2003-05-14 17:05 ` Trond Myklebust
From: Heflin, Roger A. @ 2003-05-14 16:27 UTC
  To: Trond Myklebust; +Cc: nfs

We don't have a tcpdump of it; I don't know if we can get one,
but I will see.

The only thing I am coming up with is that we regenerate the exports
files and do an "exportfs -r" every so often (to automatically add new
hosts), and the failure *may* correspond with this happening, though
I can confirm that the exports file was the same before and after
in these cases, and that we never did an exportfs -r on a blank file
(I have copies of all of the previous versions of the exports file).
I have checked and the problem seems to occur more often than
the above, but I am not absolutely sure.  I have tightened up that
script so it only replaces the exports file and does the update
when the file is actually going to change; we will see if
this reduces the problem.
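
A rough sketch of that tightened-up logic, for anyone wanting to do the
same. This is not the actual script: the function name, the ".new"
temporary suffix, the default path and the refusal to install an empty
file are all assumptions made for illustration, and the default re-export
step assumes exportfs is on the PATH and the caller is root.

```python
import filecmp
import os
import shutil
import subprocess


def update_exports(candidate, live="/etc/exports", reexport=None):
    """Install `candidate` as `live` and re-export only if they differ.

    Returns True if the exports file was replaced, False if left alone.
    """
    if reexport is None:
        # Assumed default: refresh the kernel export table.
        def reexport():
            subprocess.run(["exportfs", "-r"], check=True)

    if os.path.getsize(candidate) == 0:
        # Assumed safeguard: never re-export from a blank file.
        raise ValueError("refusing to install an empty exports file")

    if os.path.exists(live) and filecmp.cmp(candidate, live, shallow=False):
        return False  # identical contents: leave the running exports alone

    # Copy next to the live file, then rename over it, so the live
    # file is never seen half-written.
    tmp = live + ".new"
    shutil.copyfile(candidate, tmp)
    os.replace(tmp, live)
    reexport()
    return True
```

The reexport argument exists so the comparison logic can be tried
against scratch files without touching a real server.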

I have checked the startup order on the Linux box and it is ok:
the rc script runs exportfs -r and then later starts the nfsd's.

The Sun startup is unknown; it is using failover and other things
and has the script in a nonstandard place.  I suspect that you
may be correct here, and that it may be starting nfs and
assuming the proper hostname to answer and reject nfs
mounts before the proper exportfs/shareall runs.  Someone
is going to carefully check this to verify its behavior.

If the Sun server export case is wrong, the Sun clients seem to still
retry after they get a stale, as the Sun clients have been confirmed
to have received the stale message, but recovered without
intervention by any admins, or a reboot.  Can anyone else
confirm that this appears to be the case?  This is with Solaris 8.

                              Roger

> -----Original Message-----
> From:	Trond Myklebust [SMTP:trond.myklebust@fys.uio.no]
> Sent:	Wednesday, May 14, 2003 10:29 AM
> To:	Heflin, Roger A.
> Cc:	nfs@lists.sourceforge.net
> Subject:	Re: [NFS] Stale mounts - very nasty
>
> >>>>> " " == Roger A Heflin <Heflin> writes:
>
>      > Guys, I have one of these, where it went into this state in the
>      > middle of a copy, i.e. part of a file got copied and then the
>      > NFS partition went into this broken state, all with the server
>      > doing nothing bad.
>
>      > And I have now confirmed that it does it against Sun Solaris 8
>      > servers, which makes it look like a client problem, and not a
>      > server problem.
>
> Mind showing us a tcpdump of a copy that fails? I don't buy that
> 'server doing nothing bad' theory without proof.
>
> Under a reboot, the server may indeed use ESTALE to deny access to
> files. Normally you ensure that this doesn't happen by killing nfsd
> before you unexport the mounts, and by re-exporting before you start
> nfsd (see the NFS list archives about the 'exportfs' ordering bugs in
> older versions of nfs-utils).
>
> If the server does return ESTALE, then the client is perfectly correct
> in assuming that is a fatal error.
>
> Cheers,
>   Trond


* RE: Stale mounts - very nasty
  2003-05-14 16:27 Heflin, Roger A.
@ 2003-05-14 17:05 ` Trond Myklebust
From: Trond Myklebust @ 2003-05-14 17:05 UTC
  To: Heflin, Roger A.; +Cc: Trond Myklebust, nfs

>>>>> " " == Roger A Heflin <Heflin> writes:

     > If the Sun server export case is wrong, the Sun clients seem
     > to still retry after they get a stale, as the Sun clients have
     > been confirmed to have received the stale message, but
     > recovered without intervention by any admins, or a reboot.  Can
     > anyone else confirm that this appears to be the case?  This is
     > with Solaris 8.

They probably do.

I'm more careful because there are still servers which are not safe
when it comes to the issue of filehandle reuse (in particular there
are people who still insist on using the userspace nfs daemon).  On
such a server, the fact that a filehandle is suddenly accepted again
does not necessarily imply that we are accessing the same file (or
even the same type of file).

Cheers,
  Trond
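
The filehandle reuse concern above can be illustrated with a toy model.
This is a sketch only and does not describe how any real NFS server
builds its filehandles: the class, its methods and the use of a
per-inode generation counter are assumptions chosen to show why a handle
that is "suddenly accepted again" may no longer name the same file.

```python
class ToyServer:
    """Toy file server that recycles inode numbers (illustration only)."""

    def __init__(self, use_generation):
        self.use_generation = use_generation
        self.files = {}       # inode number -> (generation, contents)
        self.generation = {}  # inode number -> times it has been allocated

    def create(self, inode, contents):
        gen = self.generation.get(inode, 0) + 1
        self.generation[inode] = gen
        self.files[inode] = (gen, contents)
        # The handle is what a client keeps and presents on later calls.
        return (inode, gen) if self.use_generation else (inode,)

    def remove(self, inode):
        del self.files[inode]

    def read(self, handle):
        inode = handle[0]
        if inode not in self.files:
            raise LookupError("ESTALE")
        gen, contents = self.files[inode]
        if self.use_generation and handle[1] != gen:
            raise LookupError("ESTALE")  # same inode number, different file
        return contents


unsafe = ToyServer(use_generation=False)
old_handle = unsafe.create(7, "old file")
unsafe.remove(7)
unsafe.create(7, "unrelated new file")
# The old handle is accepted again, but it now reads a different file.
reused_contents = unsafe.read(old_handle)
```

With use_generation=True the same sequence keeps raising ESTALE for the
old handle, which is the behaviour a client can safely rely on.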

