RE: Linux NFS and cached properties

linux-nfs.vger.kernel.org archive mirror

* RE: Linux NFS and cached properties
       [not found] ` <20120724143748.GC8570@fieldses.org>
@ 2012-07-24 17:28   ` ZUIDAM, Hans
  2012-07-26 22:36     ` J. Bruce Fields
  0 siblings, 1 reply; 10+ messages in thread
From: ZUIDAM, Hans @ 2012-07-24 17:28 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: linux-nfs@vger.kernel.org, neilb@suse.de, DE WITTE, PETER

Hi Bruce,

Thanks for the clarification.

(I'm repeating a lot of my original mail because of the Cc: list.)

> J. Bruce Fields
> I think that's right, though I'm curious how you're managing to hit
> that case reliably every time.  Or is this an intermittent failure?
It's an intermittent failure, but with the procedure shown below it is
fairly easy to reproduce.  The actual problem we see in our product
stems from the way external storage media are handled in user-land.

        192.168.1.10# mount -t xfs /dev/sdcr/sda1 /mnt
        192.168.1.10# exportfs 192.168.1.11:/mnt

        192.168.1.11# mount 192.168.1.10:/mnt /mnt
        192.168.1.11# umount /mnt

        192.168.1.10# exportfs -u 192.168.1.11:/mnt
        192.168.1.10# umount /mnt
        umount: can't umount /media/recdisk: Device or resource busy

What I actually do is the mount/unmount on the client via ssh.  That
is a good way to trigger the problem.

We see that during the un-export the NFS caches are not flushed
properly which is why the final unmount fails.

In net/sunrpc/cache.c the cache times (last_refresh, expiry_time,
flush_time) are measured in seconds.  If I understand the code correctly,
an NFS un-export flushes the cache by setting flush_time to the current
time and then calling cache_flush().  If in that same second
last_refresh is set to the current time then the cached item is not
flushed.  This will subsequently cause un-mount to fail because there
is still a reference to the mount point.
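
The failure mode described above can be modelled in a few lines of
user-space C.  This is only a sketch, not the kernel code: the struct,
the function name and the exact comparison are assumptions taken from
the description in this mail (times in whole seconds, an entry survives
a flush unless it was refreshed before flush_time).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical, simplified stand-in for the times kept per cache entry. */
struct entry_times {
	time_t last_refresh;	/* second in which the entry was last updated */
	time_t expiry_time;	/* second after which the entry is stale */
};

/*
 * Assumed check, following the description in this mail: an entry is
 * dropped if it has expired, or if it was last refreshed strictly
 * before the second recorded in flush_time.
 */
bool entry_is_flushed(const struct entry_times *e, time_t flush_time, time_t now)
{
	assert(e != NULL);
	return e->expiry_time < now || flush_time > e->last_refresh;
}
```

Under these assumptions, with flush_time = 100 an entry whose
last_refresh is 99 is dropped, while an entry refreshed at 100 (the same
second as the flush) is kept and goes on pinning the mount point.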

> J. Bruce Fields
> I ran across that recently while reviewing the code to fix a related
> problem.  I'm not sure what the best fix would be.
>
> Previously raised here:
>
>       http://marc.info/?l=linux-nfs&m=133514319408283&w=2

The description in your mail does indeed look the same as the problem
that we see.

From reading the code in net/sunrpc/cache.c I get the impression that it is
not really possible to reliably flush the caches for an un-exportfs such
that after flushing they will not accept entries for the un-exported IP/mount
point combination.

With kind regards,
Hans Zuidam



* Re: Linux NFS and cached properties
  2012-07-24 17:28   ` Linux NFS and cached properties ZUIDAM, Hans
@ 2012-07-26 22:36     ` J. Bruce Fields
  2012-07-31  5:08       ` NeilBrown
  0 siblings, 1 reply; 10+ messages in thread
From: J. Bruce Fields @ 2012-07-26 22:36 UTC (permalink / raw)
  To: ZUIDAM, Hans; +Cc: linux-nfs@vger.kernel.org, neilb@suse.de, DE WITTE, PETER

On Tue, Jul 24, 2012 at 05:28:02PM +0000, ZUIDAM, Hans wrote:
> From reading the code in net/sunrpc/cache.c I get the impression that it is
> not really possible to reliably flush the caches for an un-exportfs such
> that after flushing they will not accept entries for the un-exported IP/mount
> point combination.

Right.  So, possible ideas, from that previous message:

	- As Neil suggests, modify exportfs to wait a second between
	  updating etab and flushing the cache.  At that point any
	  entries still using the old information are at least a second
	  old.  That may be adequate for your case, but if someone out
	  there is sensitive to the time required to unexport then that
	  will annoy them.  It also leaves the small possibility of
	  races where an in-progress rpc may still be using an export at
	  the time you try to flush.
	- Implement some new interface that you can use to flush the
	  cache and that doesn't return until in-progress rpc's
	  complete.  Since it waits for rpc's it's not purely a "cache"
	  layer interface any more.  So maybe something like
	  /proc/fs/nfsd/flush_exports.
	- As a workaround requiring no code changes: unexport, then shut
	  down the server entirely and restart it.  Clients will see
	  that as a reboot recovery event and recover automatically, but
	  applications may see delays while that happens.  Kind of a big
	  hammer, but if unexporting while other exports are in use is
	  rare maybe it would be adequate for your case.
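
The first idea in the list above (wait a second between updating etab
and flushing) could be sketched like this.  It is an illustration only:
sleep_full_second() and unexport_then_flush() are hypothetical names,
and the two callbacks stand in for whatever exportfs really does to
rewrite etab and to flush the cache.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <time.h>

/* Sleep for one full second, restarting if a signal interrupts us. */
int sleep_full_second(void)
{
	struct timespec left = { .tv_sec = 1, .tv_nsec = 0 };

	while (nanosleep(&left, &left) == -1) {
		if (errno != EINTR)
			return -errno;
	}
	return 0;
}

/*
 * Hypothetical unexport sequence: update the export table, let a full
 * second pass so that entries built from the old table are at least a
 * second old, and only then flush.
 */
int unexport_then_flush(int (*update_etab)(void), int (*flush_cache)(void))
{
	int err;

	assert(update_etab != NULL && flush_cache != NULL);
	err = update_etab();
	if (err)
		return err;
	err = sleep_full_second();
	if (err)
		return err;
	return flush_cache();
}
```

As noted in the list, this does not close the race with an rpc that is
still in progress when the flush happens; it only narrows the window.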

--b.


* Re: Linux NFS and cached properties
  2012-07-26 22:36     ` J. Bruce Fields
@ 2012-07-31  5:08       ` NeilBrown
  2012-07-31 12:25         ` J. Bruce Fields
  0 siblings, 1 reply; 10+ messages in thread
From: NeilBrown @ 2012-07-31  5:08 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: ZUIDAM, Hans, linux-nfs@vger.kernel.org, DE WITTE, PETER

On Thu, 26 Jul 2012 18:36:07 -0400 "J. Bruce Fields" <bfields@fieldses.org>
wrote:

> Right.  So, possible ideas, from that previous message:
> 
> 	- As Neil suggests, modify exportfs to wait a second between
> 	  updating etab and flushing the cache.  At that point any
> 	  entries still using the old information are at least a second
> 	  old.  That may be adequate for your case, but if someone out
> 	  there is sensitive to the time required to unexport then that
> 	  will annoy them.  It also leaves the small possibility of
> 	  races where an in-progress rpc may still be using an export at
> 	  the time you try to flush.
> 	- Implement some new interface that you can use to flush the
> 	  cache and that doesn't return until in-progress rpc's
> 	  complete.  Since it waits for rpc's it's not purely a "cache"
> 	  layer interface any more.  So maybe something like
> 	  /proc/fs/nfsd/flush_exports.
> 	- As a workaround requiring no code changes: unexport, then shut
> 	  down the server entirely and restart it.  Clients will see
> 	  that as a reboot recovery event and recover automatically, but
> 	  applications may see delays while that happens.  Kind of a big
> 	  hammer, but if unexporting while other exports are in use is
> 	  rare maybe it would be adequate for your case.

That's a shame...
I had originally intended "rpc.nfsd 0" to simply stop all threads and nothing
else.  Then you would be able to:
   rpc.nfsd 0
   exportfs -f
   unmount
   rpc.nfsd 16

and have a nice fast race-free unmount.
But commit e096bbc6488d3e49d476bf986d33752709361277 'fixed' that :-(

I wonder if it can be resurrected ... maybe not worth the effort.


The idea of a new interface to synchronise with all threads has potential and
doesn't need to be at the nfsd level - it could be in sunrpc.  Maybe it could
be built into the current 'flush' interface.
1/ iterate through all non-sleeping threads, setting a flag and incrementing a
counter.
2/ when a thread completes its current request, if test_and_clear of the flag
succeeds, it does atomic_dec_and_test on the counter and then wakes up some
wait_queue_head.
3/ the 'flush'ing thread waits on the wait_queue_head for the counter to be 0.

If you don't hate it I could possibly even provide some code.
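
The three steps above can be modelled in user space with a mutex and a
condition variable.  This is a sketch under stated assumptions, not
kernel code: every name here is hypothetical, the condition variable
stands in for the wait_queue_head, and a single lock replaces the atomic
flag and counter operations that the kernel version would use.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define NWORKERS 4	/* arbitrary size for the sketch */

struct sync_state {
	pthread_mutex_t lock;
	pthread_cond_t done;		/* stands in for the wait_queue_head */
	bool busy[NWORKERS];		/* worker is handling a request */
	bool flagged[NWORKERS];		/* worker must report completion */
	int pending;			/* flagged workers still running */
};

void sync_init(struct sync_state *s)
{
	pthread_mutex_init(&s->lock, NULL);
	pthread_cond_init(&s->done, NULL);
	for (int i = 0; i < NWORKERS; i++)
		s->busy[i] = s->flagged[i] = false;
	s->pending = 0;
}

/* Called by worker i when it picks up a request. */
void request_begin(struct sync_state *s, int i)
{
	assert(i >= 0 && i < NWORKERS);
	pthread_mutex_lock(&s->lock);
	s->busy[i] = true;
	pthread_mutex_unlock(&s->lock);
}

/* Step 2: on completion, clear the flag and drop the counter. */
void request_end(struct sync_state *s, int i)
{
	assert(i >= 0 && i < NWORKERS);
	pthread_mutex_lock(&s->lock);
	s->busy[i] = false;
	if (s->flagged[i]) {
		s->flagged[i] = false;
		if (--s->pending == 0)
			pthread_cond_broadcast(&s->done);
	}
	pthread_mutex_unlock(&s->lock);
}

/* Step 1: flag every busy worker; returns how many were flagged. */
int flag_busy_workers(struct sync_state *s)
{
	int n = 0;

	pthread_mutex_lock(&s->lock);
	for (int i = 0; i < NWORKERS; i++) {
		if (s->busy[i] && !s->flagged[i]) {
			s->flagged[i] = true;
			s->pending++;
			n++;
		}
	}
	pthread_mutex_unlock(&s->lock);
	return n;
}

/* Step 3: block until every flagged worker has finished its request. */
void wait_for_flagged(struct sync_state *s)
{
	pthread_mutex_lock(&s->lock);
	while (s->pending > 0)
		pthread_cond_wait(&s->done, &s->lock);
	pthread_mutex_unlock(&s->lock);
}
```

A flusher would call flag_busy_workers() and then wait_for_flagged().
Requests that start after the flagging step are not waited for, which
matches the intent of only synchronising with work already in progress.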

NeilBrown




* Re: Linux NFS and cached properties
  2012-07-31  5:08       ` NeilBrown
@ 2012-07-31 12:25         ` J. Bruce Fields
  2012-07-31 12:45           ` J. Bruce Fields
  2012-08-02  0:04           ` NeilBrown
  0 siblings, 2 replies; 10+ messages in thread
From: J. Bruce Fields @ 2012-07-31 12:25 UTC (permalink / raw)
  To: NeilBrown; +Cc: ZUIDAM, Hans, linux-nfs@vger.kernel.org, DE WITTE, PETER

On Tue, Jul 31, 2012 at 03:08:01PM +1000, NeilBrown wrote:
> On Thu, 26 Jul 2012 18:36:07 -0400 "J. Bruce Fields" <bfields@fieldses.org>
> wrote:
> 
> > Right.  So, possible ideas, from that previous message:
> > 
> > 	- As Neil suggests, modify exportfs to wait a second between
> > 	  updating etab and flushing the cache.  At that point any
> > 	  entries still using the old information are at least a second
> > 	  old.  That may be adequate for your case, but if someone out
> > 	  there is sensitive to the time required to unexport then that
> > 	  will annoy them.  It also leaves the small possibility of
> > 	  races where an in-progress rpc may still be using an export at
> > 	  the time you try to flush.
> > 	- Implement some new interface that you can use to flush the
> > 	  cache and that doesn't return until in-progress rpc's
> > 	  complete.  Since it waits for rpc's it's not purely a "cache"
> > 	  layer interface any more.  So maybe something like
> > 	  /proc/fs/nfsd/flush_exports.
> > 	- As a workaround requiring no code changes: unexport, then shut
> > 	  down the server entirely and restart it.  Clients will see
> > 	  that as a reboot recovery event and recover automatically, but
> > 	  applications may see delays while that happens.  Kind of a big
> > 	  hammer, but if unexporting while other exports are in use is
> > 	  rare maybe it would be adequate for your case.
> 
> That's a shame...
> I had originally intended "rpc.nfsd 0" to simply stop all threads and nothing
> else.  Then you would be able to:
>    rpc.nfsd 0
>    exportfs -f
>    unmount
>    rpc.nfsd 16
> 
> and have a nice fast race-free unmount.
> But commit e096bbc6488d3e49d476bf986d33752709361277 'fixed' that :-(
> 
> I wonder if it can be resurrected ... maybe not worth the effort.

That also shut down v4 state.  Making the clients recover would
typically be more expensive than ditching the export table.  (Did it
also throw out NLM locks?  I can't tell on a quick check.)

> The idea of a new interface to synchronise with all threads has potential and
> doesn't need to be at the nfsd level - it could be in sunrpc.  Maybe it could
> be built into the current 'flush' interface.

We need to keep compatible behavior to prevent deadlocks.  (Don't want
nfsd waiting on mountd waiting on nfsd.)

Looks like write_flush currently returns -EINVAL to anything that's not
an integer.  So exportfs could write something new and ignore the error
return (or try some other workaround) in the case of an old kernel.
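
A user-space tool could probe for such an extension roughly as follows.
This is a sketch only: write_control() and flush_with_fallback() are
hypothetical helpers, and the new-style request text is left to the
caller because no such request has been defined in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write one string to a control file; returns 0 or a negative errno. */
int write_control(const char *path, const char *text)
{
	assert(path != NULL && text != NULL);

	int fd = open(path, O_WRONLY);
	if (fd < 0)
		return -errno;

	size_t len = strlen(text);
	ssize_t n = write(fd, text, len);
	int err = 0;

	if (n < 0)
		err = -errno;
	else if ((size_t)n != len)
		err = -EIO;	/* short write: treat as failure */

	close(fd);
	return err;
}

/*
 * Try the (hypothetical) new-style request first.  If the kernel rejects
 * it with EINVAL, as an old kernel is expected to do for non-integer
 * input, fall back to the legacy integer form supplied by the caller.
 */
int flush_with_fallback(const char *path, const char *new_text,
			const char *legacy_text)
{
	int err = write_control(path, new_text);

	if (err == -EINVAL)
		err = write_control(path, legacy_text);
	return err;
}
```

The path would be one of the cache "flush" files, and legacy_text the
integer that is written there today.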

> 1/ iterate through all non-sleeping threads, setting a flag and incrementing a
> counter.
> 2/ when a thread completes its current request, if test_and_clear of the flag
> succeeds, it does atomic_dec_and_test on the counter and then wakes up some
> wait_queue_head.
> 3/ the 'flush'ing thread waits on the wait_queue_head for the counter to be 0.
> 
> If you don't hate it I could possibly even provide some code.

That sounds reasonable to me.  So you'd just add a single such
thread-synchronization after modifying mountd's idea of the export
table, ok.

It still wouldn't allow an unmount in the case a client held an NLM lock
or v4 open--but I think that's what we want.  If somebody wants a way to
unmount even in the presence of such state, then they really need to do
a complete shutdown.

I wonder if there's also still a use for an operation that stops all
threads temporarily but doesn't toss any state or caches?  I'm not
coming up with one off the top of my head.

--b.


* Re: Linux NFS and cached properties
  2012-07-31 12:25         ` J. Bruce Fields
@ 2012-07-31 12:45           ` J. Bruce Fields
  2012-07-31 14:07             ` J. Bruce Fields
  2012-08-02  0:04           ` NeilBrown
  1 sibling, 1 reply; 10+ messages in thread
From: J. Bruce Fields @ 2012-07-31 12:45 UTC (permalink / raw)
  To: NeilBrown; +Cc: ZUIDAM, Hans, linux-nfs@vger.kernel.org, DE WITTE, PETER

On Tue, Jul 31, 2012 at 08:25:46AM -0400, J. Bruce Fields wrote:
> On Tue, Jul 31, 2012 at 03:08:01PM +1000, NeilBrown wrote:
> > The idea of a new interface to synchronise with all threads has potential and
> > doesn't need to be at the nfsd level - it could be in sunrpc.  Maybe it could
> > be built into the current 'flush' interface.

The flush operation will have to know which services to wait on when
flushing a given cache (lockd and nfsd in the export cache cases).

A little annoying that it may end up having to wait on a client-side
operation in the case of lockd, but I don't think that's a show-stopper.

--b.


* Re: Linux NFS and cached properties
  2012-07-31 12:45           ` J. Bruce Fields
@ 2012-07-31 14:07             ` J. Bruce Fields
  0 siblings, 0 replies; 10+ messages in thread
From: J. Bruce Fields @ 2012-07-31 14:07 UTC (permalink / raw)
  To: NeilBrown; +Cc: ZUIDAM, Hans, linux-nfs@vger.kernel.org, DE WITTE, PETER

On Tue, Jul 31, 2012 at 08:45:50AM -0400, J. Bruce Fields wrote:
> On Tue, Jul 31, 2012 at 08:25:46AM -0400, J. Bruce Fields wrote:
> > On Tue, Jul 31, 2012 at 03:08:01PM +1000, NeilBrown wrote:
> > > The idea of a new interface to synchronise with all threads has potential and
> > > doesn't need to be at the nfsd level - it could be in sunrpc.  Maybe it could
> > > be built into the current 'flush' interface.
> 
> The flush operation will have to know which services to wait on when
> flushing a given cache (lockd and nfsd in the export cache cases).
> 
> A little annoying that it may end up having to wait on a client-side
> operation in the case of lockd, but I don't think that's a show-stopper.

Ignore me, I wasn't thinking straight: a lockd thread won't of course be
waiting on client rpc's to a server, it will just be handling callbacks,
which should be very quick.

--b.


* Re: Linux NFS and cached properties
  2012-07-31 12:25         ` J. Bruce Fields
  2012-07-31 12:45           ` J. Bruce Fields
@ 2012-08-02  0:04           ` NeilBrown
  2012-08-02  2:50             ` J. Bruce Fields
  2012-08-16 19:10             ` J. Bruce Fields
  1 sibling, 2 replies; 10+ messages in thread
From: NeilBrown @ 2012-08-02  0:04 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: ZUIDAM, Hans, linux-nfs@vger.kernel.org, DE WITTE, PETER

On Tue, 31 Jul 2012 08:25:46 -0400 "J. Bruce Fields" <bfields@fieldses.org>
wrote:

> On Tue, Jul 31, 2012 at 03:08:01PM +1000, NeilBrown wrote:
> > On Thu, 26 Jul 2012 18:36:07 -0400 "J. Bruce Fields" <bfields@fieldses.org>
> > wrote:
> > 
> > > Right.  So, possible ideas, from that previous message:
> > > 
> > > 	- As Neil suggests, modify exportfs to wait a second between
> > > 	  updating etab and flushing the cache.  At that point any
> > > 	  entries still using the old information are at least a second
> > > 	  old.  That may be adequate for your case, but if someone out
> > > 	  there is sensitive to the time required to unexport then that
> > > 	  will annoy them.  It also leaves the small possibility of
> > > 	  races where an in-progress rpc may still be using an export at
> > > 	  the time you try to flush.
> > > 	- Implement some new interface that you can use to flush the
> > > 	  cache and that doesn't return until in-progress rpc's
> > > 	  complete.  Since it waits for rpc's it's not purely a "cache"
> > > 	  layer interface any more.  So maybe something like
> > > 	  /proc/fs/nfsd/flush_exports.
> > > 	- As a workaround requiring no code changes: unexport, then shut
> > > 	  down the server entirely and restart it.  Clients will see
> > > 	  that as a reboot recovery event and recover automatically, but
> > > 	  applications may see delays while that happens.  Kind of a big
> > > 	  hammer, but if unexporting while other exports are in use is
> > > 	  rare maybe it would be adequate for your case.
> > 
> > That's a shame...
> > I had originally intended "rpc.nfsd 0" to simply stop all threads and nothing
> > else.  Then you would be able to:
> >    rpc.nfsd 0
> >    exportfs -f
> >    unmount
> >    rpc.nfsd 16
> > 
> > and have a nice fast race-free unmount.
> > But commit e096bbc6488d3e49d476bf986d33752709361277 'fixed' that :-(
> > 
> > I wonder if it can be resurrected ... maybe not worth the effort.
> 
> That also shut down v4 state.  Making the clients recover would
> typically be more expensive than ditching the export table.  (Did it
> also throw out NLM locks?  I can't tell on a quick check.)

No, it didn't do anything except stop all the threads.
I never liked the fact that stopping the last thread did something extra.
So when I added the ability to control the number of threads via sysfs I made
sure that it *only* controlled the number of threads.  However I kept the
legacy behaviour that sending SIGKILL to the nfsd threads would also unexport
things.  Obviously I should have documented this better.

The more I think about it, the more I'd really like to go back to that.  It
really is the *right* thing to do.

> 
> > The idea of a new interface to synchronise with all threads has potential and
> > doesn't need to be at the nfsd level - it could be in sunrpc.  Maybe it could
> > be built into the current 'flush' interface.
> 
> We need to keep compatible behavior to prevent deadlocks.  (Don't want
> nfsd waiting on mountd waiting on nfsd.)
> 
> Looks like write_flush currently returns -EINVAL to anything that's not
> an integer.  So exportfs could write something new and ignore the error
> return (or try some other workaround) in the case of an old kernel.
> 
> > 1/ iterate through all non-sleeping threads, setting a flag and incrementing a
> > counter.
> > 2/ when a thread completes its current request, if test_and_clear of the flag
> > succeeds, it does atomic_dec_and_test on the counter and then wakes up some
> > wait_queue_head.
> > 3/ the 'flush'ing thread waits on the wait_queue_head for the counter to be 0.
> > 
> > If you don't hate it I could possibly even provide some code.
> 
> That sounds reasonable to me.  So you'd just add a single such
> thread-synchronization after modifying mountd's idea of the export
> table, ok.
> 
> It still wouldn't allow an unmount in the case a client held an NLM lock
> or v4 open--but I think that's what we want.  If somebody wants a way to
> unmount even in the presence of such state, then they really need to do
> a complete shutdown.
> 
> I wonder if there's also still a use for an operation that stops all
> threads temporarily but doesn't toss any state or caches?  I'm not
> coming up with one off the top of my head.
> 
> --b.

Actually, I think you were right the first time.  The cache isn't really well
positioned as it doesn't have a list of services to synchronise with.
We could give it one, but I don't think that is such a good idea.

We already have a way to forcibly drop all locks on a filesystem, don't we?
   /proc/fs/nfsd/unlock_filesystem

Does that unlock the filesystem from the nfsv4 perspective too?  Should it?

I wonder if it might make sense to insert a 'sync with various threads' call
in there.

NeilBrown




* Re: Linux NFS and cached properties
  2012-08-02  0:04           ` NeilBrown
@ 2012-08-02  2:50             ` J. Bruce Fields
  2012-08-16 19:10             ` J. Bruce Fields
  1 sibling, 0 replies; 10+ messages in thread
From: J. Bruce Fields @ 2012-08-02  2:50 UTC (permalink / raw)
  To: NeilBrown; +Cc: ZUIDAM, Hans, linux-nfs@vger.kernel.org, DE WITTE, PETER

On Thu, Aug 02, 2012 at 10:04:05AM +1000, NeilBrown wrote:
> On Tue, 31 Jul 2012 08:25:46 -0400 "J. Bruce Fields" <bfields@fieldses.org>
> wrote:
> 
> > On Tue, Jul 31, 2012 at 03:08:01PM +1000, NeilBrown wrote:
> > > On Thu, 26 Jul 2012 18:36:07 -0400 "J. Bruce Fields" <bfields@fieldses.org>
> > > wrote:
> > > 
> > > > On Tue, Jul 24, 2012 at 05:28:02PM +0000, ZUIDAM, Hans wrote:
> > > > > Hi Bruce,
> > > > > 
> > > > > Thanks for the clarification.
> > > > > 
> > > > > (I'm repeating a lot of my original mail because of the Cc: list.)
> > > > > 
> > > > > > J. Bruce Fields
> > > > > > I think that's right, though I'm curious how you're managing to hit
> > > > > > that case reliably every time.  Or is this an intermittent failure?
> > > > > It's an intermittent failure, but with the procedure shown below it is
> > > > > fairly easy to reproduce.    The actual problem we see in our product
> > > > > is because of the way external storage media are handled in user-land.
> > > > > 
> > > > >         192.168.1.10# mount -t xfs /dev/sdcr/sda1 /mnt
> > > > >         192.168.1.10# exportfs 192.168.1.11:/mnt
> > > > > 
> > > > >         192.168.1.11# mount 192.168.1.10:/mnt /mnt
> > > > >         192.168.1.11# umount /mnt
> > > > > 
> > > > >         192.168.1.10# exportfs -u 192.168.1.11:/mnt
> > > > >         192.168.1.10# umount /mnt
> > > > >         umount: can't umount /media/recdisk: Device or resource busy
> > > > > 
> > > > > What I actually do is the mount/unmount on the client via ssh.  That
> > > > > is a good way to trigger the problem.
> > > > > 
> > > > > We see that during the un-export the NFS caches are not flushed
> > > > > properly which is why the final unmount fails.
> > > > > 
> > > > > In net/sunrpc/cache.c the cache times (last_refresh, expiry_time,
> > > > > flush_time) are measured in seconds.  If I understand the code somewhat
> > > > > > then during an NFS un-export the flush is done by setting the flush_time to
> > > > > > the current time.  Then cache_flush() is called.  If in that same second
> > > > > last_refresh is set to the current time then the cached item is not
> > > > > flushed.  This will subsequently cause un-mount to fail because there
> > > > > is still a reference to the mount point.
> > > > > 
> > > > > > J. Bruce Fields
> > > > > > I ran across that recently while reviewing the code to fix a related
> > > > > > problem.  I'm not sure what the best fix would be.
> > > > > >
> > > > > > Previously raised here:
> > > > > >
> > > > > >       http://marc.info/?l=linux-nfs&m=133514319408283&w=2
> > > > > 
> > > > > > The description in your mail does indeed look the same as the problem
> > > > > > that we see.
> > > > > > 
> > > > > > From reading the code in net/sunrpc/cache.c I get the impression that it is
> > > > > not really possible to reliably flush the caches for an un-exportfs such
> > > > > that after flushing they will not accept entries for the un-exported IP/mount
> > > > > point combination.
> > > > 
> > > > Right.  So, possible ideas, from that previous message:
> > > > 
> > > > 	- As Neil suggests, modify exportfs to wait a second between
> > > > 	  updating etab and flushing the cache.  At that point any
> > > > 	  entries still using the old information are at least a second
> > > > 	  old.  That may be adequate for your case, but if someone out
> > > > 	  there is sensitive to the time required to unexport then that
> > > > 	  will annoy them.  It also leaves the small possibility of
> > > > 	  races where an in-progress rpc may still be using an export at
> > > > 	  the time you try to flush.
> > > > 	- Implement some new interface that you can use to flush the
> > > > 	  cache and that doesn't return until in-progress rpc's
> > > > 	  complete.  Since it waits for rpc's it's not purely a "cache"
> > > > 	  layer interface any more.  So maybe something like
> > > > 	  /proc/fs/nfsd/flush_exports.
> > > > 	- As a workaround requiring no code changes: unexport, then shut
> > > > 	  down the server entirely and restart it.  Clients will see
> > > > 	  that as a reboot recovery event and recover automatically, but
> > > > 	  applications may see delays while that happens.  Kind of a big
> > > > 	  hammer, but if unexporting while other exports are in use is
> > > > 	  rare maybe it would be adequate for your case.
> > > 
> > > That's a shame...
> > > I had originally intended "rpc.nfsd 0" to simply stop all threads and nothing
> > > else.  Then you would be able to:
> > >    rpc.nfsd 0
> > >    exportfs -f
> > >    unmount
> > >    rpc.nfsd 16
> > > 
> > > and have a nice fast race-free unmount.
> > > But commit e096bbc6488d3e49d476bf986d33752709361277 'fixed' that :-(
> > > 
> > > I wonder if it can be resurrected ... maybe not worth the effort.
> > 
> > That also shut down v4 state.  Making the clients recover would
> > typically be more expensive than ditching the export table.  (Did it
> > also throw out NLM locks?  I can't tell on a quick check.)
> 
> No, it didn't do anything except stop all the threads.
> I never liked the fact that stopping the last thread did something extra.
> So when I added the ability to control the number of threads via sysfs I made
> sure that it *only* controlled the number of threads.  However I kept the
> legacy behaviour that sending SIGKILL to the nfsd threads would also unexport
> things.  Obviously I should have documented this better.
> 
> The more I think about it, the more I'd really like to go back to that.  It
> really is the *right* thing to do.

Could be.

The nfsd startup/shutdown has a number of problems, and figuring out how
to containerize it will uncover some more.  I'm open to any suggestions.

> > > The idea of a new interface to synchronise with all threads has potential and
> > > doesn't need to be at the nfsd level - it could be in sunrpc.  Maybe it could
> > > be built into the current 'flush' interface.
> > 
> > We need to keep compatible behavior to prevent deadlocks.  (Don't want
> > nfsd waiting on mountd waiting on nfsd.)
> > 
> > Looks like write_flush currently returns -EINVAL to anything that's not
> > an integer.  So exportfs could write something new and ignore the error
> > return (or try some other workaround) in the case of an old kernel.
> > 
> > > 1/ iterate through all non-sleeping threads, setting a flag and increasing a
> > > counter.
> > > 2/ when a thread completes its current request, if test_and_clear of the flag
> > > succeeds, it does atomic_dec_and_test on the counter and then wakes up some
> > > wait_queue_head.
> > > 3/ the 'flush'ing thread waits on the wait_queue_head for the counter to be 0.
> > > 
> > > If you don't hate it I could possibly even provide some code.
> > 
> > That sounds reasonable to me.  So you'd just add a single such
> > thread-synchronization after modifying mountd's idea of the export
> > table, ok.
> > 
> > It still wouldn't allow an unmount in the case a client held an NSM lock
> > or v4 open--but I think that's what we want.  If somebody wants a way to
> > unmount even in the presence of such state, then they really need to do
> > a complete shutdown.
> > 
> > I wonder if there's also still a use for an operation that stops all
> > threads temporarily but doesn't toss any state or caches?  I'm not
> > coming up with one off the top of my head.
> > 
> > --b.
> 
> Actually, I think you were right the first time.  The cache isn't really well
> positioned as it doesn't have a list of services to synchronise with.
> We could give it one, but I don't think that is such a good idea.

OK.

> We already have a way to forcibly drop all locks on a filesystem, don't we?
>    /proc/fs/nfsd/unlock_filesystem
> 
> Does that unlock the filesystem from the nfsv4 perspective too?  Should it?

It doesn't, but yes it probably should.

> I wonder if it might make sense to insert a 'sync with various threads' call
> in there.

Probably so.

But we still need a version that doesn't break locks: an admin should be
able to say "unexport and unmount if you can, but not if it means
throwing away client state".

--b.
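
A minimal sketch of the fallback idea quoted above, in which exportfs
writes something new to the flush file and falls back when an old kernel
answers -EINVAL.  The token "sync" and the name flush_with_fallback are
made up for illustration; nothing in this thread defines the new value.
The write callable stands in for a write(2) on an already-open flush file.

```python
import errno
import time

# Hypothetical token; the thread does not say what exportfs would write.
SYNC_TOKEN = b"sync\n"


def flush_with_fallback(write, now=None):
    """Try the (assumed) synchronising flush, fall back to the integer form.

    write: callable taking bytes; raises OSError on failure, like os.write.
    now:   timestamp for the legacy flush; defaults to the current time.
    Returns "sync" if the new token was accepted, "legacy" otherwise.
    """
    if now is None:
        now = int(time.time())
    try:
        write(SYNC_TOKEN)
        return "sync"
    except OSError as exc:
        # An old kernel rejects anything that is not an integer with EINVAL.
        # Any other error is unexpected and is passed up to the caller.
        if exc.errno != errno.EINVAL:
            raise
    # Legacy behaviour: write a plain integer timestamp.
    write(b"%d\n" % now)
    return "legacy"
```

On a real system write might wrap os.write() on a descriptor opened on one
of the cache flush files; what a new kernel would do with the token is
left open here.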


* Re: Linux NFS and cached properties
  2012-08-02  0:04           ` NeilBrown
  2012-08-02  2:50             ` J. Bruce Fields
@ 2012-08-16 19:10             ` J. Bruce Fields
  2012-08-16 21:05               ` NeilBrown
  1 sibling, 1 reply; 10+ messages in thread
From: J. Bruce Fields @ 2012-08-16 19:10 UTC (permalink / raw)
  To: NeilBrown; +Cc: ZUIDAM, Hans, linux-nfs@vger.kernel.org, DE WITTE, PETER

On Thu, Aug 02, 2012 at 10:04:05AM +1000, NeilBrown wrote:
> I never liked the fact that stopping the last thread did something extra.
> So when I added the ability to control the number of threads via sysfs I made

(You meant nfsd, not sysfs, right?  Or is there some interface I'm
overlooking?)

> sure that it *only* controlled the number of threads.  However I kept the
> legacy behaviour that sending SIGKILL to the nfsd threads would also unexport
> things.  Obviously I should have documented this better.
> 
> The more I think about it, the more I'd really like to go back to that.  It
> really is the *right* thing to do.
...
> > > > 1/ iterate through all non-sleeping threads, setting a flag and increasing a
> > > > counter.
> > > > 2/ when a thread completes its current request, if test_and_clear of the flag
> > > > succeeds, it does atomic_dec_and_test on the counter and then wakes up some
> > > > wait_queue_head.
> > > > 3/ the 'flush'ing thread waits on the wait_queue_head for the counter to be 0.
> > > 
> > > If you don't hate it I could possibly even provide some code.

By the way, are you still looking into one of those approaches?

--b.
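
For reference, the three quoted steps can be modelled in user space
roughly as follows.  This is only a sketch: Python threading primitives
stand in for the per-thread flag, the atomic counter and the
wait_queue_head, and the names (WorkerSync, begin_request, end_request,
sync) are invented here and are not sunrpc or nfsd interfaces.  It
assumes one flusher at a time.

```python
import threading


class WorkerSync:
    """User-space model of the flag/counter/wait-queue proposal."""

    def __init__(self):
        self._cond = threading.Condition()    # plays the wait_queue_head
        self._flush_mutex = threading.Lock()  # serialises flushers
        self._busy = set()     # workers currently handling a request
        self._flagged = set()  # busy workers the flusher is waiting for
        self._pending = 0      # the counter from step 1

    def begin_request(self, worker_id):
        with self._cond:
            self._busy.add(worker_id)

    def end_request(self, worker_id):
        with self._cond:
            self._busy.discard(worker_id)
            # Step 2: test-and-clear the flag, decrement the counter,
            # and wake the flusher once the counter reaches zero.
            if worker_id in self._flagged:
                self._flagged.discard(worker_id)
                self._pending -= 1
                if self._pending == 0:
                    self._cond.notify_all()

    def sync(self, timeout=None):
        """Wait for requests in flight at call time; False on timeout."""
        with self._flush_mutex:
            with self._cond:
                # Step 1: flag every non-sleeping worker and count them.
                self._flagged = set(self._busy)
                self._pending = len(self._flagged)
                # Step 3: wait for the counter to drop to zero.
                return self._cond.wait_for(
                    lambda: self._pending == 0, timeout)
```

Requests that start after sync() has taken its snapshot are not waited
for, which matches the intent: they would already see the updated export
table.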


* Re: Linux NFS and cached properties
  2012-08-16 19:10             ` J. Bruce Fields
@ 2012-08-16 21:05               ` NeilBrown
  0 siblings, 0 replies; 10+ messages in thread
From: NeilBrown @ 2012-08-16 21:05 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: ZUIDAM, Hans, linux-nfs@vger.kernel.org, DE WITTE, PETER


On Thu, 16 Aug 2012 15:10:18 -0400 "J. Bruce Fields" <bfields@fieldses.org>
wrote:

> On Thu, Aug 02, 2012 at 10:04:05AM +1000, NeilBrown wrote:
> > I never liked the fact that stopping the last thread did something extra.
> > So when I added the ability to control the number of threads via sysfs I made
> 
> (You meant nfsd, not sysfs, right?  Or is there some interface I'm
> overlooking?)

Right.

> 
> > sure that it *only* controlled the number of threads.  However I kept the
> > legacy behaviour that sending SIGKILL to the nfsd threads would also unexport
> > things.  Obviously I should have documented this better.
> > 
> > The more I think about it, the more I'd really like to go back to that.  It
> > really is the *right* thing to do.
> ...
> > > > > 1/ iterate through all non-sleeping threads, setting a flag and increasing a
> > > > > counter.
> > > > > 2/ when a thread completes its current request, if test_and_clear of the flag
> > > > > succeeds, it does atomic_dec_and_test on the counter and then wakes up some
> > > > > wait_queue_head.
> > > > > 3/ the 'flush'ing thread waits on the wait_queue_head for the counter to be 0.
> > > > 
> > > > If you don't hate it I could possibly even provide some code.
> 
> By the way, are you still looking into one of those approaches?

Not yet.  Other things got in the way.

I made a note to remind me when room appears in my schedule. :-)

NeilBrown




Thread overview: 10+ messages
     [not found] <D307B3AC0BCD4C419E6B8FA6A2720A9C0C3B2F@011-DB3MPN1-001.MGDPHG.emi.philips.com>
     [not found] ` <20120724143748.GC8570@fieldses.org>
2012-07-24 17:28   ` Linux NFS and cached properties ZUIDAM, Hans
2012-07-26 22:36     ` J. Bruce Fields
2012-07-31  5:08       ` NeilBrown
2012-07-31 12:25         ` J. Bruce Fields
2012-07-31 12:45           ` J. Bruce Fields
2012-07-31 14:07             ` J. Bruce Fields
2012-08-02  0:04           ` NeilBrown
2012-08-02  2:50             ` J. Bruce Fields
2012-08-16 19:10             ` J. Bruce Fields
2012-08-16 21:05               ` NeilBrown
