Date: Thu, 2 Aug 2012 10:04:05 +1000
From: NeilBrown
To: "J. Bruce Fields"
Cc: "ZUIDAM, Hans", linux-nfs@vger.kernel.org, "DE WITTE, PETER"
Subject: Re: Linux NFS and cached properties
Message-ID: <20120802100405.4dfc3169@notabene.brown>
In-Reply-To: <20120731122546.GA26737@fieldses.org>
References: <20120724143748.GC8570@fieldses.org>
 <20120726223607.GA28982@fieldses.org>
 <20120731150801.0a4b557b@notabene.brown>
 <20120731122546.GA26737@fieldses.org>

On Tue, 31 Jul 2012 08:25:46 -0400 "J. Bruce Fields" wrote:

> On Tue, Jul 31, 2012 at 03:08:01PM +1000, NeilBrown wrote:
> > On Thu, 26 Jul 2012 18:36:07 -0400 "J. Bruce Fields" wrote:
> > 
> > > On Tue, Jul 24, 2012 at 05:28:02PM +0000, ZUIDAM, Hans wrote:
> > > > Hi Bruce,
> > > > 
> > > > Thanks for the clarification.
> > > > 
> > > > (I'm repeating a lot of my original mail because of the Cc: list.)
> > > > 
> > > > > J. Bruce Fields
> > > > > I think that's right, though I'm curious how you're managing to hit
> > > > > that case reliably every time.  Or is this an intermittent failure?
> > > > It's an intermittent failure, but with the procedure shown below it is
> > > > fairly easy to reproduce.  The actual problem we see in our product
> > > > is because of the way external storage media are handled in user-land.
> > > > 
> > > > 192.168.1.10# mount -t xfs /dev/sdcr/sda1 /mnt
> > > > 192.168.1.10# exportfs 192.168.1.11:/mnt
> > > > 
> > > > 192.168.1.11# mount 192.168.1.10:/mnt /mnt
> > > > 192.168.1.11# umount /mnt
> > > > 
> > > > 192.168.1.10# exportfs -u 192.168.1.11:/mnt
> > > > 192.168.1.10# umount /mnt
> > > > umount: can't umount /media/recdisk: Device or resource busy
> > > > 
> > > > What I actually do is the mount/unmount on the client via ssh.  That
> > > > is a good way to trigger the problem.
> > > > 
> > > > We see that during the un-export the NFS caches are not flushed
> > > > properly, which is why the final unmount fails.
> > > > 
> > > > In net/sunrpc/cache.c the cache times (last_refresh, expiry_time,
> > > > flush_time) are measured in seconds.  If I understand the code
> > > > somewhat, an NFS un-export is done by setting flush_time to the
> > > > current time, after which cache_flush() is called.  If in that same
> > > > second last_refresh is set to the current time then the cached item
> > > > is not flushed.  This will subsequently cause the un-mount to fail
> > > > because there is still a reference to the mount point.
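To make the timing concrete, here is a rough user-space model of the check
described above.  It is not a verbatim copy of net/sunrpc/cache.c -- the
field names and the exact comparison are approximations -- but it shows why
whole-second timestamps let a same-second refresh survive a flush:

    /* Rough model only -- not the real net/sunrpc/cache.c code. */
    #include <stdbool.h>
    #include <time.h>

    struct cache_head {
            time_t expiry_time;     /* entry is invalid after this */
            time_t last_refresh;    /* when the entry was last (re)validated */
    };

    struct cache_detail {
            time_t flush_time;      /* entries refreshed before this are stale */
    };

    static bool entry_is_stale(const struct cache_detail *cd,
                               const struct cache_head *h)
    {
            time_t now = time(NULL);        /* whole seconds */

            return h->expiry_time < now || h->last_refresh < cd->flush_time;
    }

    /*
     * "exportfs -u" effectively sets cd->flush_time to the current second
     * and triggers a flush.  If an nfsd request refreshes the entry during
     * that same second, last_refresh == flush_time, entry_is_stale() is
     * false, and the entry survives -- still pinning the mount point.
     */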
> > > > 
> > > > > J. Bruce Fields
> > > > > I ran across that recently while reviewing the code to fix a related
> > > > > problem.  I'm not sure what the best fix would be.
> > > > >
> > > > > Previously raised here:
> > > > >
> > > > > http://marc.info/?l=linux-nfs&m=133514319408283&w=2
> > > > 
> > > > The description in your mail does indeed look the same as the problem
> > > > that we see.
> > > > 
> > > > From reading the code in net/sunrpc/cache.c I get the impression that
> > > > it is not really possible to reliably flush the caches for an
> > > > un-exportfs such that after flushing they will not accept entries for
> > > > the un-exported IP/mount point combination.
> > > 
> > > Right.  So, possible ideas, from that previous message:
> > > 
> > >   - As Neil suggests, modify exportfs to wait a second between
> > >     updating etab and flushing the cache.  At that point any
> > >     entries still using the old information are at least a second
> > >     old.  That may be adequate for your case, but if someone out
> > >     there is sensitive to the time required to unexport then that
> > >     will annoy them.  It also leaves the small possibility of
> > >     races where an in-progress rpc may still be using an export at
> > >     the time you try to flush.
> > >   - Implement some new interface that you can use to flush the
> > >     cache and that doesn't return until in-progress rpc's
> > >     complete.  Since it waits for rpc's it's not purely a "cache"
> > >     layer interface any more.  So maybe something like
> > >     /proc/fs/nfsd/flush_exports.
> > >   - As a workaround requiring no code changes: unexport, then shut
> > >     down the server entirely and restart it.  Clients will see
> > >     that as a reboot recovery event and recover automatically, but
> > >     applications may see delays while that happens.  Kind of a big
> > >     hammer, but if unexporting while other exports are in use is
> > >     rare maybe it would be adequate for your case.
> > 
> > That's a shame...
> > I had originally intended "rpc.nfsd 0" to simply stop all threads and
> > nothing else.  Then you would be able to:
> >    rpc.nfsd 0
> >    exportfs -f
> >    unmount
> >    rpc.nfsd 16
> > 
> > and have a nice fast race-free unmount.
> > But commit e096bbc6488d3e49d476bf986d33752709361277 'fixed' that :-(
> > 
> > I wonder if it can be resurrected ... maybe not worth the effort.
> 
> That also shut down v4 state.  Making the clients recover would
> typically be more expensive than ditching the export table.  (Did it
> also throw out NLM locks?  I can't tell on a quick check.)

No, it didn't do anything except stop all the threads.

I never liked the fact that stopping the last thread did something extra.
So when I added the ability to control the number of threads via sysfs I
made sure that it *only* controlled the number of threads.  However I kept
the legacy behaviour that sending SIGKILL to the nfsd threads would also
unexport things.  Obviously I should have documented this better.

The more I think about it, the more I'd really like to go back to that.
It really is the *right* thing to do.

> 
> > The idea of a new interface to synchronise with all threads has
> > potential and doesn't need to be at the nfsd level - it could be in
> > sunrpc.  Maybe it could be built into the current 'flush' interface.
> 
> We need to keep compatible behavior to prevent deadlocks.  (Don't want
> nfsd waiting on mountd waiting on nfsd.)
> 
> Looks like write_flush currently returns -EINVAL to anything that's not
> an integer.  So exportfs could write something new and ignore the error
> return (or try some other workaround) in the case of an old kernel.
> 
> > 1/ iterate through all non-sleeping threads, setting a flag and
> >    increasing a counter.
> > 2/ when a thread completes its current request, if it test_and_clears
> >    the flag, it atomic_dec_and_tests the counter and then wakes up
> >    some wait_queue_head.
> > 3/ the 'flush'ing thread waits on the wait_queue_head for the counter
> >    to be 0.
> > 
> > If you don't hate it I could possibly even provide some code.
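For what it's worth, here is a sketch of that scheme.  The struct and
function names (svc_sync, svc_sync_point, thread_is_busy, ...) are made up
for illustration rather than existing sunrpc code, though the primitives
(atomic_t, test_and_clear_bit, wait_event, wake_up) are the usual kernel
ones:

    #include <linux/atomic.h>
    #include <linux/bitops.h>
    #include <linux/types.h>
    #include <linux/wait.h>

    struct svc_sync {                      /* hypothetical per-service state */
            atomic_t          pending;     /* marked threads not yet past a sync point */
            unsigned long     *need_sync;  /* one bit per nfsd thread */
            wait_queue_head_t waitq;
    };

    /* 1/ the flushing task marks every thread that is currently busy
     *    (in real code this would run under the service's lock). */
    static void svc_sync_begin(struct svc_sync *s, int nr_threads,
                               bool (*thread_is_busy)(int))
    {
            int i;

            for (i = 0; i < nr_threads; i++)
                    if (thread_is_busy(i)) {
                            set_bit(i, s->need_sync);
                            atomic_inc(&s->pending);
                    }
    }

    /* 2/ each nfsd thread calls this when it completes a request. */
    static void svc_sync_point(struct svc_sync *s, int thread_id)
    {
            if (test_and_clear_bit(thread_id, s->need_sync) &&
                atomic_dec_and_test(&s->pending))
                    wake_up(&s->waitq);
    }

    /* 3/ the flushing task waits until every marked thread has finished
     *    the request it was handling when the flush started. */
    static void svc_sync_wait(struct svc_sync *s)
    {
            wait_event(s->waitq, atomic_read(&s->pending) == 0);
    }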
> That sounds reasonable to me.  So you'd just add a single such
> thread-synchronization after modifying mountd's idea of the export
> table, ok.
> 
> It still wouldn't allow an unmount in the case a client held an NLM lock
> or v4 open--but I think that's what we want.  If somebody wants a way to
> unmount even in the presence of such state, then they really need to do
> a complete shutdown.
> 
> I wonder if there's also still a use for an operation that stops all
> threads temporarily but doesn't toss any state or caches?  I'm not
> coming up with one off the top of my head.
> 
> --b.

Actually, I think you were right the first time.  The cache isn't really
well positioned, as it doesn't have a list of services to synchronise with.
We could give it one, but I don't think that is such a good idea.

We already have a way to forcibly drop all locks on a filesystem, don't we?
  /proc/fs/nfsd/unlock_filesystem

Does that unlock the filesystem from the NFSv4 perspective too?  Should it?

I wonder if it might make sense to insert a 'sync with various threads'
call in there.

NeilBrown
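For reference, driving that existing interface from user space is just a
write of the filesystem's path to the nfsd control file; the "/mnt" below
is only an example, and this says nothing about v4 state:

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            const char *fs = "/mnt\n";  /* example: path of the exported filesystem */
            int fd = open("/proc/fs/nfsd/unlock_filesystem", O_WRONLY);

            if (fd < 0)
                    return 1;
            if (write(fd, fs, strlen(fs)) < 0) {
                    close(fd);
                    return 1;
            }
            close(fd);
            return 0;
    }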