linux-nfs.vger.kernel.org archive mirror
From: NeilBrown <neilb@suse.com>
To: Jeff Layton <jlayton@kernel.org>,
	"J. Bruce Fields" <bfields@fieldses.org>,
	Joshua Watt <jpewhacker@gmail.com>
Cc: linux-nfs@vger.kernel.org
Subject: Re: NFS Force Unmounting
Date: Wed, 01 Nov 2017 11:53:18 +1100
Message-ID: <8760aux1j5.fsf@notabene.neil.brown.name>
In-Reply-To: <1509460909.4553.37.camel@kernel.org>

On Tue, Oct 31 2017, Jeff Layton wrote:

> On Tue, 2017-10-31 at 08:09 +1100, NeilBrown wrote:
>> On Mon, Oct 30 2017, J. Bruce Fields wrote:
>> 
>> > On Wed, Oct 25, 2017 at 12:11:46PM -0500, Joshua Watt wrote:
>> > > I'm working on a networking embedded system where NFS servers can come
>> > > and go from the network, and I've discovered that the Kernel NFS server
>> > 
>> > For "Kernel NFS server", I think you mean "Kernel NFS client".
>> > 
>> > > make it difficult to cleanup applications in a timely manner when the
>> > > server disappears (and yes, I am mounting with "soft" and relatively
>> > > short timeouts). I currently have a user space mechanism that can
>> > > quickly detect when the server disappears, and does a umount() with the
>> > > MNT_FORCE and MNT_DETACH flags. Using MNT_DETACH prevents new accesses
>> > > to files on the defunct remote server, and I have traced through the
>> > > code to see that MNT_FORCE does indeed cancel any current RPC tasks
>> > > with -EIO. However, this isn't sufficient for my use case because if a
>> > > user space application isn't currently waiting on an RPC task that gets
>> > > canceled, it will have to timeout again before it detects the
>> > > disconnect. For example, if a simple client is copying a file from the
>> > > NFS server, and happens to not be waiting on the RPC task in the read()
>> > > call when umount() occurs, it will be none the wiser and loop around to
>> > > call read() again, which must then try the whole NFS timeout + recovery
>> > > before the failure is detected. If a client is more complex and has a
>> > > lot of open file descriptors, it will typically have to wait for each one
>> > > to timeout, leading to very long delays.
>> > > 
>> > > The (naive?) solution seems to be to add some flag in either the NFS
>> > > client or the RPC client that gets set in nfs_umount_begin(). This
>> > > would cause all subsequent operations to fail with an error code
>> > > instead of having to be queued as an RPC task and then timing
>> > > out. In our example client, the application would then get the -EIO
>> > > immediately on the next (and all subsequent) read() calls.
>> > > 
>> > > There does seem to be some precedent for doing this (especially with
>> > > network file systems), as both cifs (CifsExiting) and ceph
>> > > (CEPH_MOUNT_SHUTDOWN) appear to implement this behavior (at least from
>> > > looking at the code. I haven't verified runtime behavior).
>> > > 
>> > > Are there any pitfalls I'm oversimplifying?
>> > 
>> > I don't know.
>> > 
>> > In the hard case I don't think you'd want to do something like
>> > this--applications expect mounts to stay pinned while they're using
>> > them, not to get -EIO.  In the soft case maybe an exception like this
>> > makes sense.
>> 
>> Applications also expect to get responses to read() requests, and expect
>> fsync() to complete, but if the server has melted down, that isn't
>> going to happen.  Sometimes unexpected errors are better than unexpected
>> infinite delays.
>> 
>> I think we need a reliable way to unmount an NFS filesystem mounted from
>> a non-responsive server.  Maybe that just means fixing all the places
>> where we use TASK_UNINTERRUPTIBLE when waiting for the server.  That
>> would allow processes accessing the filesystem to be killed.  I don't
>> know if that would meet Joshua's needs.
>> 
>> Last time this came up, Trond didn't want to make MNT_FORCE too strong as
>> it only makes sense to be forceful on the final unmount, and we cannot
>> know if this is the "final" unmount (no other bind-mounts around) until
>> much later than ->umount_prepare.  Maybe umount is the wrong interface.
>> Maybe we should expose "struct nfs_client" (or maybe "struct
>> nfs_server") objects via sysfs so they can be marked "dead" (or similar)
>> meaning that all IO should fail.
>> 
>
> I like this idea.
>
> Note that we already have some per-rpc_xprt / per-rpc_clnt info in
> debugfs sunrpc dir. We could make some writable files in there, to allow
> you to kill off individual RPCs or maybe mark a whole clnt and/or xprt
> dead in some fashion.
>
> I don't really have a good feel for what this interface should look like
> yet. debugfs is attractive here, as it's supposedly not part of the
> kernel ABI guarantee. That allows us to do some experimentation in this
> area, without making too big an initial commitment.

debugfs might be attractive to kernel developers: "all care but not
responsibility", but not so much to application developers (though I do
realize that your approach was "something to experiment with" so maybe
that doesn't matter).

My particular focus is to make systemd shutdown completely reliable.  It
should not block indefinitely on any condition, including inaccessible
servers and broken networks.

In stark contrast to Chuck's suggestion that


   Any RPC that might alter cached data/metadata is not, but others
   would be safe.

("safe" here meaning "safe to kill the RPC"), I think that everything
can and should be killed.  Maybe the first step is to purge any dirty
pages from the cache.
- if the server is up, we write the data
- if we are happy to wait, we wait
- otherwise (the case I'm interested in), we just destroy anything
  that gets in the way of unmounting the filesystem.

I'd also like to make the interface completely generic.  I'd rather
systemd didn't need to know any specific details about nfs (it already
does to some extent - it knows it is a "remote" filesystem) but
I'd rather not require more.

Maybe I could just sweep the problem under the carpet and use lazy
unmounts.  That hides some of the problem, but doesn't stop sync(2) from
blocking indefinitely.  And once you have done the lazy unmount, there
is no longer any opportunity to use MNT_FORCE.
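
To make the ordering concern concrete, here is a minimal sketch of a
user-space helper that tries MNT_FORCE before falling back to
MNT_DETACH.  The function name force_then_lazy_umount() is hypothetical;
only umount2() and the two flags are real.  It assumes the caller has
the privilege to unmount, and it does nothing about sync(2) blocking.

```c
#include <errno.h>
#include <sys/mount.h>

/*
 * Hypothetical helper: attempt a forced unmount first, and fall back to
 * a lazy (detached) unmount only if the forced attempt reports EBUSY.
 * Returns 0 on success, or -1 with errno set by umount2().
 */
static int force_then_lazy_umount(const char *target)
{
    /* Forced unmount: asks the filesystem to abort in-flight requests. */
    if (umount2(target, MNT_FORCE) == 0)
        return 0;

    /* Any error other than "still in use" is reported to the caller. */
    if (errno != EBUSY)
        return -1;

    /* Lazy unmount: detach the mount from the namespace now. */
    return umount2(target, MNT_DETACH);
}
```

The order is deliberate: once the MNT_DETACH call succeeds the mount is
no longer reachable by path, so a later umount2(target, MNT_FORCE) has
nothing to act on.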

Another way to think about this is to consider the bdi rather than the
mount point.  If the NFS server is never coming back, then the "backing
device" is broken.  If /sys/class/bdi/* contained suitable information
to identify the right backing device, and had some way to "terminate
with extreme prejudice", then an admin process (like systemd or
anything else) could choose to terminate a bdi that was not working
properly.
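
As a rough illustration of the "identify the right backing device" half
of this, the sketch below maps a mount point to a /sys/class/bdi/
directory by reading a mountinfo-format file such as
/proc/self/mountinfo.  It assumes the bdi is named "<major>:<minor>"
after the device number shown in mountinfo, which is how NFS bdis are
named; other filesystems may differ.  The function name
bdi_path_for_mount() is hypothetical, and no "terminate" control exists
under that directory today, so this only locates it.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: look up 'mount_point' in a mountinfo-format file
 * and write the matching "/sys/class/bdi/<major>:<minor>" path to 'out'.
 * Returns 0 on success, -1 if the mount is not found or on error.
 * Mount points with escaped characters (e.g. "\040" for a space) are
 * not decoded in this sketch.
 */
static int bdi_path_for_mount(const char *mountinfo, const char *mount_point,
                              char *out, size_t outlen)
{
    FILE *f = fopen(mountinfo, "r");
    char line[4096];
    int found = -1;

    if (f == NULL)
        return -1;

    while (fgets(line, sizeof(line), f) != NULL) {
        unsigned int maj, min;
        char mnt[2048];
        int n;

        /* Fields: mount-id parent-id major:minor root mount-point ... */
        if (sscanf(line, "%*d %*d %u:%u %*s %2047s", &maj, &min, mnt) != 3)
            continue;
        if (strcmp(mnt, mount_point) != 0)
            continue;

        /* Keep scanning: a later entry is mounted on top of an earlier one. */
        n = snprintf(out, outlen, "/sys/class/bdi/%u:%u", maj, min);
        found = (n >= 0 && (size_t)n < outlen) ? 0 : -1;
    }

    fclose(f);
    return found;
}
```

Reading mountinfo rather than calling stat() on the mount point is a
deliberate choice here, since stat() on a mount whose server is gone
could itself block.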

We would need quite a bit of integration so that this "terminate"
command would take effect, cause various syscalls to return EIO, purge
dirty memory, avoid stalling sync().  But hopefully it would be
a well defined interface and a good starting point.

If the bdi provided more information and more control, it would be a lot
safer to use lazy unmounts, as we could then work with the filesystem
even after it had been unmounted.

Maybe I'll try playing with bdis in my spare time (if I ever find out
what "spare time" is).

Thanks,
NeilBrown
