From: Trond Myklebust <trondmy@hammerspace.com>
To: "bfields@fieldses.org" <bfields@fieldses.org>,
"ffilzlnx@mindspring.com" <ffilzlnx@mindspring.com>
Cc: "linux-cachefs@redhat.com" <linux-cachefs@redhat.com>,
"linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>,
"daire@dneg.com" <daire@dneg.com>
Subject: Re: Adventures in NFS re-exporting
Date: Thu, 3 Dec 2020 22:50:46 +0000 [thread overview]
Message-ID: <72362b839eb2ecc992f41a0e7783545b13e8ecbc.camel@hammerspace.com> (raw)
In-Reply-To: <01a001d6c9c5$37eb34f0$a7c19ed0$@mindspring.com>
On Thu, 2020-12-03 at 14:39 -0800, Frank Filz wrote:
>
>
> > -----Original Message-----
> > From: Trond Myklebust [mailto:trondmy@hammerspace.com]
> > Sent: Thursday, December 3, 2020 2:14 PM
> > To: bfields@fieldses.org
> > Cc: linux-cachefs@redhat.com; ffilzlnx@mindspring.com; linux-
> > nfs@vger.kernel.org; daire@dneg.com
> > Subject: Re: Adventures in NFS re-exporting
> >
> > On Thu, 2020-12-03 at 17:04 -0500, bfields@fieldses.org wrote:
> > > On Thu, Dec 03, 2020 at 09:57:41PM +0000, Trond Myklebust wrote:
> > > > On Thu, 2020-12-03 at 13:45 -0800, Frank Filz wrote:
> > > > > > On Thu, 2020-12-03 at 16:13 -0500,
> > > > > > bfields@fieldses.org wrote:
> > > > > > > On Thu, Dec 03, 2020 at 08:27:39PM +0000, Trond Myklebust
> > > > > > > wrote:
> > > > > > > > On Thu, 2020-12-03 at 13:51 -0500, bfields wrote:
> > > > > > > > > I've been scratching my head over how to handle
> > > > > > > > > reboot of
> > > > > > > > > a
> > > > > > > > > re-
> > > > > > > > > exporting server. I think one way to fix it might be
> > > > > > > > > just
> > > > > > > > > to allow the re- export server to pass along reclaims
> > > > > > > > > to
> > > > > > > > > the original server as it receives them from its own
> > > > > > > > > clients. It might require some protocol tweaks, I'm
> > > > > > > > > not
> > > > > > > > > sure. I'll try to get my thoughts in order and
> > > > > > > > > propose
> > > > > > > > > something.
> > > > > > > > >
> > > > > > > >
> > > > > > > > It's more complicated than that. If the re-exporting
> > > > > > > > server
> > > > > > > > reboots, but the original server does not, then unless
> > > > > > > > that
> > > > > > > > re- exporting server persisted its lease and a full set
> > > > > > > > of
> > > > > > > > stateids somewhere, it will not be able to atomically
> > > > > > > > reclaim delegation and lock state on the server on
> > > > > > > > behalf of
> > > > > > > > its clients.
> > > > > > >
> > > > > > > By sending reclaims to the original server, I mean
> > > > > > > literally
> > > > > > > sending new open and lock requests with the RECLAIM bit
> > > > > > > set,
> > > > > > > which would get brand new stateids.
> > > > > > >
> > > > > > > So, the original server would invalidate the existing
> > > > > > > client's
> > > > > > > previous clientid and stateids--just as it normally would
> > > > > > > on
> > > > > > > reboot--but it would optionally remember the underlying
> > > > > > > locks
> > > > > > > held by the client and allow compatible lock reclaims.
> > > > > > >
> > > > > > > Rough attempt:
> > > > > > >
> > > > > > >
> > > > > > > https://wiki.linux-nfs.org/wiki/index.php/Reboot_recovery_for_
> > > > > > > re-expor
> > > > > > > t_servers
> > > > > > >
> > > > > > > Think it would fly?
> > > > > >
> > > > > > So this would be a variant of courtesy locks that can be
> > > > > > reclaimed by the client using the reboot reclaim variant of
> > > > > > OPEN/LOCK outside the grace period? The purpose being to
> > > > > > allow
> > > > > > reclaim without forcing the client to persist the original
> > > > > > stateid?
> > > > > >
> > > > > > Hmm... That's doable, but how about the following
> > > > > > alternative:
> > > > > > Add
> > > > > > a function
> > > > > > that allows the client to request the full list of stateids
> > > > > > that
> > > > > > the server holds on its behalf?
> > > > > >
> > > > > > I've been wanting such a function for quite a while anyway
> > > > > > in
> > > > > > order to allow the client to detect state leaks (either due
> > > > > > to
> > > > > > soft timeouts, or due to reordered close/open operations).
> > > > >
> > > > > Oh, that sounds interesting. So basically the re-export
> > > > > server
> > > > > would re-populate it's state from the original server rather
> > > > > than
> > > > > relying on it's clients doing reclaims? Hmm, but how does the
> > > > > re-export server rebuild its stateids? I guess it could make
> > > > > the
> > > > > clients repopulate them with the same "give me a dump of all
> > > > > my
> > > > > state", using the state details to match up with the old
> > > > > state and
> > > > > replacing stateids. Or did you have something different in
> > > > > mind?
> > > > >
> > > >
> > > > I was thinking that the re-export server could just use that
> > > > list of
> > > > stateids to figure out which locks can be reclaimed atomically,
> > > > and
> > > > which ones have been irredeemably lost. The assumption is that
> > > > if
> > > > you have a lock stateid or a delegation, then that means the
> > > > clients
> > > > can reclaim all the locks that were represented by that
> > > > stateid.
> > >
> > > I'm confused about how the re-export server uses that list. Are
> > > you
> > > assuming it persisted its own list across its own crash/reboot?
> > > I
> > > guess that's what I was trying to avoid having to do.
> > >
> > No. The server just uses the stateids as part of a check for 'do I
> > hold state for
> > this file on this server?'. If the answer is 'yes' and the lock
> > owners are sane, then
> > we should be able to assume the full set of locks that lock owner
> > held on that
> > file are still valid.
> >
> > BTW: if the lock owner is also returned by the server, then since
> > the lock owner
> > is an opaque value, it could, for instance, be used by the client
> > to cache info on
> > the server about which uid/gid owns these locks.
>
> Let me see if I'm understanding your idea right...
>
> Re-export server reboots within the extended lease period it's been
> given by the original server. I'm assuming it uses the same clientid?
Yes. It would have to use the same clientid.
> But would probably open new sessions. It requests the list of
> stateids. Hmm, how to make the owner information useful, nfs-ganesha
> doesn't pass on the actual client's owner but rather just passes the
> address of its record for that client owner. Maybe it will have to do
> something a bit different for this degree of re-export support...
>
> Now the re-export server knows which original client lock owners are
> allowed to reclaim state. So it just acquires locks using the
> original stateid as the client reclaims (what happens if the client
> doesn't reclaim a lock? I suppose the re-export server could unlock
> all regions not explicitly locked once reclaim is complete). Since
> the re-export server is acquiring new locks using the original
> stateid it will just overlay the original lock with the new lock and
> write locks don't conflict since they are being acquired by the same
> lock owner. Actually the original server could even balk at a
> "reclaim" in this way that wasn't originally held... And the original
> server could "refresh" the locks, and discard any that aren't
> refreshed at the end of reclaim. That part assumes the original
> server is apprised that what is actually happening is a reclaim.
>
> The re-export server can destroy any stateids that it doesn't receive
> reclaims for.
Right. That's in essence what I'm suggesting. There are corner cases to
be considered: e.g. "what happens if the re-export server crashes after
unlocking on the server, but before passing the LOCKU reply on the the
client", however I think it should be possible to figure out strategies
for those cases.
>
> Hmm, I think if the re-export server is implemented as an HA cluster,
> it should establish a clientid on the original server for each
> virtual IP (assuming that's the unit of HA) that exists. Then when
> virtual IPs are moved, the re-export server just goes through the
> above reclaim process for that clientid.
>
Yes, we could do something like that.
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com
next prev parent reply other threads:[~2020-12-03 22:51 UTC|newest]
Thread overview: 129+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-09-07 17:31 Adventures in NFS re-exporting Daire Byrne
2020-09-08 9:40 ` Mkrtchyan, Tigran
2020-09-08 11:06 ` Daire Byrne
2020-09-15 17:21 ` J. Bruce Fields
2020-09-15 19:59 ` Trond Myklebust
2020-09-16 16:01 ` Daire Byrne
2020-10-19 16:19 ` Daire Byrne
2020-10-19 17:53 ` [PATCH 0/2] Add NFSv3 emulation of the lookupp operation trondmy
2020-10-19 17:53 ` [PATCH 1/2] NFSv3: Refactor nfs3_proc_lookup() to split out the dentry trondmy
2020-10-19 17:53 ` [PATCH 2/2] NFSv3: Add emulation of the lookupp() operation trondmy
2020-10-19 20:05 ` [PATCH v2 0/2] Add NFSv3 emulation of the lookupp operation trondmy
2020-10-19 20:05 ` [PATCH v2 1/2] NFSv3: Refactor nfs3_proc_lookup() to split out the dentry trondmy
2020-10-19 20:05 ` [PATCH v2 2/2] NFSv3: Add emulation of the lookupp() operation trondmy
2020-10-20 18:37 ` [PATCH v3 0/3] Add NFSv3 emulation of the lookupp operation trondmy
2020-10-20 18:37 ` [PATCH v3 1/3] NFSv3: Refactor nfs3_proc_lookup() to split out the dentry trondmy
2020-10-20 18:37 ` [PATCH v3 2/3] NFSv3: Add emulation of the lookupp() operation trondmy
2020-10-20 18:37 ` [PATCH v3 3/3] NFSv4: Observe the NFS_MOUNT_SOFTREVAL flag in _nfs4_proc_lookupp trondmy
2020-10-21 9:33 ` Adventures in NFS re-exporting Daire Byrne
2020-11-09 16:02 ` bfields
2020-11-12 13:01 ` Daire Byrne
2020-11-12 13:57 ` bfields
2020-11-12 18:33 ` Daire Byrne
2020-11-12 20:55 ` bfields
2020-11-12 23:05 ` Daire Byrne
2020-11-13 14:50 ` bfields
2020-11-13 22:26 ` bfields
2020-11-14 12:57 ` Daire Byrne
2020-11-16 15:18 ` bfields
2020-11-16 15:53 ` bfields
2020-11-16 19:21 ` Daire Byrne
2020-11-16 15:29 ` Jeff Layton
2020-11-16 15:56 ` bfields
2020-11-16 16:03 ` Jeff Layton
2020-11-16 16:14 ` bfields
2020-11-16 16:38 ` Jeff Layton
2020-11-16 19:03 ` bfields
2020-11-16 20:03 ` Jeff Layton
2020-11-17 3:16 ` bfields
2020-11-17 3:18 ` [PATCH 1/4] nfsd: move fill_{pre,post}_wcc to nfsfh.c J. Bruce Fields
2020-11-17 3:18 ` [PATCH 2/4] nfsd: pre/post attr is using wrong change attribute J. Bruce Fields
2020-11-17 12:34 ` Jeff Layton
2020-11-17 15:26 ` J. Bruce Fields
2020-11-17 15:34 ` Jeff Layton
2020-11-20 22:38 ` J. Bruce Fields
2020-11-20 22:39 ` [PATCH 1/8] nfsd: only call inode_query_iversion in the I_VERSION case J. Bruce Fields
2020-11-20 22:39 ` [PATCH 2/8] nfsd: simplify nfsd4_change_info J. Bruce Fields
2020-11-20 22:39 ` [PATCH 3/8] nfsd: minor nfsd4_change_attribute cleanup J. Bruce Fields
2020-11-21 0:34 ` Jeff Layton
2020-11-20 22:39 ` [PATCH 4/8] nfsd4: don't query change attribute in v2/v3 case J. Bruce Fields
2020-11-20 22:39 ` [PATCH 5/8] nfs: use change attribute for NFS re-exports J. Bruce Fields
2020-11-20 22:39 ` [PATCH 6/8] nfsd: move change attribute generation to filesystem J. Bruce Fields
2020-11-21 0:58 ` Jeff Layton
2020-11-21 1:01 ` J. Bruce Fields
2020-11-21 13:00 ` Jeff Layton
2020-11-20 22:39 ` [PATCH 7/8] nfsd: skip some unnecessary stats in the v4 case J. Bruce Fields
2020-11-20 22:39 ` [PATCH 8/8] Revert "nfsd4: support change_attr_type attribute" J. Bruce Fields
2020-11-20 22:44 ` [PATCH 2/4] nfsd: pre/post attr is using wrong change attribute J. Bruce Fields
2020-11-21 1:03 ` Jeff Layton
2020-11-21 21:44 ` Daire Byrne
2020-11-22 0:02 ` bfields
2020-11-22 1:55 ` Daire Byrne
2020-11-22 3:03 ` bfields
2020-11-23 20:07 ` Daire Byrne
2020-11-17 15:25 ` J. Bruce Fields
2020-11-17 3:18 ` [PATCH 3/4] nfs: don't mangle i_version on NFS J. Bruce Fields
2020-11-17 12:27 ` Jeff Layton
2020-11-17 14:14 ` J. Bruce Fields
2020-11-17 3:18 ` [PATCH 4/4] nfs: support i_version in the NFSv4 case J. Bruce Fields
2020-11-17 12:34 ` Jeff Layton
2020-11-24 20:35 ` Adventures in NFS re-exporting Daire Byrne
2020-11-24 21:15 ` bfields
2020-11-24 22:15 ` Frank Filz
2020-11-25 14:47 ` 'bfields'
2020-11-25 16:25 ` Frank Filz
2020-11-25 19:03 ` 'bfields'
2020-11-26 0:04 ` Frank Filz
2020-11-25 17:14 ` Daire Byrne
2020-11-25 19:31 ` bfields
2020-12-03 12:20 ` Daire Byrne
2020-12-03 18:51 ` bfields
2020-12-03 20:27 ` Trond Myklebust
2020-12-03 21:13 ` bfields
2020-12-03 21:32 ` Frank Filz
2020-12-03 21:34 ` Trond Myklebust
2020-12-03 21:45 ` Frank Filz
2020-12-03 21:57 ` Trond Myklebust
2020-12-03 22:04 ` bfields
2020-12-03 22:14 ` Trond Myklebust
2020-12-03 22:39 ` Frank Filz
2020-12-03 22:50 ` Trond Myklebust [this message]
2020-12-03 23:34 ` Frank Filz
2020-12-03 22:44 ` bfields
2020-12-03 21:54 ` bfields
2020-12-03 22:45 ` bfields
2020-12-03 22:53 ` Trond Myklebust
2020-12-03 23:16 ` bfields
2020-12-03 23:28 ` Frank Filz
2020-12-04 1:02 ` Trond Myklebust
2020-12-04 1:41 ` bfields
2020-12-04 2:27 ` Trond Myklebust
2020-09-17 16:01 ` Daire Byrne
2020-09-17 19:09 ` bfields
2020-09-17 20:23 ` Frank van der Linden
2020-09-17 21:57 ` bfields
2020-09-19 11:08 ` Daire Byrne
2020-09-22 16:43 ` Chuck Lever
2020-09-23 20:25 ` Daire Byrne
2020-09-23 21:01 ` Frank van der Linden
2020-09-26 9:00 ` Daire Byrne
2020-09-28 15:49 ` Frank van der Linden
2020-09-28 16:08 ` Chuck Lever
2020-09-28 17:42 ` Frank van der Linden
2020-09-22 12:31 ` Daire Byrne
2020-09-22 13:52 ` Trond Myklebust
2020-09-23 12:40 ` J. Bruce Fields
2020-09-23 13:09 ` Trond Myklebust
2020-09-23 17:07 ` bfields
2020-09-30 19:30 ` [Linux-cachefs] " Jeff Layton
2020-10-01 0:09 ` Daire Byrne
2020-10-01 10:36 ` Jeff Layton
2020-10-01 12:38 ` Trond Myklebust
2020-10-01 16:39 ` Jeff Layton
2020-10-05 12:54 ` Daire Byrne
2020-10-13 9:59 ` Daire Byrne
2020-10-01 18:41 ` J. Bruce Fields
2020-10-01 19:24 ` Trond Myklebust
2020-10-01 19:26 ` bfields
2020-10-01 19:29 ` Trond Myklebust
2020-10-01 19:51 ` bfields
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=72362b839eb2ecc992f41a0e7783545b13e8ecbc.camel@hammerspace.com \
--to=trondmy@hammerspace.com \
--cc=bfields@fieldses.org \
--cc=daire@dneg.com \
--cc=ffilzlnx@mindspring.com \
--cc=linux-cachefs@redhat.com \
--cc=linux-nfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).