Linux NFS development
From: Bruce Fields <bfields@fieldses.org>
To: "Bradley C. Kuszmaul" <bradley.kuszmaul@oracle.com>
Cc: Chuck Lever <chuck.lever@oracle.com>,
	Jeff Layton <jlayton@poochiereds.net>,
	Trond Myklebust <trondmy@hammerspace.com>,
	Linux NFS Mailing List <linux-nfs@vger.kernel.org>
Subject: Re: directory delegations
Date: Thu, 4 Apr 2019 16:41:16 -0400
Message-ID: <20190404204116.GA27839@fieldses.org>
In-Reply-To: <9ca6e116-818b-6615-1532-47611b6fcc6f@oracle.com>

On Thu, Apr 04, 2019 at 04:03:42PM -0400, Bradley C. Kuszmaul wrote:
> It would also be possible with our file system to preallocate inode
> numbers (inumbers).
> 
> This isn't necessarily directly related to NFS, but one could
> imagine further extending NFS to allow a CREATE to happen entirely
> on the client by letting the client maintain a cache of preallocated
> inumbers.

So, we'd need new protocol to allow clients to request inode numbers,
and I guess we'd also need vfs interfaces to allow our server to request
them from various filesystems.  Naively, it sounds doable.  From what
Jeff says, this isn't a requirement for correctness, it's an
optimization for a case when the client creates and then immediately
does a stat (or readdir?).  Is that important?

--b.

> 
> Just for the fun of it, I'll tell you a little bit more about how we
> preallocate inumbers.
> 
> For Oracle's File Storage Service (FSS), inumbers are cheap to
> allocate, and it's not a big deal if a few of them end up unused.
> Unused inode numbers don't use up any space. I would imagine that
> most B-tree-based file systems are like this.   In contrast, in an
> ext-style file system, unused inumbers imply unused storage.
> 
> Furthermore, FSS never reuses inumbers when files are deleted. It
> just keeps allocating new ones.
> 
> There's a tradeoff between preallocating lots of inumbers to get
> better performance but potentially wasting the inumbers if the
> client were to crash just after getting a batch.   If you only ask
> for one at a time, you don't get much performance, but if you ask
> for 1000 at a time, there's a chance that the client could start,
> ask for 1000 and then immediately crash, and then repeat the cycle,
> quickly using up many inumbers.  Here's a 2-competitive algorithm to
> solve this problem (by "2-competitive" I mean that it's guaranteed
> to waste at most half of the inode numbers):
> 
>  * A client that has successfully created K files without crashing
> is allowed, when its preallocated cache of inumbers goes empty, to
> ask for another K inumbers.
> 
> The worst-case lossage occurs if the client crashes just after
> getting K inumbers, and those inumbers go to waste.   But we know
> that the client successfully created K files, so we are wasting at
> most half the inumbers.
> 
> For a long-running client, each time it asks for another batch of
> inumbers, it doubles the size of the request.  For the first file
> created, it does it the old-fashioned way.   For the second file, it
> preallocates a single inumber.   For the third file, it preallocates
> 2 inumbers.   On the fifth file creation, it preallocates 4
> inumbers.  And so forth.
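
The doubling rule quoted above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions, not FSS code: the
names InumberCache and CountingServer and the alloc(n) call are
hypothetical stand-ins for whatever protocol operation would hand a
client a batch of inumbers.

```python
class CountingServer:
    """Hypothetical server: hands out fresh inumbers and never reuses them."""

    def __init__(self):
        self.next_ino = 1
        self.requests = []  # size of every batch requested, for inspection

    def alloc(self, n):
        # Assumed interface: return n consecutive, previously unused inumbers.
        self.requests.append(n)
        start = self.next_ino
        self.next_ino += n
        return list(range(start, start + n))


class InumberCache:
    """Client-side cache of preallocated inumbers (illustrative sketch)."""

    def __init__(self, server):
        self.server = server
        self.cache = []
        self.created = 0  # K: files created so far without crashing

    def create(self):
        if self.created == 0:
            # First file: no history yet, so get one inumber the old way.
            ino = self.server.alloc(1)[0]
        else:
            if not self.cache:
                # Cache is empty: ask for as many inumbers as we have
                # already used, so a crash wastes at most half of them.
                self.cache = self.server.alloc(self.created)
            ino = self.cache.pop(0)
        self.created += 1
        return ino


server = CountingServer()
client = InumberCache(server)
inos = [client.create() for _ in range(8)]
print(server.requests)
```

After eight creates the server has seen batch requests of 1, 1, 2 and 4
inumbers.  If the client then fetched its next batch of 8 and crashed
immediately, 8 of the 16 inumbers handed out would be lost, which is the
worst case of one half.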
> 
> One obstacle to getting FSS to use any of these ideas is that we
> currently support only NFSv3.   We need to get an NFSv4 server
> going, and then we'll be interested in doing the server work to
> speed up these kinds of metadata workloads.
> 
> -Bradley
> 
> On 4/4/19 11:22 AM, Chuck Lever wrote:
> >
> >>On Apr 4, 2019, at 11:09 AM, Jeff Layton <jlayton@poochiereds.net> wrote:
> >>
> >>On Wed, Apr 3, 2019 at 9:06 PM bfields@fieldses.org
> >><bfields@fieldses.org> wrote:
> >>>On Wed, Apr 03, 2019 at 12:56:24PM -0400, Bradley C. Kuszmaul wrote:
> >>>>This proposal does look like it would be helpful.   How does this
> >>>>kind of proposal play out in terms of actually seeing the light of
> >>>>day in deployed systems?
> >>>We need some people to commit to implementing it.
> >>>
> >>>We have 2-3 testing events a year, so ideally we'd agree to show up with
> >>>implementations at one of those to test and hash out any issues.
> >>>
> >>>We revise the draft based on any experience or feedback we get.  If
> >>>nothing else, it looks like it needs some updates for v4.2.
> >>>
> >>>The on-the-wire protocol change seems small, and my feeling is that if
> >>>there's running code then documenting the protocol and getting it
> >>>through the IETF process shouldn't be a big deal.
> >>>
> >>>--b.
> >>>
> >>>>On 4/2/19 10:07 PM, bfields@fieldses.org wrote:
> >>>>>On Wed, Apr 03, 2019 at 02:02:54AM +0000, Trond Myklebust wrote:
> >>>>>>The create itself needs to be sync, but the attribute delegations mean
> >>>>>>that the client, not the server, is authoritative for the timestamps.
> >>>>>>So the client now owns the atime and mtime, and just sets them as part
> >>>>>>of the (asynchronous) delegreturn some time after you are done writing.
> >>>>>>
> >>>>>>Were you perhaps thinking about this earlier proposal?
> >>>>>>https://tools.ietf.org/html/draft-myklebust-nfsv4-unstable-file-creation-01
> >>>>>That's it, thanks!
> >>>>>
> >>>>>Bradley is concerned about performance of something like untar on a
> >>>>>backend filesystem with particularly high-latency metadata operations,
> >>>>>so something like your unstable file creation proposal (or actual write
> >>>>>delegations) seems like it should help.
> >>>>>
> >>>>>--b.
> >>The serialized create with something like an untar is a
> >>performance-killer though.
> >>
> >>FWIW, I'm working on something similar right now for Ceph. If a ceph
> >>client has adequate caps [1] for a directory and the dentry inode,
> >>then we should (in principle) be able to buffer up directory morphing
> >>operations and flush them out to the server asynchronously.
> >>
> >>I'm starting with unlink (mostly because it's simpler), and am mainly
> >>just returning early when we do have the right caps -- after issuing
> >>the call but before the reply comes in. We should be able to do the
> >>same for link, rename and create too. Create will require the Ceph MDS
> >>to delegate out a range of inode numbers (and that bit hasn't been
> >>implemented yet).
> >>
> >>My thinking with all of this is that the buffering of directory
> >>morphing operations is not as helpful as something like a pagecache
> >>write is, as we aren't that interested in merging operations that
> >>change the same dentry. However, being able to do them asynchronously
> >>should work really well. That should allow us to better parallelize
> >>create/link/unlink/rename on different dentries even when they are
> >>issued serially by a single task.
> >What happens if an asynchronous directory change fails (eg. ENOSPC)?
> >
> >
> >>RFC5661 doesn't currently provide for writeable directory delegations,
> >>AFAICT, but they could eventually be implemented in a similar way.
> >>
> >>[1]: cephfs capabilities (aka caps) are like a delegation for a subset
> >>of inode metadata
> >>--
> >>Jeff Layton <jlayton@poochiereds.net>
> >--
> >Chuck Lever
> >
> >
> >
