linux-fsdevel.vger.kernel.org archive mirror
From: Jeff Layton <jlayton@kernel.org>
To: "Fu, Rodney" <rfu@panasas.com>, Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <willy@infradead.org>, "hch@lst.de" <hch@lst.de>,
	"viro@zeniv.linux.org.uk" <viro@zeniv.linux.org.uk>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	linux-api <linux-api@vger.kernel.org>
Subject: Re: Provision for filesystem specific open flags
Date: Mon, 20 Nov 2017 08:53:41 -0500	[thread overview]
Message-ID: <1511186021.4228.20.camel@kernel.org> (raw)
In-Reply-To: <BN3PR0801MB2257BBAA9D7CA6CEDCA04DECAB280@BN3PR0801MB2257.namprd08.prod.outlook.com>

On Tue, 2017-11-14 at 17:35 +0000, Fu, Rodney wrote:
> > The filesystem can still choose to do that for O_DIRECT if it wants - look at
> > all the filesystems that have a "fall back to buffered IO because this is too
> > hard to implement in the direct IO path".
> 
> Yes, I agree that the filesystem can still decide to buffer IO even with
> O_DIRECT, but the application's intent is that the effects of caching are
> minimized.  Whereas with O_CONCURRENT_WRITE, the intent is to maximize caching.
> 
> > IOWs, you've got another set of custom userspace APIs that are needed to make
> > proper use of this open flag?
> 
> Yes and no.  Applications can make ioctls to the filesystem to query or set
> layout details but don't have to.  Directory level default layout attributes can
> be set up by an admin to meet the requirements of the application.
> 
> > > In panfs, a well behaved CONCURRENT_WRITE application will consider 
> > > the file's layout on storage.  Access from different machines will not 
> > > overlap within the same RAID stripe so as not to cause distributed 
> > > stripe lock contention.  Writes to the file that are page aligned can 
> > > be cached and the filesystem can aggregate multiple such writes before 
> > > writing out to storage.  Conversely, a CONCURRENT_WRITE application 
> > > that ends up colliding on the same stripe will see worse performance.  
> > > Non page aligned writes are treated by panfs as write-through and 
> > > non-cachable, as the filesystem will have to assume that the region of 
> > > the page that is untouched by this machine might in fact be written to 
> > > on another machine.  Caching such a page and writing it out later might lead to data corruption.
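The caching rule described above can be sketched as a simple predicate. The page and stripe sizes below are illustrative values, not panfs's actual layout, which a real client would learn from the filesystem:

```c
#include <stdbool.h>

/* Illustrative sizes only: 4 KiB pages, 64 KiB stripes. A real panfs
 * client gets the layout from the filesystem, not from constants. */
#define EXAMPLE_PAGE_SIZE   4096UL
#define EXAMPLE_STRIPE_SIZE (64 * 1024UL)

/* Under the scheme described above, a write is cacheable only if it
 * is page aligned and stays within one RAID stripe; partial pages
 * must be treated as write-through because another machine may own
 * the untouched part of the page. */
static bool cw_write_is_cacheable(unsigned long offset, unsigned long len)
{
    if (len == 0 || offset % EXAMPLE_PAGE_SIZE || len % EXAMPLE_PAGE_SIZE)
        return false;   /* partial pages: write-through, uncached */
    /* must not span a stripe boundary shared with other clients */
    return offset / EXAMPLE_STRIPE_SIZE ==
           (offset + len - 1) / EXAMPLE_STRIPE_SIZE;
}
```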
> > That seems to fit the expected behaviour of O_DIRECT pretty damn closely - if
> > the app doesn't do correctly aligned and sized IO then performance is going to
> > suck, and if the app doesn't serialize access to the file correctly it can and
> > will corrupt data in the file....
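For comparison, here is a rough sketch of the bookkeeping O_DIRECT pushes onto the application; the 512-byte alignment is an illustrative logical block size (real code should query the device), and the zero fallback for O_DIRECT is only there so the sketch compiles where the flag is unavailable:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE     /* for O_DIRECT */
#endif
#include <fcntl.h>
#include <stdlib.h>

#ifndef O_DIRECT
#define O_DIRECT 0      /* illustration-only fallback, not a real convention */
#endif

/* Illustrative logical block size; query the device in real code. */
#define EXAMPLE_BLOCK_SIZE 512

/* O_DIRECT requires the buffer (and the file offset and length) to be
 * suitably aligned; allocate an aligned buffer for it. */
static void *direct_io_buf(size_t len)
{
    void *buf = NULL;
    if (posix_memalign(&buf, EXAMPLE_BLOCK_SIZE, len) != 0)
        return NULL;
    return buf;
}

/* Open for direct IO; may fail on filesystems without O_DIRECT support. */
static int direct_io_open(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_DIRECT, 0600);
}
```

The contrast with the proposal is that an O_CONCURRENT_WRITE-style flag lets the filesystem do the caching, rather than demanding this alignment discipline from the application.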
> 
> I make the same case as above, that O_DIRECT and O_CONCURRENT_WRITE have
> opposite intents with respect to caching.  Our filesystem handles them
> differently, so we need to distinguish between the two.
> 
> > > The benefit of CONCURRENT_WRITE is that unlike O_DIRECT, the 
> > > application does not have to implement any caching to see good performance.
> > Sure, but it has to be aware of layout and where/how it can write, which is
> > exactly the same constraints that local filesystems place on O_DIRECT access.
> > Not convinced. The use case fits pretty neatly into expected O_DIRECT semantics
> > and behaviour, IMO.
> 
> I'd like to make a slight adjustment to my proposal.  The HPC community had
> talked about extensions to POSIX to include O_LAZY as a way for filesystems to
> relax data coherency requirements.  There is code in the ceph filesystem that
> uses that flag if defined.  Can we get O_LAZY defined?
> 
> HEC POSIX extension:
> http://www.pdsw.org/pdsw06/resources/hec-posix-extensions-sc2006-workshop.pdf
> 
> Ceph usage of O_LAZY:
> https://github.com/ceph/ceph-client/blob/1e37f2f84680fa7f8394fd444b6928e334495ccc/net/ceph/ceph_fs.c#L78


O_LAZY support was removed from cephfs userland client in 2013:

    commit 94afedf02d07ad4678222aa66289a74b87768810
    Author: Sage Weil <sage@inktank.com>
    Date:   Mon Jul 8 11:24:48 2013 -0700

        client: remove O_LAZY

...part of the problem (and this may just be my lack of understanding)
is that it's not clear what O_LAZY semantics actually are. The ceph
sources have a textfile with this in it:

"-- lazy i/o integrity

  FIXME: currently missing call to flag an Fd/file has lazy.  used to be
O_LAZY on open, but no more.

  * relax data coherency
  * writes may not be visible until lazyio_propagate, fsync, close

  lazyio_propagate(int fd, off_t offset, size_t count);
   * my writes are safe

  lazyio_synchronize(int fd, off_t offset, size_t count);
   * i will see everyone else's propagated writes"


Those lazyio_propagate() / lazyio_synchronize() calls seem like they
could be implemented as ioctls if you don't care about other filesystems.
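If one did go the ioctl route, it might look roughly like this. The command numbers and the range struct are invented for illustration; no kernel implements them:

```c
#include <sys/ioctl.h>
#include <sys/types.h>
#include <stddef.h>

/* Hypothetical: lazyio_propagate()/lazyio_synchronize() as
 * filesystem-private ioctls. The command numbers and struct layout
 * are made up for this sketch; nothing in the kernel defines them. */
struct lazyio_range {
    off_t  offset;
    size_t count;
};

#define LAZYIO_IOC_PROPAGATE   _IOW('f', 0x80, struct lazyio_range)
#define LAZYIO_IOC_SYNCHRONIZE _IOW('f', 0x81, struct lazyio_range)

/* "my writes are safe": push this range where others can see it */
static int lazyio_propagate(int fd, off_t offset, size_t count)
{
    struct lazyio_range r = { .offset = offset, .count = count };
    return ioctl(fd, LAZYIO_IOC_PROPAGATE, &r);
}

/* "i will see everyone else's propagated writes" */
static int lazyio_synchronize(int fd, off_t offset, size_t count)
{
    struct lazyio_range r = { .offset = offset, .count = count };
    return ioctl(fd, LAZYIO_IOC_SYNCHRONIZE, &r);
}
```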

It is possible to add new open flags (we're running low, but that's a
problem we'll hit sooner or later anyway), but before we can do anything
here, O_LAZY needs to be defined in a way that makes sense for
application developers across filesystems.
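Until that happens, about the only portable thing an application can do is guard on the macro at compile time; a sketch, where the zero fallback (plain coherency) is an assumption for illustration rather than an established convention:

```c
#include <fcntl.h>

/* O_LAZY is not in current glibc or kernel uapi headers, so the
 * fallback branch is what actually compiles today; if such a flag
 * were ever merged, a rebuild would pick it up. Falling back to 0
 * (normal coherency) is an illustrative choice only. */
#ifndef O_LAZY
#define O_LAZY 0
#endif

/* Open with relaxed data coherency where supported, normal open(2)
 * semantics everywhere else. */
static int open_lazy(const char *path, int flags)
{
    return open(path, flags | O_LAZY, 0666);
}
```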

How does this change behavior on ext4, xfs or btrfs, for instance? What
about nfs or cifs?

I suggest that before you even dive into writing patches for any of
this, that you draft a small manpage update for open(2). What would an
O_LAZY entry look like?

-- 
Jeff Layton <jlayton@kernel.org>

  reply	other threads:[~2017-11-20 13:53 UTC|newest]

Thread overview: 18+ messages
2017-11-10 16:49 Provision for filesystem specific open flags Fu, Rodney
2017-11-10 17:23 ` hch
2017-11-10 17:39   ` Fu, Rodney
2017-11-10 19:29     ` Matthew Wilcox
2017-11-10 21:04       ` Fu, Rodney
2017-11-11  0:37         ` Matthew Wilcox
2017-11-13 15:16           ` Fu, Rodney
2017-11-20 13:38             ` Jeff Layton
2017-11-13  0:48         ` Dave Chinner
2017-11-13 17:02           ` Fu, Rodney
2017-11-13 21:58             ` Dave Chinner
2017-11-14 17:35               ` Fu, Rodney
2017-11-20 13:53                 ` Jeff Layton [this message]
2017-12-04  5:29                 ` NeilBrown
2017-12-05 21:36                   ` Andreas Dilger
2017-11-13 17:45         ` Bernd Schubert
2017-11-13 20:19           ` Fu, Rodney
2017-11-20 14:03             ` Florian Weimer
