linux-api.vger.kernel.org archive mirror
* Re: Provision for filesystem specific open flags
       [not found]           ` <BN3PR0801MB2257249A7388086676CBA811AB2B0@BN3PR0801MB2257.namprd08.prod.outlook.com>
@ 2017-11-20 13:38             ` Jeff Layton
  0 siblings, 0 replies; 3+ messages in thread
From: Jeff Layton @ 2017-11-20 13:38 UTC (permalink / raw)
  To: Fu, Rodney, Matthew Wilcox
  Cc: hch@lst.de, viro@zeniv.linux.org.uk,
	linux-fsdevel@vger.kernel.org, linux-api

On Mon, 2017-11-13 at 15:16 +0000, Fu, Rodney wrote:
> > > > No.  If you want new flags bits, make a public proposal.  Maybe some 
> > > > other filesystem would also benefit from them.
> > > 
> > > Ah, I see what you mean now, thanks.
> > > 
> > > I would like to propose O_CONCURRENT_WRITE as a new open flag.  It is 
> > > currently used in the Panasas filesystem (panfs) and defined with value:
> > > 
> > > #define O_CONCURRENT_WRITE 020000000000
> > > 
> > > This flag has been provided by panfs to HPC users via the mpich 
> > > package for well over a decade.  See:
> > > 
> > > https://github.com/pmodels/mpich/blob/master/src/mpi/romio/adio/ad_panfs/ad_panfs_open6.c#L344
> > > 
> > > O_CONCURRENT_WRITE indicates to the filesystem that the application 
> > > doing the open is participating in a coordinated distributed manner 
> > > with other such applications, possibly running on different hosts.  
> > > This allows the panfs filesystem to delegate some of the cache 
> > > coherency responsibilities to the application, improving performance.
> > > 
> > > The reason this flag is used on open as opposed to having a post-open 
> > > ioctl or fcntl SETFL is to allow panfs to catch and reject opens by 
> > > applications that attempt to access files that have already been 
> > > opened by applications that have set O_CONCURRENT_WRITE.
> > OK, let me just check I understand.  Once any application has opened the inode
> > with O_CONCURRENT_WRITE, all subsequent attempts to open the same inode without
> > O_CONCURRENT_WRITE will fail.  Presumably also if somebody already has the inode
> > open without O_CONCURRENT_WRITE set, the first open with O_CONCURRENT_WRITE will
> > fail?
> 
> Yes on both counts.  Opening with O_CONCURRENT_WRITE, followed by an open
> without will fail.  Opening without O_CONCURRENT_WRITE followed by one with it
> will also fail.
> 
> > Are opens with O_RDONLY also blocked?
> 
> No they are not.  The decision to grant access is based solely on the
> O_CONCURRENT_WRITE flag.
> 
> > This feels a lot like leases ... maybe there's an opportunity to give better
> > semantics here -- rather than rejecting opens without O_CONCURRENT_WRITE, all
> > existing users could be forced to use the stricter coherency model?
> 
> I don't think that will work, at least not from the perspective of trying to
> maintain good performance.  A user that does not open with O_CONCURRENT_WRITE
> does not know how to adhere to the proper access patterns that maintain
> coherency.  To continue to allow all users access after that point, the
> filesystem will have to force all users into a non-cacheable mode.  Instead, we
> reject stray opens to allow any existing CONCURRENT_WRITE application to
> complete in a higher performance mode.
> 

(added linux-api@vger.kernel.org to the cc list...)

Actually, it feels more like O_EXLOCK / O_SHLOCK to me:

    https://www.gnu.org/software/libc/manual/html_node/Open_002dtime-Flags.html

Those are not quite the same semantics as what you're describing for
O_CONCURRENT_WRITE, but the handling of conflicts would be similar. 
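
For illustration, a minimal userspace sketch of the conflict behaviour
described above for O_CONCURRENT_WRITE (the flag value is the one quoted
from panfs earlier in the thread; the errno a conflicting open gets is an
assumption, since the thread doesn't say which error panfs returns):

    #include <fcntl.h>
    #include <stdio.h>

    /* Value quoted earlier in the thread; not defined in mainline headers. */
    #define O_CONCURRENT_WRITE 020000000000

    int main(void)
    {
        /* Coordinated writer opts in to application-managed coherency. */
        int cw_fd = open("/panfs/shared.dat", O_RDWR | O_CONCURRENT_WRITE);

        /* While CONCURRENT_WRITE openers exist, panfs rejects an open of
         * the same file without the flag (the exact errno is panfs-specific;
         * here we just report whatever comes back). */
        int fd = open("/panfs/shared.dat", O_RDWR);
        if (fd < 0)
            perror("conflicting open");

        return cw_fd < 0;
    }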

Maybe it's possible to dovetail your new flag on top of a credible
O_EXLOCK/O_SHLOCK implementation? It'd be nice to have those to
implement VFS-level share/deny locking. Most NFS and SMB servers could
make good use of it.
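
For illustration, a rough sketch of the open-time locking the glibc manual
describes for O_SHLOCK/O_EXLOCK (these flags exist on the BSDs, not on
Linux; the flock() branch is only a non-atomic Linux approximation of the
same idea):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/file.h>

    int main(void)
    {
    #ifdef O_EXLOCK
        /* BSD: take an exclusive lock atomically at open time; with
         * O_NONBLOCK the open fails instead of blocking if it is held. */
        int fd = open("shared.dat", O_RDWR | O_EXLOCK | O_NONBLOCK);
    #else
        /* Linux approximation (the lock is not atomic with the open). */
        int fd = open("shared.dat", O_RDWR);
        if (fd >= 0 && flock(fd, LOCK_EX | LOCK_NB) < 0)
            perror("flock");
    #endif
        if (fd < 0)
            perror("open");
        return 0;
    }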


-- 
Jeff Layton <jlayton@kernel.org>


* Re: Provision for filesystem specific open flags
       [not found]                 ` <BN3PR0801MB2257BBAA9D7CA6CEDCA04DECAB280-1I06WyKSH1RpbkYrVjfdjVJr2SjL+wq6nBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
@ 2017-11-20 13:53                   ` Jeff Layton
  0 siblings, 0 replies; 3+ messages in thread
From: Jeff Layton @ 2017-11-20 13:53 UTC (permalink / raw)
  To: Fu, Rodney, Dave Chinner
  Cc: Matthew Wilcox, hch@lst.de, viro@zeniv.linux.org.uk,
	linux-fsdevel@vger.kernel.org, linux-api

On Tue, 2017-11-14 at 17:35 +0000, Fu, Rodney wrote:
> > The filesystem can still choose to do that for O_DIRECT if it wants - look at
> > all the filesystems that have a "fall back to buffered IO because this is too
> > hard to implement in the direct IO path".
> 
> Yes, I agree that the filesystem can still decide to buffer IO even with
> O_DIRECT, but the application's intent is that the effects of caching are
> minimized.  Whereas with O_CONCURRENT_WRITE, the intent is to maximize caching.
> 
> > IOWs, you've got another set of custom userspace APIs that are needed to make
> > proper use of this open flag?
> 
> Yes and no.  Applications can make ioctls to the filesystem to query or set
> layout details but don't have to.  Directory level default layout attributes can
> be set up by an admin to meet the requirements of the application.
> 
> > > In panfs, a well behaved CONCURRENT_WRITE application will consider 
> > > the file's layout on storage.  Access from different machines will not 
> > > overlap within the same RAID stripe so as not to cause distributed 
> > > stripe lock contention.  Writes to the file that are page aligned can 
> > > be cached and the filesystem can aggregate multiple such writes before 
> > > writing out to storage.  Conversely, a CONCURRENT_WRITE application 
> > > that ends up colliding on the same stripe will see worse performance.  
> > > Non page aligned writes are treated by panfs as write-through and 
> > > non-cacheable, as the filesystem will have to assume that the region of 
> > > the page that is untouched by this machine might in fact be written to 
> > > on another machine.  Caching such a page and writing it out later might lead to data corruption.
> > That seems to fit the expected behaviour of O_DIRECT pretty damn closely - if
> > the app doesn't do correctly aligned and sized IO then performance is going to
> > suck, and if the app doesn't serialize access to the file correctly it can and
> > will corrupt data in the file....
> 
> I make the same case as above, that O_DIRECT and O_CONCURRENT_WRITE have
> opposite intents with respect to caching.  Our filesystem handles them
> differently, so we need to distinguish between the two.
> 
> > > The benefit of CONCURRENT_WRITE is that unlike O_DIRECT, the 
> > > application does not have to implement any caching to see good performance.
> > Sure, but it has to be aware of layout and where/how it can write, which is
> > exactly the same constraints that local filesystems place on O_DIRECT access.
> > Not convinced. The use case fits pretty neatly into expected O_DIRECT semantics
> > and behaviour, IMO.
> 
> I'd like to make a slight adjustment to my proposal.  The HPC community had
> talked about extensions to POSIX to include O_LAZY as a way for filesystems to
> relax data coherency requirements.  There is code in the ceph filesystem that
> uses that flag if defined.  Can we get O_LAZY defined?
> 
> HEC POSIX extension:
> http://www.pdsw.org/pdsw06/resources/hec-posix-extensions-sc2006-workshop.pdf
> 
> Ceph usage of O_LAZY:
> https://github.com/ceph/ceph-client/blob/1e37f2f84680fa7f8394fd444b6928e334495ccc/net/ceph/ceph_fs.c#L78


O_LAZY support was removed from the cephfs userland client in 2013:

    commit 94afedf02d07ad4678222aa66289a74b87768810
    Author: Sage Weil <sage-4GqslpFJ+cxBDgjK7y7TUQ@public.gmane.org>
    Date:   Mon Jul 8 11:24:48 2013 -0700

        client: remove O_LAZY

...part of the problem (and this may just be my lack of understanding)
is that it's not clear what O_LAZY semantics actually are. The ceph
sources have a textfile with this in it:

"-- lazy i/o integrity

  FIXME: currently missing call to flag an Fd/file has lazy.  used to be
O_LAZY on open, but no more.

  * relax data coherency
  * writes may not be visible until lazyio_propagate, fsync, close

  lazyio_propagate(int fd, off_t offset, size_t count);
   * my writes are safe

  lazyio_synchronize(int fd, off_t offset, size_t count);
   * i will see everyone else's propagated writes"


lazyio_propagate / lazyio_synchronize seem like they could be
implemented as ioctls if you don't care about other filesystems.
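
For illustration, a sketch of what such per-fd calls could look like as
ioctls -- the magic number, request numbers and struct below are
hypothetical, made up for this sketch rather than taken from ceph or any
other existing filesystem:

    #include <linux/ioctl.h>
    #include <sys/ioctl.h>
    #include <sys/types.h>

    struct lazyio_range {
            off_t  offset;
            size_t count;
    };

    /* Hypothetical ioctl definitions for the two calls quoted above. */
    #define LAZYIO_IOC_MAGIC        'L'
    #define LAZYIO_IOC_PROPAGATE    _IOW(LAZYIO_IOC_MAGIC, 1, struct lazyio_range)
    #define LAZYIO_IOC_SYNCHRONIZE  _IOW(LAZYIO_IOC_MAGIC, 2, struct lazyio_range)

    /* "my writes are safe" */
    static int lazyio_propagate(int fd, off_t offset, size_t count)
    {
            struct lazyio_range r = { .offset = offset, .count = count };
            return ioctl(fd, LAZYIO_IOC_PROPAGATE, &r);
    }

    /* "i will see everyone else's propagated writes" */
    static int lazyio_synchronize(int fd, off_t offset, size_t count)
    {
            struct lazyio_range r = { .offset = offset, .count = count };
            return ioctl(fd, LAZYIO_IOC_SYNCHRONIZE, &r);
    }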

It is possible to add new open flags (we're running low, but that's a
problem we'll hit sooner or later anyway), but before we can do anything
here, O_LAZY needs to be defined in a way that makes sense for
application developers across filesystems.

How does this change behavior on ext4, xfs or btrfs, for instance? What
about nfs or cifs?

I suggest that before you even dive into writing patches for any of
this, you draft a small manpage update for open(2). What would an
O_LAZY entry look like?
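
For illustration only, a rough sketch of what such an entry could say,
based solely on the semantics quoted above from the HEC extension and the
ceph notes (whether unsupported filesystems ignore or reject the flag is
exactly the sort of question a real draft would have to settle):

       O_LAZY
              Relax data coherency requirements for this open file.
              Writes by this process may not be visible to other
              processes (or to other nodes of a distributed filesystem),
              and their writes may not be visible to this process, until
              the data is explicitly propagated or synchronized (for
              example via fsync(2)) or the file is closed.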

-- 
Jeff Layton <jlayton@kernel.org>


* Re: Provision for filesystem specific open flags
       [not found]             ` <BN3PR0801MB2257378E7F3596E0E61C1F89AB2B0-1I06WyKSH1RpbkYrVjfdjVJr2SjL+wq6nBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
@ 2017-11-20 14:03               ` Florian Weimer
  0 siblings, 0 replies; 3+ messages in thread
From: Florian Weimer @ 2017-11-20 14:03 UTC (permalink / raw)
  To: Fu, Rodney, Bernd Schubert, Matthew Wilcox
  Cc: hch@lst.de, viro@zeniv.linux.org.uk,
	linux-fsdevel@vger.kernel.org, Linux API

On 11/13/2017 09:19 PM, Fu, Rodney wrote:
> Yes, an ioctl-based interface is possible but not ideal.  It would require an
> additional open to perform the ioctl against.  The open system call is really a
> great place to pass control information to the filesystem and any other solution
> seems less elegant.

But with the FS-specific open flag, you would have to do an open call 
with O_PATH, check that the file system is what you expect, and then 
openat the O_PATH descriptor to get a full descriptor.  If you don't 
follow this protocol, you might end up using a custom open flag with a 
different file system which has completely different semantics for the flag.

So ioctl actually is much simpler here and needs fewer system calls.

(Due to per-file bind mounts, there is no way to figure out the file 
system on which a file is located without actually opening the file.)
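
For illustration, a sketch of that protocol (the filesystem magic value is
a placeholder, and reopening via /proc/self/fd is one way to turn the
O_PATH descriptor into a full one, since openat() cannot reopen a plain
file descriptor directly):

    #define _GNU_SOURCE             /* for O_PATH */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/vfs.h>
    #include <unistd.h>

    #define MY_FS_MAGIC 0x12345678  /* placeholder statfs f_type value */

    int open_with_fs_specific_flag(const char *path, int flags)
    {
            struct statfs sb;
            char procpath[64];

            /* 1. Take a reference without going through the fs open path. */
            int pfd = open(path, O_PATH);
            if (pfd < 0)
                    return -1;

            /* 2. Check which filesystem the file actually lives on. */
            if (fstatfs(pfd, &sb) < 0 || sb.f_type != MY_FS_MAGIC) {
                    close(pfd);
                    return -1;
            }

            /* 3. Reopen the same inode with the fs-specific flags. */
            snprintf(procpath, sizeof(procpath), "/proc/self/fd/%d", pfd);
            int fd = open(procpath, flags);
            close(pfd);
            return fd;
    }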

Thanks,
Florian
