* Re: Provision for filesystem specific open flags
       [not found] ` <BN3PR0801MB2257249A7388086676CBA811AB2B0@BN3PR0801MB2257.namprd08.prod.outlook.com>
@ 2017-11-20 13:38 ` Jeff Layton
From: Jeff Layton @ 2017-11-20 13:38 UTC
To: Fu, Rodney, Matthew Wilcox
Cc: hch@lst.de, viro@zeniv.linux.org.uk, linux-fsdevel@vger.kernel.org, linux-api

On Mon, 2017-11-13 at 15:16 +0000, Fu, Rodney wrote:
> > > > No. If you want new flags bits, make a public proposal. Maybe some
> > > > other filesystem would also benefit from them.
> > >
> > > Ah, I see what you mean now, thanks.
> > >
> > > I would like to propose O_CONCURRENT_WRITE as a new open flag. It is
> > > currently used in the Panasas filesystem (panfs) and defined with value:
> > >
> > > #define O_CONCURRENT_WRITE 020000000000
> > >
> > > This flag has been provided by panfs to HPC users via the mpich
> > > package for well over a decade. See:
> > >
> > > https://github.com/pmodels/mpich/blob/master/src/mpi/romio/adio/ad_panfs/ad_panfs_open6.c#L344
> > >
> > > O_CONCURRENT_WRITE indicates to the filesystem that the application
> > > doing the open is participating in a coordinated distributed manner
> > > with other such applications, possibly running on different hosts.
> > > This allows the panfs filesystem to delegate some of the cache
> > > coherency responsibilities to the application, improving performance.
> > >
> > > The reason this flag is used on open as opposed to having a post-open
> > > ioctl or fcntl SETFL is to allow panfs to catch and reject opens by
> > > applications that attempt to access files that have already been
> > > opened by applications that have set O_CONCURRENT_WRITE.
> >
> > OK, let me just check I understand. Once any application has opened the inode
> > with O_CONCURRENT_WRITE, all subsequent attempts to open the same inode without
> > O_CONCURRENT_WRITE will fail. Presumably also if somebody already has the inode
> > open without O_CONCURRENT_WRITE set, the first open with O_CONCURRENT_WRITE will
> > fail?
>
> Yes on both counts. Opening with O_CONCURRENT_WRITE, followed by an open
> without will fail. Opening without O_CONCURRENT_WRITE followed by one with it
> will also fail.
>
> > Are opens with O_RDONLY also blocked?
>
> No they are not. The decision to grant access is based solely on the
> O_CONCURRENT_WRITE flag.
>
> > This feels a lot like leases ... maybe there's an opportunity to give better
> > semantics here -- rather than rejecting opens without O_CONCURRENT_WRITE, all
> > existing users could be forced to use the stricter coherency model?
>
> I don't think that will work, at least not from the perspective of trying to
> maintain good performance. A user that does not open with O_CONCURRENT_WRITE
> does not know how to adhere to the proper access patterns that maintain
> coherency. To continue to allow all users access after that point, the
> filesystem will have to force all users into a non-cacheable mode. Instead, we
> reject stray opens to allow any existing CONCURRENT_WRITE application to
> complete in a higher performance mode.
>

(added linux-api@vger.kernel.org to the cc list...)

Actually, it feels more like O_EXLOCK / O_SHLOCK to me:

    https://www.gnu.org/software/libc/manual/html_node/Open_002dtime-Flags.html

Those are not quite the same semantics as what you're describing for
O_CONCURRENT_WRITE, but the handling of conflicts would be similar.

Maybe it's possible to dovetail your new flag on top of a credible
O_EXLOCK/O_SHLOCK implementation? It'd be nice to have those to
implement VFS-level share/deny locking. Most NFS and SMB servers could
make good use of it.
-- 
Jeff Layton <jlayton@kernel.org>
[parent not found: <20171113004855.GV4094@dastard>]
[parent not found: <BN3PR0801MB225771FD9BBD14A99A9358F7AB2B0@BN3PR0801MB2257.namprd08.prod.outlook.com>]
[parent not found: <20171113215847.GY4094@dastard>]
[parent not found: <BN3PR0801MB2257BBAA9D7CA6CEDCA04DECAB280@BN3PR0801MB2257.namprd08.prod.outlook.com>]
[parent not found: <BN3PR0801MB2257BBAA9D7CA6CEDCA04DECAB280-1I06WyKSH1RpbkYrVjfdjVJr2SjL+wq6nBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>]
* Re: Provision for filesystem specific open flags
       [not found] ` <BN3PR0801MB2257BBAA9D7CA6CEDCA04DECAB280-1I06WyKSH1RpbkYrVjfdjVJr2SjL+wq6nBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
@ 2017-11-20 13:53 ` Jeff Layton
From: Jeff Layton @ 2017-11-20 13:53 UTC
To: Fu, Rodney, Dave Chinner
Cc: Matthew Wilcox, hch@lst.de, viro@zeniv.linux.org.uk, linux-fsdevel@vger.kernel.org, linux-api

On Tue, 2017-11-14 at 17:35 +0000, Fu, Rodney wrote:
> > The filesystem can still choose to do that for O_DIRECT if it wants - look at
> > all the filesystems that have a "fall back to buffered IO because this is too
> > hard to implement in the direct Io path".
>
> Yes, I agree that the filesystem can still decide to buffer IO even with
> O_DIRECT, but the application's intent is that the effects of caching are
> minimized. Whereas with O_CONCURRENT_WRITE, the intent is to maximize caching.
>
> > IOWs, you've got another set of custom userspace APIs that are needed to make
> > proper use of this open flag?
>
> Yes and no. Applications can make ioctls to the filesystem to query or set
> layout details but don't have to. Directory level default layout attributes can
> be set up by an admin to meet the requirements of the application.
>
> > > In panfs, a well behaved CONCURRENT_WRITE application will consider
> > > the file's layout on storage. Access from different machines will not
> > > overlap within the same RAID stripe so as not to cause distributed
> > > stripe lock contention. Writes to the file that are page aligned can
> > > be cached and the filesystem can aggregate multiple such writes before
> > > writing out to storage. Conversely, a CONCURRENT_WRITE application
> > > that ends up colliding on the same stripe will see worse performance.
> > > Non page aligned writes are treated by panfs as write-through and
> > > non-cachable, as the filesystem will have to assume that the region of
> > > the page that is untouched by this machine might in fact be written to
> > > on another machine. Caching such a page and writing it out later
> > > might lead to data corruption.
> >
> > That seems to fit the expected behaviour of O_DIRECT pretty damn closely - if
> > the app doesn't do correctly aligned and sized IO then performance is going to
> > suck, and if the apps doesn't serialize access to the file correctly it can and
> > will corrupt data in the file....
>
> I make the same case as above, that O_DIRECT and O_CONCURRENT_WRITE have
> opposite intents with respect to caching. Our filesystem handles them
> differently, so we need to distinguish between the two.
>
> > > The benefit of CONCURRENT_WRITE is that unlike O_DIRECT, the
> > > application does not have to implement any caching to see good
> > > performance.
> >
> > Sure, but it has to be aware of layout and where/how it can write, which is
> > exactly the same constraints that local filesystems place on O_DIRECT access.
> >
> > Not convinced. The use case fits pretty neatly into expected O_DIRECT semantics
> > and behaviour, IMO.
>
> I'd like to make a slight adjustment to my proposal. The HPC community had
> talked about extensions to POSIX to include O_LAZY as a way for filesystems to
> relax data coherency requirements. There is code in the ceph filesystem that
> uses that flag if defined. Can we get O_LAZY defined?
>
> HEC POSIX extension:
> http://www.pdsw.org/pdsw06/resources/hec-posix-extensions-sc2006-workshop.pdf
>
> Ceph usage of O_LAZY:
> https://github.com/ceph/ceph-client/blob/1e37f2f84680fa7f8394fd444b6928e334495ccc/net/ceph/ceph_fs.c#L78

O_LAZY support was removed from cephfs userland client in 2013:

    commit 94afedf02d07ad4678222aa66289a74b87768810
    Author: Sage Weil <sage-4GqslpFJ+cxBDgjK7y7TUQ@public.gmane.org>
    Date:   Mon Jul 8 11:24:48 2013 -0700

        client: remove O_LAZY

...part of the problem (and this may just be my lack of understanding)
is that it's not clear what O_LAZY semantics actually are. The ceph
sources have a textfile with this in it:

    "-- lazy i/o integrity

      FIXME: currently missing call to flag an Fd/file has lazy. used to
      be O_LAZY on open, but no more.

      * relax data coherency
      * writes may not be visible until lazyio_propagate, fsync, close

      lazyio_propagate(int fd, off_t offset, size_t count);
       * my writes are safe

      lazyio_synchronize(int fd, off_t offset, size_t count);
       * i will see everyone else's propagated writes

lazyio_propagate / lazyio_synchronize. Those seem like they could be
implemented as ioctls if you don't care about other filesystems.

It is possible to add new open flags (we're running low, but that's a
problem we'll hit sooner or later anyway), but before we can do anything
here, O_LAZY needs to be defined in a way that makes sense for
application developers across filesystems.

How does this change behavior on ext4, xfs or btrfs, for instance? What
about nfs or cifs?

I suggest that before you even dive into writing patches for any of
this, that you draft a small manpage update for open(2). What would an
O_LAZY entry look like?

-- 
Jeff Layton <jlayton-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
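
[Editorial sketch] The remark above that lazyio_propagate / lazyio_synchronize could be implemented as ioctls can be pictured roughly as follows. Everything except the two function signatures (which come from the quoted ceph text file) is an assumption made for illustration: the `struct lazyio_range` layout, the `'z'` magic, and the command numbers are hypothetical, and no file system is claimed to implement these requests.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/types.h>

/* Hypothetical argument block: the byte range the call applies to. */
struct lazyio_range {
    int64_t  offset;
    uint64_t count;
};

/* Hypothetical ioctl numbers; a real interface would have to reserve its own. */
#define LAZYIO_IOC_MAGIC        'z'
#define LAZYIO_IOC_PROPAGATE    _IOW(LAZYIO_IOC_MAGIC, 1, struct lazyio_range)
#define LAZYIO_IOC_SYNCHRONIZE  _IOW(LAZYIO_IOC_MAGIC, 2, struct lazyio_range)

/* Pack the range and hand it to the file system behind fd. */
static int lazyio_ioctl(int fd, unsigned long request, off_t offset, size_t count)
{
    assert(offset >= 0);
    struct lazyio_range range = { .offset = offset, .count = count };
    return ioctl(fd, request, &range);
}

/* "my writes are safe": push this client's writes in the range out. */
int lazyio_propagate(int fd, off_t offset, size_t count)
{
    return lazyio_ioctl(fd, LAZYIO_IOC_PROPAGATE, offset, count);
}

/* "i will see everyone else's propagated writes" for the range. */
int lazyio_synchronize(int fd, off_t offset, size_t count)
{
    return lazyio_ioctl(fd, LAZYIO_IOC_SYNCHRONIZE, offset, count);
}
```

A file system that does not know these requests would be expected to fail the call (typically with ENOTTY), which illustrates the portability concern raised in the message: the calls only mean something on the file system that defines them.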
[parent not found: <15b7fb2c-0ab7-014c-025f-b95d254e75d0@fastmail.fm>]
[parent not found: <BN3PR0801MB2257378E7F3596E0E61C1F89AB2B0@BN3PR0801MB2257.namprd08.prod.outlook.com>]
[parent not found: <BN3PR0801MB2257378E7F3596E0E61C1F89AB2B0-1I06WyKSH1RpbkYrVjfdjVJr2SjL+wq6nBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>]
* Re: Provision for filesystem specific open flags
       [not found] ` <BN3PR0801MB2257378E7F3596E0E61C1F89AB2B0-1I06WyKSH1RpbkYrVjfdjVJr2SjL+wq6nBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
@ 2017-11-20 14:03 ` Florian Weimer
From: Florian Weimer @ 2017-11-20 14:03 UTC
To: Fu, Rodney, Bernd Schubert, Matthew Wilcox
Cc: hch@lst.de, viro@zeniv.linux.org.uk, linux-fsdevel@vger.kernel.org, Linux API

On 11/13/2017 09:19 PM, Fu, Rodney wrote:
> Yes, an ioctl open is possible but not ideal. The interface would require an
> additional open to perform the ioctl against. The open system call is really a
> great place to pass control information to the filesystem and any other solution
> seems less elegant.

But with the FS-specific open flag, you would have to do an open call
with O_PATH, check that the file system is what you expect, and then
openat the O_PATH descriptor to get a full descriptor. If you don't
follow this protocol, you might end up using a custom open flag with a
different file system which has completely different semantics for the
flag.

So ioctl actually is much simpler here and needs fewer system calls.

(Due to per-file bind mounts, there is no way to figure out the file
system on which a file is located without actually opening the file.)

Thanks,
Florian
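
[Editorial sketch] The three-step protocol described in the message above (open with O_PATH, check the file system, then obtain a full descriptor) might look roughly like this on Linux. The helper name `open_if_fs`, the use of fstatfs() and its `f_type` magic for the check, the reopen through /proc/self/fd, and the EOPNOTSUPP error are all assumptions chosen for illustration; the message itself only names the steps.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/vfs.h>
#include <unistd.h>

/*
 * Open `path` with `flags` only if it lives on a file system whose
 * statfs magic equals `expected_magic`. Returns a usable descriptor,
 * or -1 with errno set. `flags` must not contain O_CREAT (no mode is
 * passed) or O_NOFOLLOW (the reopen goes through a /proc link).
 */
int open_if_fs(const char *path, unsigned long expected_magic, int flags)
{
    assert(path != NULL);

    /* Step 1: pin the file with O_PATH; no file-system-specific flag yet. */
    int pfd = open(path, O_PATH | O_CLOEXEC);
    if (pfd < 0)
        return -1;

    /* Step 2: check which file system the pinned file is really on. */
    struct statfs sfs;
    if (fstatfs(pfd, &sfs) < 0) {
        int saved = errno;
        close(pfd);
        errno = saved;
        return -1;
    }
    if ((unsigned long)sfs.f_type != expected_magic) {
        close(pfd);
        errno = EOPNOTSUPP;
        return -1;
    }

    /* Step 3: reopen the very same file, now with the caller's flags. */
    char proc_path[64];
    snprintf(proc_path, sizeof(proc_path), "/proc/self/fd/%d", pfd);
    int fd = open(proc_path, flags);
    int saved = errno;
    close(pfd);
    errno = saved;
    return fd;
}
```

Counting the calls (open, fstatfs, open, close) against a plain open followed by one ioctl makes the point of the message concrete: the flag-based route costs more system calls once the safety check is included.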