All of lore.kernel.org
 help / color / mirror / Atom feed
From: Al Viro <viro@ZenIV.linux.org.uk>
To: Christoph Hellwig <hch@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>, Sage Weil <sage@inktank.com>,
	Mark Fasheh <mfasheh@suse.com>,
	xfs@oss.sgi.com, Steve French <sfrench@samba.org>,
	Joel Becker <jlbec@evilplan.org>,
	linux-fsdevel@vger.kernel.org,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [PATCH 0/5] splice: locking changes and code refactoring
Date: Tue, 14 Jan 2014 17:20:33 +0000	[thread overview]
Message-ID: <20140114172033.GU10323@ZenIV.linux.org.uk> (raw)
In-Reply-To: <20140114132207.GA25170@infradead.org>

On Tue, Jan 14, 2014 at 05:22:07AM -0800, Christoph Hellwig wrote:
> On Mon, Jan 13, 2014 at 11:56:46PM +0000, Al Viro wrote:
> > On Mon, Jan 13, 2014 at 06:14:16AM -0800, Christoph Hellwig wrote:
> > > ping?  Would be nice to get this into 3.14
> > 
> > Umm...  The reason for pipe_lock outside of ->i_mutex is this:
> > default_file_splice_write() calls splice_from_pipe() with
> > write_pipe_buf for callback.  splice_from_pipe() calls that
> > callback under pipe_lock(pipe).  And write_pipe_buf() calls
> > __kernel_write(), which certainly might want to take ->i_mutex.
> > 
> > Now, this codepath isn't taken for files that have non-NULL
> > ->splice_write(), so that's not an issue for XFS and OCFS2,
> > but having pipe_lock nest between the ->i_mutex for filesystems
> > that do and do not have ->splice_write()...  Ouch...
> 
> What would be the alternative?  Duplicating the code in even more
> filesystems to enforce an non-natural locking order for filesystems
> actually implementing splice?  There don't actually seem to be a whole
> lot of real filesystems not implemting splice_write, the prime use
> would be for device drivers or synthetic ones.  I'm not even sure
> how much that fallback gets used in practice.

Actually, I wonder if splice should be allowed just because destination
has ->write() - for a lot of synthetic files the effect of writes really
depends on boundaries in output stream and I'm not sure if splice to
such a beast makes any sense.

Going through fs/*:
	* 9p: messy when we have O_DIRECT on target file.  Other than that,
probably could use generic_file_splice_write().
	* adfs, affs, afs, bfs, fat, hfs, hfs+, minix, sysv: can use
generic_file_splice_write()
	* /proc/fs/afs/*: probably shouldn't allow splice at all
	* binfmt_misc: shouldn't allow splice
	* btrfs: not sure, O_DIRECT complicates the picture again.
	* cachefiles_daemon_write(): probably shouldn't allow splice at all
	* ceph: uses generic_file_splice_write(), but I'm not sure whether
it is correct - there are interesting things done by ceph_aio_write() that
do not have any counterparts on that path.
	* cifs: trouble; might be switchable to generic, but
there's an interesting bit in the end of its ->aio_write() to consider...
In any case, should be doable as generic_file_splice_write() +
filemap_fdatawrite().
	* cifs mounted with strict_io - ouch.  Would need ->splice_write() of
its own with really interesting locking order.
	* cifs with direct_io - about the same story.  Er...  Just where
is it doing suid removal in that case, BTW?  cifs_user_writev() doesn't
seem to?
	* /proc/fs/cifs/*: shouldn't allow splice to it
	* coda: needs ->splice_write() with that patch, might or might not
be tricky
	* coda misc device: shouldn't allow splice to it
	* configfs: shouldn't allow splice to it
	* debugfs default fops: blackhole ->write(), might as well offer
blackhole ->splice_write().  Or refuse to allow splice to it, since I'd
bet that most of non-default ones are of the "shouldn't allow splice to it"
kind...
	* debugfs bool: shouldn't allow splice to it
	* dlm misc devices: shouldn't allow splice to it
	* ecryptfs: probably should use generic_file_splice_write()
(incidentally, who the hell has thought it would be a good idea to
put ->iterate (aka ->readdir) into file_operations of a non-directory?)
	* ecryptfs misc device: probably shouldn't allow splice to it at all
	* eventfd: shouldn't allow splice to it at all
	* XIP ext2: needs ->splice_write() with that patch

And that's less than half of fs/*...  I'm not saying that the current
situation on the write side is good; hell, just the mess with write/aio_write
alone is ugly enough - we have
	* a bunch of file_operations without ->aio_write(); simple enough.
	* a bunch with ->write == do_sync_write.  Also simple.
	* several with NULL ->write and non-NULL ->aio_write(); same as
do_sync_write() for ->write (socket, android/logger, kmsg, macvtap)
	* several with ->aio_write being an optimized equivalent of
do_sync_write() (blackhole for /dev/null and /dev/zero, error for bad_inode)
	* 9p cached with its "oh, but if we have O_DIRECT we want ->write()
to be different" (why not use a separate file_operations, then?  It's not
as if ->open() couldn't switch to it if it sees O_DIRECT...)
	* two infinibad things (ipath and qib), with completely unrelated
semantics on write(2) and writev(2) (the latter shared with aio).  As in
"writev() of a single-element iovec array does things that do not even
resemble what write() of the same data would've done".  Yes, really - check
for yourself.
	* snd_pcm - hell knows; it might be that it tries to collect the
data from iovec and push it in one go, as if it was a single write, but
then it might be something as bogus as what ipath is doing...
	* gadgetfs - hell knows; ep_write() seems to be doing something
beyond what ep_aio_write() does, but I haven't traced them down the call
chain...  That one, BTW, won't be fun for splice - looks like it cares
about datagram boundaries a lot, so it's not obvious what the semantics
should be.
	* lustre.  I _think_ do_sync_write() would work there, but I'm might
be easily missing something in all those layers of obfusca^Wgood software
development practices.

And ->splice_write() thrown into that mess, defaulting to "just do
->write() or ->aio_write(), everything writable should be able to cope
with splice to it" hasn't made it any prettier.  Unfortunately, what
you are proposing will make it worse - we'll have to grow a bunch of
->splice_write() instances, with non-trivial correspondence between
them and ->aio_write() ones...

It needs to be cleaned up, but it's nowhere near as simple as "just flip
the order of i_mutex and pipe_lock" ;-/

BTW, speaking of ->aio_write() - why the devil do we pass the pos
argument (long long, at that) separately, when all call sites provably
have it equal to iocb->ki_pos?  If nothing else, on a bunch of architectures
it makes the difference between passing arguments in registers and spilling
them on stack; moreover, if we do something and only then call
generic_file_aio_write(), we get to propagate it all way down.  And
generic_file_aio_write() has had explicit BUG_ON(iocb->ki_pos != pos)
since 2.5.55, for crying out loud...

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

WARNING: multiple messages have this Message-ID (diff)
From: Al Viro <viro@ZenIV.linux.org.uk>
To: Christoph Hellwig <hch@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>, Mark Fasheh <mfasheh@suse.com>,
	Joel Becker <jlbec@evilplan.org>,
	linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Sage Weil <sage@inktank.com>, Steve French <sfrench@samba.org>
Subject: Re: [PATCH 0/5] splice: locking changes and code refactoring
Date: Tue, 14 Jan 2014 17:20:33 +0000	[thread overview]
Message-ID: <20140114172033.GU10323@ZenIV.linux.org.uk> (raw)
In-Reply-To: <20140114132207.GA25170@infradead.org>

On Tue, Jan 14, 2014 at 05:22:07AM -0800, Christoph Hellwig wrote:
> On Mon, Jan 13, 2014 at 11:56:46PM +0000, Al Viro wrote:
> > On Mon, Jan 13, 2014 at 06:14:16AM -0800, Christoph Hellwig wrote:
> > > ping?  Would be nice to get this into 3.14
> > 
> > Umm...  The reason for pipe_lock outside of ->i_mutex is this:
> > default_file_splice_write() calls splice_from_pipe() with
> > write_pipe_buf for callback.  splice_from_pipe() calls that
> > callback under pipe_lock(pipe).  And write_pipe_buf() calls
> > __kernel_write(), which certainly might want to take ->i_mutex.
> > 
> > Now, this codepath isn't taken for files that have non-NULL
> > ->splice_write(), so that's not an issue for XFS and OCFS2,
> > but having pipe_lock nest between the ->i_mutex for filesystems
> > that do and do not have ->splice_write()...  Ouch...
> 
> What would be the alternative?  Duplicating the code in even more
> filesystems to enforce an non-natural locking order for filesystems
> actually implementing splice?  There don't actually seem to be a whole
> lot of real filesystems not implemting splice_write, the prime use
> would be for device drivers or synthetic ones.  I'm not even sure
> how much that fallback gets used in practice.

Actually, I wonder if splice should be allowed just because destination
has ->write() - for a lot of synthetic files the effect of writes really
depends on boundaries in output stream and I'm not sure if splice to
such a beast makes any sense.

Going through fs/*:
	* 9p: messy when we have O_DIRECT on target file.  Other than that,
probably could use generic_file_splice_write().
	* adfs, affs, afs, bfs, fat, hfs, hfs+, minix, sysv: can use
generic_file_splice_write()
	* /proc/fs/afs/*: probably shouldn't allow splice at all
	* binfmt_misc: shouldn't allow splice
	* btrfs: not sure, O_DIRECT complicates the picture again.
	* cachefiles_daemon_write(): probably shouldn't allow splice at all
	* ceph: uses generic_file_splice_write(), but I'm not sure whether
it is correct - there are interesting things done by ceph_aio_write() that
do not have any counterparts on that path.
	* cifs: trouble; might be switchable to generic, but
there's an interesting bit in the end of its ->aio_write() to consider...
In any case, should be doable as generic_file_splice_write() +
filemap_fdatawrite().
	* cifs mounted with strict_io - ouch.  Would need ->splice_write() of
its own with really interesting locking order.
	* cifs with direct_io - about the same story.  Er...  Just where
is it doing suid removal in that case, BTW?  cifs_user_writev() doesn't
seem to?
	* /proc/fs/cifs/*: shouldn't allow splice to it
	* coda: needs ->splice_write() with that patch, might or might not
be tricky
	* coda misc device: shouldn't allow splice to it
	* configfs: shouldn't allow splice to it
	* debugfs default fops: blackhole ->write(), might as well offer
blackhole ->splice_write().  Or refuse to allow splice to it, since I'd
bet that most of non-default ones are of the "shouldn't allow splice to it"
kind...
	* debugfs bool: shouldn't allow splice to it
	* dlm misc devices: shouldn't allow splice to it
	* ecryptfs: probably should use generic_file_splice_write()
(incidentally, who the hell has thought it would be a good idea to
put ->iterate (aka ->readdir) into file_operations of a non-directory?)
	* ecryptfs misc device: probably shouldn't allow splice to it at all
	* eventfd: shouldn't allow splice to it at all
	* XIP ext2: needs ->splice_write() with that patch

And that's less than half of fs/*...  I'm not saying that the current
situation on the write side is good; hell, just the mess with write/aio_write
alone is ugly enough - we have
	* a bunch of file_operations without ->aio_write(); simple enough.
	* a bunch with ->write == do_sync_write.  Also simple.
	* several with NULL ->write and non-NULL ->aio_write(); same as
do_sync_write() for ->write (socket, android/logger, kmsg, macvtap)
	* several with ->aio_write being an optimized equivalent of
do_sync_write() (blackhole for /dev/null and /dev/zero, error for bad_inode)
	* 9p cached with its "oh, but if we have O_DIRECT we want ->write()
to be different" (why not use a separate file_operations, then?  It's not
as if ->open() couldn't switch to it if it sees O_DIRECT...)
	* two infinibad things (ipath and qib), with completely unrelated
semantics on write(2) and writev(2) (the latter shared with aio).  As in
"writev() of a single-element iovec array does things that do not even
resemble what write() of the same data would've done".  Yes, really - check
for yourself.
	* snd_pcm - hell knows; it might be that it tries to collect the
data from iovec and push it in one go, as if it was a single write, but
then it might be something as bogus as what ipath is doing...
	* gadgetfs - hell knows; ep_write() seems to be doing something
beyond what ep_aio_write() does, but I haven't traced them down the call
chain...  That one, BTW, won't be fun for splice - looks like it cares
about datagram boundaries a lot, so it's not obvious what the semantics
should be.
	* lustre.  I _think_ do_sync_write() would work there, but I'm might
be easily missing something in all those layers of obfusca^Wgood software
development practices.

And ->splice_write() thrown into that mess, defaulting to "just do
->write() or ->aio_write(), everything writable should be able to cope
with splice to it" hasn't made it any prettier.  Unfortunately, what
you are proposing will make it worse - we'll have to grow a bunch of
->splice_write() instances, with non-trivial correspondence between
them and ->aio_write() ones...

It needs to be cleaned up, but it's nowhere near as simple as "just flip
the order of i_mutex and pipe_lock" ;-/

BTW, speaking of ->aio_write() - why the devil do we pass the pos
argument (long long, at that) separately, when all call sites provably
have it equal to iocb->ki_pos?  If nothing else, on a bunch of architectures
it makes the difference between passing arguments in registers and spilling
them on stack; moreover, if we do something and only then call
generic_file_aio_write(), we get to propagate it all way down.  And
generic_file_aio_write() has had explicit BUG_ON(iocb->ki_pos != pos)
since 2.5.55, for crying out loud...

  reply	other threads:[~2014-01-14 17:20 UTC|newest]

Thread overview: 82+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-12-12 18:14 [PATCH 0/5] splice: locking changes and code refactoring Christoph Hellwig
2013-12-12 18:14 ` Christoph Hellwig
2013-12-12 18:15 ` [PATCH 1/5] splice: move balance_dirty_pages_ratelimited into pipe_to_file Christoph Hellwig
2013-12-12 18:15   ` Christoph Hellwig
2013-12-12 18:15 ` [PATCH 2/5] splice: nest i_mutex outside pipe_lock Christoph Hellwig
2013-12-12 18:15   ` Christoph Hellwig
2013-12-12 18:15 ` [PATCH 3/5] splice: use splice_from_pipe in generic_file_splice_write Christoph Hellwig
2013-12-12 18:15   ` Christoph Hellwig
2013-12-12 18:15 ` [PATCH 4/5] xfs: fix splice_write locking Christoph Hellwig
2013-12-12 18:15   ` Christoph Hellwig
2013-12-12 18:15 ` [PATCH 5/5] splice: stop exporting splice_from_pipe implementation details Christoph Hellwig
2013-12-12 18:15   ` Christoph Hellwig
2014-01-13 14:14 ` [PATCH 0/5] splice: locking changes and code refactoring Christoph Hellwig
2014-01-13 14:14   ` Christoph Hellwig
2014-01-13 23:56   ` Al Viro
2014-01-13 23:56     ` Al Viro
2014-01-14 13:22     ` Christoph Hellwig
2014-01-14 13:22       ` Christoph Hellwig
2014-01-14 17:20       ` Al Viro [this message]
2014-01-14 17:20         ` Al Viro
2014-01-15 18:10         ` Al Viro
2014-01-15 18:10           ` Al Viro
2014-01-18  6:40         ` Al Viro
2014-01-18  7:22           ` Linus Torvalds
2014-01-18  7:22             ` Linus Torvalds
2014-01-18  7:46             ` Al Viro
2014-01-18  7:56               ` Al Viro
2014-01-18  7:56                 ` Al Viro
2014-01-18  8:27               ` Al Viro
2014-01-18  8:44                 ` David Miller
2014-01-18  8:44                   ` David Miller
2014-02-07 17:10                   ` Al Viro
2014-02-07 17:10                     ` Al Viro
2014-01-18 19:59               ` Linus Torvalds
2014-01-18 20:10                 ` Al Viro
2014-01-18 20:27                   ` Al Viro
2014-01-18 20:27                     ` Al Viro
2014-01-18 20:30                     ` Al Viro
2014-01-18 20:30                       ` Al Viro
2014-01-19  5:13                   ` [RFC] unifying write variants for filesystems Al Viro
2014-01-19  5:13                     ` Al Viro
2014-01-20 13:55                     ` Christoph Hellwig
2014-01-20 13:55                       ` Christoph Hellwig
2014-01-20 20:32                       ` Linus Torvalds
2014-01-20 20:32                         ` Linus Torvalds
2014-02-01 22:43                         ` Al Viro
2014-02-01 22:43                           ` Al Viro
2014-02-02  0:13                           ` Linus Torvalds
2014-02-02  2:02                             ` Al Viro
2014-02-02  2:02                               ` Al Viro
2014-02-02 19:21                           ` Al Viro
2014-02-02 19:21                             ` Al Viro
2014-02-02 19:23                             ` Al Viro
2014-02-02 19:23                               ` Al Viro
2014-02-03 14:41                             ` Miklos Szeredi
2014-02-03 14:41                               ` Miklos Szeredi
2014-02-03 15:33                               ` Al Viro
2014-02-03 15:33                                 ` Al Viro
2014-02-02 23:16                           ` Anton Altaparmakov
2014-02-02 23:16                             ` Anton Altaparmakov
2014-02-03 15:12                           ` Christoph Hellwig
2014-02-03 16:24                             ` Al Viro
2014-02-03 16:50                             ` Dave Kleikamp
2014-02-03 16:23                           ` Dave Kleikamp
2014-02-04 12:44                             ` Al Viro
2014-02-04 12:44                               ` Al Viro
2014-02-04 12:52                               ` Kent Overstreet
2014-02-04 12:52                                 ` Kent Overstreet
2014-02-04 15:17                                 ` Al Viro
2014-02-04 15:17                                   ` Al Viro
2014-02-04 17:27                                   ` Zach Brown
2014-02-04 17:35                                     ` Kent Overstreet
2014-02-04 18:08                                       ` Al Viro
2014-02-04 18:08                                         ` Al Viro
2014-02-04 18:00                                     ` Al Viro
2014-02-04 18:00                                       ` Al Viro
2014-02-04 18:33                                       ` Zach Brown
2014-02-04 18:36                                         ` Al Viro
2014-02-04 18:36                                           ` Al Viro
2014-02-05 19:58                                           ` Al Viro
2014-02-05 20:42                                             ` Zach Brown
2014-02-06  9:08                                             ` Kent Overstreet

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20140114172033.GU10323@ZenIV.linux.org.uk \
    --to=viro@zeniv.linux.org.uk \
    --cc=axboe@kernel.dk \
    --cc=hch@infradead.org \
    --cc=jlbec@evilplan.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=mfasheh@suse.com \
    --cc=sage@inktank.com \
    --cc=sfrench@samba.org \
    --cc=torvalds@linux-foundation.org \
    --cc=xfs@oss.sgi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.