All of lore.kernel.org
 help / color / mirror / Atom feed
From: Dave Kleikamp <dave.kleikamp@oracle.com>
To: Al Viro <viro@ZenIV.linux.org.uk>,
	Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jens Axboe <axboe@kernel.dk>, Steve French <sfrench@samba.org>,
	Sage Weil <sage@inktank.com>, Mark Fasheh <mfasheh@suse.com>,
	xfs@oss.sgi.com, Christoph Hellwig <hch@infradead.org>,
	Joel Becker <jlbec@evilplan.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Zach Brown <zab@zabbo.net>, Anton Altaparmakov <anton@tuxera.com>
Subject: Re: [RFC] unifying write variants for filesystems
Date: Mon, 03 Feb 2014 10:23:13 -0600	[thread overview]
Message-ID: <52EFC271.3090205@oracle.com> (raw)
In-Reply-To: <20140201224301.GS10323@ZenIV.linux.org.uk>

On 02/01/2014 04:43 PM, Al Viro wrote:
> On Mon, Jan 20, 2014 at 12:32:01PM -0800, Linus Torvalds wrote:
>> On Mon, Jan 20, 2014 at 5:55 AM, Christoph Hellwig <hch@infradead.org> wrote:
>>> On Sun, Jan 19, 2014 at 05:13:35AM +0000, Al Viro wrote:
>>>> Folks, what do you think about the following:
>>>
>>> That's very much what Shaggy did in the aio-direct tree, except that
>>> it kept using a single set of methods.  Linus really didn't like it
>>> unfortunately.
>>
>> Umm. That wasn't what I objected to.
>>
>> I objected to the incredibly ugly implementation, the crazy new flags
>> arguments for core helper functions, ugly naming etc etc. I even
>> outlined what might fix it.
>>
>> In other words, I thought the code was shit and ugly. Not that
>> iterators per se would be wrong. Just doing them badly is wrong.
> 
> Gyahhh...  OK, I should've known better than go looking into that thing
> after such warning.  Some relatively printable notes (i.e. about 10%
> of the comments I really had about that) follow:

Thanks for the feedback. I'd been asking for feedback on this patchset
for some time now, and have not received very much.

This is all based on some years-old work by Zach Brown that he probably
wishes would have disappeared by now. I pretty much left what I could
alone since 1) it was working, and 2) I didn't hear any objections
(until now).

It's clear now that the patchset isn't close to mergable, so treat it
like a proof-of-concept and we can come up with a better container and
read/write interface. I won't respond individually to your comments, but
will take them all into consideration going forward.

Thanks,
Shaggy

> * WTF bother passing 'pos' separately?  It's the same mistake that was
> made with ->aio_read/->aio_write and just as with those, *all* callers
> provably have pos == iocb->ki_pos.
> 
> * I'm not sure that iov_iter is a good fit here.  OTOH, it probably could
> be reused (it has damn few users right now and they are on the codepaths
> involved into that thing).
> 
> * We *definitely* want a variant structure with tag - unsigned long thing
> was just plain insane.  I see at least two variants - array of iovecs
> and array of (at least) triples <page, offset, length>.  Quite possibly -
> quadruples, with "here's how to try to steal this page" thrown in, if
> we want that as replacement for ->splice_write() as well (it looks like
> the few instances that do steal on pipe-to-file splices could be dealt
> with the same way as the dumb ones, provided that ->write_iter or whatever
> we end up calling it is allowed to try and steal pages).   Possibly more
> variants on the read side of things...  FWIW, I'm not sure that bio_vec
> makes a lot of sense here.
> 
> * direction (i.e. source vs. destination) also should be a part of that
> tag.  Which, BTW, turns ->direct_IO() into iocb x iov_iter -> int;
> the situation with pos is identical to aio_read/aio_write.  While we
> are at it, KERNEL_WRITE thing (used only for __swap_writepage()) looks
> very much like a special case of "array of <page,offset,size>" we want
> for splice_write...
> 
> * having looked through the ->read and ->write instances in drivers,
> I'd say that surprisingly few helpers would go a _long_ way towards
> converting those guys to the same methods.  And we need such helpers
> anyway - there's a whole lot of (badly) open-coded "copy the whole
> user buffer into kmalloc'ed array and NUL-terminate the sucker" logics
> in ->write() instances, for example.  Even more "copy up to that much
> into given array and NUL-terminate it", with rather amusing bugs in
> there - e.g. NUL going into the end of array, regardless of the actual
> amount of data to copy; junk is left in the middle, complete with
> printk of the entire thing if it doesn't make sense.  IOW, spewing
> random piece of kernel stack into dmesg.  Off-by-ones galore, too...
> 
> BTW, speaking of bogosities - grep for generic_write_checks().  Exactly
> one caller (in __generic_file_aio_write()) has any business looking
> at S_ISBLK(inode->i_mode) - it can be called by blkdev_aio_write().
> All other callers have copied that, even though it makes absolutely
> no sense for them...
> 
> * file_readable/file_writable looks really wrong; if nothing else, I would
> rather check that after open() and set a couple of FMODE bits, then check
> those.  Possibly even "knock FMODE_READ/FMODE_WRITE out if there's no
> method"...
>
> * pipe_buffer_operations ->map()/->unmap() should die; let the caller do
> k{,un}map{,_atomic}().  All instances have the same method there and
> there's no point to make it different.  PIPE_BUF_FLAG_ATOMIC should also
> go.
> 
> * WTF do iov_iter_copy_from_user_atomic() callers keep doing pagefault_disable
> and pagefault_enable around it?  The sucker starts with kmap_atomic() and
> ends with kunmap_atomic(); all instances of those guys have pagefaul
> disabling/enabling (and I suspect that it might make sense to lift it
> out of arch-specific variants - rename them to e.g. __kmap_atomic(),
> rip pagefault_disable() out of those and put kmap_atomic() into highmem.h
> outside of ifdef, with !HIGHMEM side of ifdef having #define __kmap_atomic
> page_address; move pagefault_enable() from __kunmap_atomic() to
> kunmap_atomic() while we are at it).
> 
> Note that e.g. pipe.c relies on kmap_atomic() disabling pagefaults - we
> have __copy_from_user_inatomic() done there under kmap_atomic(), and we
> really don't want to block in such conditions.
> 
> * pipe_iov_copy_from_user() ought to return how much has it managed to
> bring in, not 0 or -EFAULT as it does now.  Then it won't need the
> non-atomic side, AFAICS.  Moreover, we'll just be able to use
> iov_iter_copy_from_user_atomic() (which badly needs a more palatable
> name, BTW).
> 
> * sure, we want to call do_generic_file_read() once, passing the entire
> iovec to file_read_actor().  But what the hell does that have to do with
> introduction of new methods?  It's a change that makes sense on its
> own.  Moreover, it's a damn good preparation to adding those - we
> get generic_file_aio_read() into "set iov_iter up, then do <this>",
> with <this> using iov_iter instead of iovec.  Once we get to introducing
> those methods, it's just a matter of taking the rest of generic_file_aio_read()
> into a separate function and making that function an instance of new
> method...
> 
> * Unrelated to this patchset, but... may I politely inquire about the reasons
> why ->is_partially_uptodate() gets read_descriptor_t?  The whole point
> of read_descriptor_t (if it has any) is that its interpretation is up to
> whoever's doing the reading, _not_ what they are reading from.  So desc->arg
> is off-limits for any instance of ->is_partially_uptodate().  desc->written is
> obviously pointless for them; the need (or lack thereof) to do something
> to the page doesn't depend on how much have we already read from the
> file.  Moreover, reporting an error is rather dubious in such method;
> if there's something fishy with the page, we'll just return "no" and let
> ->readpage() complain.  Which leaves only desc->count, which, unsurprisingly,
> is the only thing looked at by (the only) instance of that method.  And
> "tell me if the part of this page that long starting from that offset is
> up to date" is much more natural that "is what this read_descriptor_t would
> have had us read from this offset in this page up to date?"...
> 
> * do we really need separate non-atomic variant of iov_iter_copy_from_user()?
> We have only two users right now (cifs and ceph) and both can use the
> fault-in / atomic copy loop without much pain...
> 
> * in addition to having too many methods, I'm not convinced that we want
> them to be methods.  Let's start with explicit checks in the primitives and
> see where it goes from there; if we start to grow too many variants,
> we can always introduce some methods, but then we'll be in better position
> to decide what is and what is not a good method...
> 
> * on the read side, I really don't believe that exposing atomic and
> non-atomic variants is a good idea.  Look at the existing users of
> __copy_to_user_inatomic(); leaving aside the i915_gem weirdness,
> all of them are used to implement the exact same thing: given a page,
> offset and length, feed its contents to iovec/buffer/whatnot.  Unlike
> the write side of things, there's nothing between prefaulting pages
> and actual attempts to copy.  So let's make _that_ an exposed primitive
> and let it deal with kmap/kmap_atomic/etc.  Variants that don't have
> to worry about blocking (vector of <page,offset,length>, etc.) would
> simply not bother with non-atomic kmap, etc.  Sure, it should take
> iov_iter as destination.  And deal with mapping the damn thing internally...
> 
> * ntfs_file_buffered_write() should switch to iov_iter as well.  It's
> open-coding a lot of iov_iter stuff.  It's not entirely trivial and
> I'd really like to hear from ntfs folks on that, though, and the current
> code looks deadlock-prone.  We prefault everything, then lock the pages
> to which we'll be copying, then attempt to do __copy_from_user_inatomic(),
> falling back to __copy_from_user() if that fails.  Fine, but what happens
> if the source of write() is mmaped from the same file and we lose CPU while
> prefaulting the last page, have memory pressure evict the first one, then
> have __ntfs_grab_cache_pages() allocate and lock (nonuptodate) pages
> we'll be copying to and have __copy_from_user() try to copy *from* those
> same pages?  We are doing it while holding these pages locked, so pagefault
> will have really fun time with them...  Anton?
> 
> BTW, Linus, when did you comment on that patchset?  I've found an iteration
> of that patchset circa last October (v9, apparently the latest posted),
> but it looks like your comments had either got lost or had been on the
> earlier iteration of that thing...
> 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

  parent reply	other threads:[~2014-02-03 16:24 UTC|newest]

Thread overview: 82+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-12-12 18:14 [PATCH 0/5] splice: locking changes and code refactoring Christoph Hellwig
2013-12-12 18:14 ` Christoph Hellwig
2013-12-12 18:15 ` [PATCH 1/5] splice: move balance_dirty_pages_ratelimited into pipe_to_file Christoph Hellwig
2013-12-12 18:15   ` Christoph Hellwig
2013-12-12 18:15 ` [PATCH 2/5] splice: nest i_mutex outside pipe_lock Christoph Hellwig
2013-12-12 18:15   ` Christoph Hellwig
2013-12-12 18:15 ` [PATCH 3/5] splice: use splice_from_pipe in generic_file_splice_write Christoph Hellwig
2013-12-12 18:15   ` Christoph Hellwig
2013-12-12 18:15 ` [PATCH 4/5] xfs: fix splice_write locking Christoph Hellwig
2013-12-12 18:15   ` Christoph Hellwig
2013-12-12 18:15 ` [PATCH 5/5] splice: stop exporting splice_from_pipe implementation details Christoph Hellwig
2013-12-12 18:15   ` Christoph Hellwig
2014-01-13 14:14 ` [PATCH 0/5] splice: locking changes and code refactoring Christoph Hellwig
2014-01-13 14:14   ` Christoph Hellwig
2014-01-13 23:56   ` Al Viro
2014-01-13 23:56     ` Al Viro
2014-01-14 13:22     ` Christoph Hellwig
2014-01-14 13:22       ` Christoph Hellwig
2014-01-14 17:20       ` Al Viro
2014-01-14 17:20         ` Al Viro
2014-01-15 18:10         ` Al Viro
2014-01-15 18:10           ` Al Viro
2014-01-18  6:40         ` Al Viro
2014-01-18  7:22           ` Linus Torvalds
2014-01-18  7:22             ` Linus Torvalds
2014-01-18  7:46             ` Al Viro
2014-01-18  7:56               ` Al Viro
2014-01-18  7:56                 ` Al Viro
2014-01-18  8:27               ` Al Viro
2014-01-18  8:44                 ` David Miller
2014-01-18  8:44                   ` David Miller
2014-02-07 17:10                   ` Al Viro
2014-02-07 17:10                     ` Al Viro
2014-01-18 19:59               ` Linus Torvalds
2014-01-18 20:10                 ` Al Viro
2014-01-18 20:27                   ` Al Viro
2014-01-18 20:27                     ` Al Viro
2014-01-18 20:30                     ` Al Viro
2014-01-18 20:30                       ` Al Viro
2014-01-19  5:13                   ` [RFC] unifying write variants for filesystems Al Viro
2014-01-19  5:13                     ` Al Viro
2014-01-20 13:55                     ` Christoph Hellwig
2014-01-20 13:55                       ` Christoph Hellwig
2014-01-20 20:32                       ` Linus Torvalds
2014-01-20 20:32                         ` Linus Torvalds
2014-02-01 22:43                         ` Al Viro
2014-02-01 22:43                           ` Al Viro
2014-02-02  0:13                           ` Linus Torvalds
2014-02-02  2:02                             ` Al Viro
2014-02-02  2:02                               ` Al Viro
2014-02-02 19:21                           ` Al Viro
2014-02-02 19:21                             ` Al Viro
2014-02-02 19:23                             ` Al Viro
2014-02-02 19:23                               ` Al Viro
2014-02-03 14:41                             ` Miklos Szeredi
2014-02-03 14:41                               ` Miklos Szeredi
2014-02-03 15:33                               ` Al Viro
2014-02-03 15:33                                 ` Al Viro
2014-02-02 23:16                           ` Anton Altaparmakov
2014-02-02 23:16                             ` Anton Altaparmakov
2014-02-03 15:12                           ` Christoph Hellwig
2014-02-03 16:24                             ` Al Viro
2014-02-03 16:50                             ` Dave Kleikamp
2014-02-03 16:23                           ` Dave Kleikamp [this message]
2014-02-04 12:44                             ` Al Viro
2014-02-04 12:44                               ` Al Viro
2014-02-04 12:52                               ` Kent Overstreet
2014-02-04 12:52                                 ` Kent Overstreet
2014-02-04 15:17                                 ` Al Viro
2014-02-04 15:17                                   ` Al Viro
2014-02-04 17:27                                   ` Zach Brown
2014-02-04 17:35                                     ` Kent Overstreet
2014-02-04 18:08                                       ` Al Viro
2014-02-04 18:08                                         ` Al Viro
2014-02-04 18:00                                     ` Al Viro
2014-02-04 18:00                                       ` Al Viro
2014-02-04 18:33                                       ` Zach Brown
2014-02-04 18:36                                         ` Al Viro
2014-02-04 18:36                                           ` Al Viro
2014-02-05 19:58                                           ` Al Viro
2014-02-05 20:42                                             ` Zach Brown
2014-02-06  9:08                                             ` Kent Overstreet

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=52EFC271.3090205@oracle.com \
    --to=dave.kleikamp@oracle.com \
    --cc=anton@tuxera.com \
    --cc=axboe@kernel.dk \
    --cc=hch@infradead.org \
    --cc=jlbec@evilplan.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=mfasheh@suse.com \
    --cc=sage@inktank.com \
    --cc=sfrench@samba.org \
    --cc=torvalds@linux-foundation.org \
    --cc=viro@ZenIV.linux.org.uk \
    --cc=xfs@oss.sgi.com \
    --cc=zab@zabbo.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.