From: Dave Kleikamp <dave.kleikamp@oracle.com>
To: Al Viro <viro@ZenIV.linux.org.uk>,
Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jens Axboe <axboe@kernel.dk>, Steve French <sfrench@samba.org>,
Sage Weil <sage@inktank.com>, Mark Fasheh <mfasheh@suse.com>,
xfs@oss.sgi.com, Christoph Hellwig <hch@infradead.org>,
Joel Becker <jlbec@evilplan.org>,
linux-fsdevel <linux-fsdevel@vger.kernel.org>,
Zach Brown <zab@zabbo.net>, Anton Altaparmakov <anton@tuxera.com>
Subject: Re: [RFC] unifying write variants for filesystems
Date: Mon, 03 Feb 2014 10:23:13 -0600 [thread overview]
Message-ID: <52EFC271.3090205@oracle.com> (raw)
In-Reply-To: <20140201224301.GS10323@ZenIV.linux.org.uk>
On 02/01/2014 04:43 PM, Al Viro wrote:
> On Mon, Jan 20, 2014 at 12:32:01PM -0800, Linus Torvalds wrote:
>> On Mon, Jan 20, 2014 at 5:55 AM, Christoph Hellwig <hch@infradead.org> wrote:
>>> On Sun, Jan 19, 2014 at 05:13:35AM +0000, Al Viro wrote:
>>>> Folks, what do you think about the following:
>>>
>>> That's very much what Shaggy did in the aio-direct tree, except that
>>> it kept using a single set of methods. Linus really didn't like it
>>> unfortunately.
>>
>> Umm. That wasn't what I objected to.
>>
>> I objected to the incredibly ugly implementation, the crazy new flags
>> arguments for core helper functions, ugly naming etc etc. I even
>> outlined what might fix it.
>>
>> In other words, I thought the code was shit and ugly. Not that
>> iterators per se would be wrong. Just doing them badly is wrong.
>
> Gyahhh... OK, I should've known better than go looking into that thing
> after such warning. Some relatively printable notes (i.e. about 10%
> of the comments I really had about that) follow:
Thanks for the feedback. I'd been asking for feedback on this patchset
for some time now, and have not received very much.
This is all based on some years-old work by Zach Brown that he probably
wishes would have disappeared by now. I pretty much left what I could
alone since 1) it was working, and 2) I didn't hear any objections
(until now).
It's clear now that the patchset isn't close to mergable, so treat it
like a proof-of-concept and we can come up with a better container and
read/write interface. I won't respond individually to your comments, but
will take them all into consideration going forward.
Thanks,
Shaggy
> * WTF bother passing 'pos' separately? It's the same mistake that was
> made with ->aio_read/->aio_write and just as with those, *all* callers
> provably have pos == iocb->ki_pos.
>
> * I'm not sure that iov_iter is a good fit here. OTOH, it probably could
> be reused (it has damn few users right now and they are on the codepaths
> involved into that thing).
>
> * We *definitely* want a variant structure with tag - unsigned long thing
> was just plain insane. I see at least two variants - array of iovecs
> and array of (at least) triples <page, offset, length>. Quite possibly -
> quadruples, with "here's how to try to steal this page" thrown in, if
> we want that as replacement for ->splice_write() as well (it looks like
> the few instances that do steal on pipe-to-file splices could be dealt
> with the same way as the dumb ones, provided that ->write_iter or whatever
> we end up calling it is allowed to try and steal pages). Possibly more
> variants on the read side of things... FWIW, I'm not sure that bio_vec
> makes a lot of sense here.
>
> * direction (i.e. source vs. destination) also should be a part of that
> tag. Which, BTW, turns ->direct_IO() into iocb x iov_iter -> int;
> the situation with pos is identical to aio_read/aio_write. While we
> are at it, KERNEL_WRITE thing (used only for __swap_writepage()) looks
> very much like a special case of "array of <page,offset,size>" we want
> for splice_write...
>
> * having looked through the ->read and ->write instances in drivers,
> I'd say that surprisingly few helpers would go a _long_ way towards
> converting those guys to the same methods. And we need such helpers
> anyway - there's a whole lot of (badly) open-coded "copy the whole
> user buffer into kmalloc'ed array and NUL-terminate the sucker" logics
> in ->write() instances, for example. Even more "copy up to that much
> into given array and NUL-terminate it", with rather amusing bugs in
> there - e.g. NUL going into the end of array, regardless of the actual
> amount of data to copy; junk is left in the middle, complete with
> printk of the entire thing if it doesn't make sense. IOW, spewing
> random piece of kernel stack into dmesg. Off-by-ones galore, too...
>
> BTW, speaking of bogosities - grep for generic_write_checks(). Exactly
> one caller (in __generic_file_aio_write()) has any business looking
> at S_ISBLK(inode->i_mode) - it can be called by blkdev_aio_write().
> All other callers have copied that, even though it makes absolutely
> no sense for them...
>
> * file_readable/file_writable looks really wrong; if nothing else, I would
> rather check that after open() and set a couple of FMODE bits, then check
> those. Possibly even "knock FMODE_READ/FMODE_WRITE out if there's no
> method"...
>
> * pipe_buffer_operations ->map()/->unmap() should die; let the caller do
> k{,un}map{,_atomic}(). All instances have the same method there and
> there's no point to make it different. PIPE_BUF_FLAG_ATOMIC should also
> go.
>
> * WTF do iov_iter_copy_from_user_atomic() callers keep doing pagefault_disable
> and pagefault_enable around it? The sucker starts with kmap_atomic() and
> ends with kunmap_atomic(); all instances of those guys have pagefaul
> disabling/enabling (and I suspect that it might make sense to lift it
> out of arch-specific variants - rename them to e.g. __kmap_atomic(),
> rip pagefault_disable() out of those and put kmap_atomic() into highmem.h
> outside of ifdef, with !HIGHMEM side of ifdef having #define __kmap_atomic
> page_address; move pagefault_enable() from __kunmap_atomic() to
> kunmap_atomic() while we are at it).
>
> Note that e.g. pipe.c relies on kmap_atomic() disabling pagefaults - we
> have __copy_from_user_inatomic() done there under kmap_atomic(), and we
> really don't want to block in such conditions.
>
> * pipe_iov_copy_from_user() ought to return how much has it managed to
> bring in, not 0 or -EFAULT as it does now. Then it won't need the
> non-atomic side, AFAICS. Moreover, we'll just be able to use
> iov_iter_copy_from_user_atomic() (which badly needs a more palatable
> name, BTW).
>
> * sure, we want to call do_generic_file_read() once, passing the entire
> iovec to file_read_actor(). But what the hell does that have to do with
> introduction of new methods? It's a change that makes sense on its
> own. Moreover, it's a damn good preparation to adding those - we
> get generic_file_aio_read() into "set iov_iter up, then do <this>",
> with <this> using iov_iter instead of iovec. Once we get to introducing
> those methods, it's just a matter of taking the rest of generic_file_aio_read()
> into a separate function and making that function an instance of new
> method...
>
> * Unrelated to this patchset, but... may I politely inquire about the reasons
> why ->is_partially_uptodate() gets read_descriptor_t? The whole point
> of read_descriptor_t (if it has any) is that its interpretation is up to
> whoever's doing the reading, _not_ what they are reading from. So desc->arg
> is off-limits for any instance of ->is_partially_uptodate(). desc->written is
> obviously pointless for them; the need (or lack thereof) to do something
> to the page doesn't depend on how much have we already read from the
> file. Moreover, reporting an error is rather dubious in such method;
> if there's something fishy with the page, we'll just return "no" and let
> ->readpage() complain. Which leaves only desc->count, which, unsurprisingly,
> is the only thing looked at by (the only) instance of that method. And
> "tell me if the part of this page that long starting from that offset is
> up to date" is much more natural that "is what this read_descriptor_t would
> have had us read from this offset in this page up to date?"...
>
> * do we really need separate non-atomic variant of iov_iter_copy_from_user()?
> We have only two users right now (cifs and ceph) and both can use the
> fault-in / atomic copy loop without much pain...
>
> * in addition to having too many methods, I'm not convinced that we want
> them to be methods. Let's start with explicit checks in the primitives and
> see where it goes from there; if we start to grow too many variants,
> we can always introduce some methods, but then we'll be in better position
> to decide what is and what is not a good method...
>
> * on the read side, I really don't believe that exposing atomic and
> non-atomic variants is a good idea. Look at the existing users of
> __copy_to_user_inatomic(); leaving aside the i915_gem weirdness,
> all of them are used to implement the exact same thing: given a page,
> offset and length, feed its contents to iovec/buffer/whatnot. Unlike
> the write side of things, there's nothing between prefaulting pages
> and actual attempts to copy. So let's make _that_ an exposed primitive
> and let it deal with kmap/kmap_atomic/etc. Variants that don't have
> to worry about blocking (vector of <page,offset,length>, etc.) would
> simply not bother with non-atomic kmap, etc. Sure, it should take
> iov_iter as destination. And deal with mapping the damn thing internally...
>
> * ntfs_file_buffered_write() should switch to iov_iter as well. It's
> open-coding a lot of iov_iter stuff. It's not entirely trivial and
> I'd really like to hear from ntfs folks on that, though, and the current
> code looks deadlock-prone. We prefault everything, then lock the pages
> to which we'll be copying, then attempt to do __copy_from_user_inatomic(),
> falling back to __copy_from_user() if that fails. Fine, but what happens
> if the source of write() is mmaped from the same file and we lose CPU while
> prefaulting the last page, have memory pressure evict the first one, then
> have __ntfs_grab_cache_pages() allocate and lock (nonuptodate) pages
> we'll be copying to and have __copy_from_user() try to copy *from* those
> same pages? We are doing it while holding these pages locked, so pagefault
> will have really fun time with them... Anton?
>
> BTW, Linus, when did you comment on that patchset? I've found an iteration
> of that patchset circa last October (v9, apparently the latest posted),
> but it looks like your comments had either got lost or had been on the
> earlier iteration of that thing...
>
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
next prev parent reply other threads:[~2014-02-03 16:23 UTC|newest]
Thread overview: 49+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-12-12 18:14 [PATCH 0/5] splice: locking changes and code refactoring Christoph Hellwig
2013-12-12 18:15 ` [PATCH 1/5] splice: move balance_dirty_pages_ratelimited into pipe_to_file Christoph Hellwig
2013-12-12 18:15 ` [PATCH 2/5] splice: nest i_mutex outside pipe_lock Christoph Hellwig
2013-12-12 18:15 ` [PATCH 3/5] splice: use splice_from_pipe in generic_file_splice_write Christoph Hellwig
2013-12-12 18:15 ` [PATCH 4/5] xfs: fix splice_write locking Christoph Hellwig
2013-12-12 18:15 ` [PATCH 5/5] splice: stop exporting splice_from_pipe implementation details Christoph Hellwig
2014-01-13 14:14 ` [PATCH 0/5] splice: locking changes and code refactoring Christoph Hellwig
2014-01-13 23:56 ` Al Viro
2014-01-14 13:22 ` Christoph Hellwig
2014-01-14 17:20 ` Al Viro
2014-01-15 18:10 ` Al Viro
2014-01-18 6:40 ` Al Viro
2014-01-18 7:22 ` Linus Torvalds
2014-01-18 7:46 ` Al Viro
2014-01-18 7:56 ` Al Viro
2014-01-18 8:27 ` Al Viro
2014-01-18 8:44 ` David Miller
2014-02-07 17:10 ` Al Viro
2014-01-18 19:59 ` Linus Torvalds
2014-01-18 20:10 ` Al Viro
2014-01-18 20:27 ` Al Viro
2014-01-18 20:30 ` Al Viro
2014-01-19 5:13 ` [RFC] unifying write variants for filesystems Al Viro
2014-01-20 13:55 ` Christoph Hellwig
2014-01-20 20:32 ` Linus Torvalds
2014-02-01 22:43 ` Al Viro
2014-02-02 0:13 ` Linus Torvalds
2014-02-02 2:02 ` Al Viro
2014-02-02 19:21 ` Al Viro
2014-02-02 19:23 ` Al Viro
2014-02-03 14:41 ` Miklos Szeredi
2014-02-03 15:33 ` Al Viro
2014-02-02 23:16 ` Anton Altaparmakov
2014-02-03 15:12 ` Christoph Hellwig
2014-02-03 16:24 ` Al Viro
2014-02-03 16:50 ` Dave Kleikamp
2014-02-03 16:23 ` Dave Kleikamp [this message]
2014-02-04 12:44 ` Al Viro
2014-02-04 12:52 ` Kent Overstreet
2014-02-04 15:17 ` Al Viro
2014-02-04 17:27 ` Zach Brown
2014-02-04 17:35 ` Kent Overstreet
2014-02-04 18:08 ` Al Viro
2014-02-04 18:00 ` Al Viro
2014-02-04 18:33 ` Zach Brown
2014-02-04 18:36 ` Al Viro
2014-02-05 19:58 ` Al Viro
2014-02-05 20:42 ` Zach Brown
2014-02-06 9:08 ` Kent Overstreet
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=52EFC271.3090205@oracle.com \
--to=dave.kleikamp@oracle.com \
--cc=anton@tuxera.com \
--cc=axboe@kernel.dk \
--cc=hch@infradead.org \
--cc=jlbec@evilplan.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=mfasheh@suse.com \
--cc=sage@inktank.com \
--cc=sfrench@samba.org \
--cc=torvalds@linux-foundation.org \
--cc=viro@ZenIV.linux.org.uk \
--cc=xfs@oss.sgi.com \
--cc=zab@zabbo.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).