linux-fsdevel.vger.kernel.org archive mirror
From: Al Viro <viro@ZenIV.linux.org.uk>
To: Mike Marshall <hubcap@omnibond.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>
Subject: Re: updated orangefs tree at kernel.org
Date: Thu, 8 Oct 2015 22:07:45 +0100	[thread overview]
Message-ID: <20151008210745.GW22011@ZenIV.linux.org.uk> (raw)
In-Reply-To: <CAOg9mSTOaEUR76NT-ZD6Msvq=aMfPPhMX6AhMc1SCxQCj_ncNA@mail.gmail.com>

	Something's fishy around pvfs2_inode_read().  Look:

* ->readpage (== read_one_page) is called in normal context; no set_fs()
in sight.

* read_one_page() does kmap() of the page it had been given.  The result is
a kernel pointer.  That pointer is (mis)annotated as a userland one and
passed to pvfs2_inode_read().

* pvfs2_inode_read() shoves it into struct iovec and passes that to
wait_for_direct_io() (type == PVFS_IO_READ).  From there it gets
passed to postcopy_buffers(), where we do
                iov_iter_init(&iter, READ, vec, nr_segs, total_size);
                ret = pvfs_bufmap_copy_to_iovec(bufmap, 
                                                &iter,
                                                buffer_index);
which does copy_page_to_iter().

* after that iov_iter_init() we have iter.type equal to ITER_IOVEC | READ,
so copy_page_to_iter() uses __copy_to_user_inatomic() or falls back to
__copy_to_user().  Giving a kernel pointer to either is going to break on
any architecture where separate MMU contexts are used for kernel and userland.
On x86 we get away with that, but e.g. on sparc64 we won't.

More to the point, why do you bother with kmap() at all?  Just set up an
ITER_BVEC iov_iter over that struct page and let copy_page_to_iter() do the
right thing when it comes to it.  IOW, have an iov_iter passed to
pvfs2_inode_read() / wait_for_direct_io() / postcopy_buffers() and be done
with it - have read_one_page() do
	struct iov_iter iter;
	struct bio_vec bv = {.bv_page = page, .bv_len = PAGE_SIZE, .bv_offset = 0};
	iov_iter_bvec(&iter, ITER_BVEC | READ, &bv, 1, PAGE_SIZE);

	max_block = ((inode->i_size / blocksize) + 1);

	if (page->index < max_block) {
		loff_t blockptr_offset = (((loff_t) page->index) << blockbits);

		bytes_read = pvfs2_inode_read(inode,
					      &iter,
					      blocksize,
					      &blockptr_offset,
					      inode->i_size);
	}
	/* will only zero remaining unread portions of the page data */
	iov_iter_zero(~0U, &iter);

and then proceed as you currently do, except that no kunmap() is needed
at all.

precopy_buffers() would also need to take iov_iter, and the other caller
of wait_for_direct_io() (i.e. do_readv_writev()) will need to pass it
an iov_iter as well.

That said, do_readv_writev() also needs some taking care of.  Both callers
give it iter->iov (which, BTW, is only valid for ITER_IOVEC iterators),
and then play very odd games with splitting iovecs when the total size exceeds
your pvfs_bufmap_size_query().  What for?

Sure, I understand not wanting to submit IO in chunks exceeding your limit.
But why do you need a separate iovec each time?  One thing that iov_iter
does is handling the copying to/from partially used iovecs.  copy_page_to_iter
and copy_page_from_iter are just fine with that.  IOW, why bother with
split_iovecs() at all?

wait_for_direct_io() uses 'vec' and 'nr_segs' only for passing them to
{pre,post}copy_buffers(), where they are used only to initialize an iov_iter.
And since copy_page_{to,from}_iter() advances the iterator, there's no
need to reinitialize the damn thing at all.  Just pass the amount to copy
as an explicit argument, rather than using iov_iter_count() in
pvfs_bufmap_copy_to_iovec().

Am I missing anything subtle in there?  Looks like do_readv_writev() would
get much simpler from that *and* ->read_iter()/->write_iter() will work
on arbitrary iov_iter after that...

Thread overview: 14+ messages
2015-10-07 20:07 updated orangefs tree at kernel.org Mike Marshall
2015-10-08 21:07 ` Al Viro [this message]
2015-10-08 23:03   ` Al Viro
2015-10-09  3:41     ` Al Viro
2015-10-09  6:13       ` Al Viro
2015-10-09 19:22       ` Al Viro
2015-10-10  2:31     ` Al Viro
2015-10-10 20:23       ` Al Viro
2015-10-10 23:10       ` orangefs: NAK until the ABI is documented (was Re: updated orangefs tree at kernel.org) Al Viro
2015-10-12 16:20         ` Mike Marshall
2015-10-12 19:16           ` Al Viro
2015-10-13 17:46             ` Mike Marshall
2015-10-13 23:34               ` Al Viro
  -- strict thread matches above, loose matches on Subject: below --
2015-11-16 20:01 updated orangefs tree at kernel.org Mike Marshall
