public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Christoph Hellwig <hch@infradead.org>
To: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>,
	David Howells <dhowells@redhat.com>,
	willy@infradead.org, dchinner@redhat.com,
	Steve French <smfrench@gmail.com>,
	Shyam Prasad N <nspmangalore@gmail.com>,
	Rohith Surabattula <rohiths.msft@gmail.com>,
	Jeff Layton <jlayton@kernel.org>, Ira Weiny <ira.weiny@intel.com>,
	torvalds@linux-foundation.org, linux-cifs@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	jlayton@redhat.com
Subject: Re: How to convert I/O iterators to iterators, sglists and RDMA lists
Date: Tue, 1 Nov 2022 06:51:53 -0700	[thread overview]
Message-ID: <Y2EkeULBA3zsiarf@infradead.org> (raw)
In-Reply-To: <Y1btOP0tyPtcYajo@ZenIV>

On Mon, Oct 24, 2022 at 08:53:28PM +0100, Al Viro wrote:
> 1) iter-to-scatterlist use is much wider than RDMA.  Other places like that
> include e.g. vhost_scsi_map_to_sgl(), p9_get_mapped_pages(),
> rds_message_zcopy_from_user(), tls_setup_from_iter()...

RDS is RDMA.  vhost_scsi_map_to_sgl and p9_get_mapped_pages do some
odd virtio thing.  But point taken, it is spread further than it should
be at the moment.  It is however a rather bad data structure that really
should not spead much further.

> 2) there's a limit to how far we can propagate an arbitrary iov_iter -
> ITER_IOVEC/ITER_UBUF ones are absolutely tied to mm_struct of the
> originating process.  We can't use them for anything async - not
> without the horrors a-la use_mm().

But why would you pass them on?  It is much better to just convert
them to a bio_vec and pass that on.  We could still feed that to n
iter later, and in fact there are a bunch of good reasons to do so.
But in pretty much all those cases you really do not want to keep
the whole iov_iter state.

> 	We can do separate sendmsg() for kvec and bvec parts,
> but that doesn't come for free either.  *AND* bvec part is very
> likely not the original iterator we got those pages from.

sendmsg model seems to be very much built around that model with
MSG_MORE.  But even with a 'converter' how do you plan to build
such a mixed iter anyay?

> My problem with all that stuff is that we ought to sort out the
> lifetime and pin_user issues around the iov_iter.  What I really
> want to avoid is "no worries, we'd extracted stuff into ITER_BVEC, it's
> stable and can be passed around in arbitrary way" kind of primitive.
> Because *that* has no chance to work.

Yes.  I think the first thing we need in this whole area is to sort
the pinning out.  After that we can talk about all kinds of convenience
helpers.

> As far as I can see, we have the following constraints:
> 
> 	* page references put into ITER_BVEC (and ITER_XARRAY) must not
> go away while the iov_iter is being used.  That's on the creator of
> iov_iter.

*nod*

> 	* pages found in iterator might be used past the lifetime of
> iterator.  We need the underlying pages to survive until the last
> use.  "Grab a page reference" is *NOT* a solution in general case.
> 	* pages found in data-destination iterator may have their
> contents modified, both during the iterator lifetime and asynchronously.

This is where the trouble start.  If you want to be able to feed
kmalloced data into throgh ITER_KVEC (or ITER_BVEC for the matter),
you can't just grab any kind of hold to it.  The only way to do that
is by telling the caller you're done with it.  I.e. how aio/io_ring/etc
use ki_complete - the callee owns the data until it declares it is done
by calling ->ki_complete.  But no 'borrowing' of refeferences as the
only sane way to do that would be page refcounts, but those do not
work for everything.

> If it has a chance to be a user-mapped page, we must either
> 	a) have it locked by caller and have no modifications after
> it gets unlocked or
> 	b) have it pinned (sensu pin_user_pages()) by the caller and
> have no modifications until the unpin_user_page().

Yes.  And I think we need a good counter part to iov_iter_pin_pages
that undoes any required pinning, so that users of iov_iter_pin_pages
and iov_iter_unpin_pages can use these helpers without even thinking
about the rules.  That requires passing some amount of state to the
unpin side.  It could just be an unsigned long with flags probably,
or we keep the iov_iter alive and look at that.

> Another issue with iov_iter_get_pages...() is that compound page turns
> into a bunch of references to individual subpages; io-uring folks have
> noticed the problem, but their solution is... inelegant.  I wonder if
> we would be better off with a variant of the primitive that would give
> out compound pages; it would need different calling conventions,
> obviously (current ones assume that all pages except the first and
> the last one have PAGE_SIZE worth of data in them).

The new name for compound pages is folios, and yes the whole get/pin
user pages machinery needs to switch to that.

      parent reply	other threads:[~2022-11-01 13:52 UTC|newest]

Thread overview: 33+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-10-14 15:26 How to convert I/O iterators to iterators, sglists and RDMA lists David Howells
2022-10-17 13:15 ` Christoph Hellwig
2022-10-20 14:03   ` David Howells
2022-10-21  3:30     ` Ira Weiny
2022-10-24 14:51       ` Christoph Hellwig
2022-10-24 14:57     ` Christoph Hellwig
2022-10-24 19:53       ` Al Viro
2022-10-28  2:33         ` [PATCH v2 01/12] get rid of unlikely() on page_copy_sane() calls Al Viro
2022-10-28  2:33           ` [PATCH v2 02/12] csum_and_copy_to_iter(): handle ITER_DISCARD Al Viro
2022-10-28  2:33           ` [PATCH v2 03/12] [s390] copy_oldmem_kernel() - WRITE is "data source", not destination Al Viro
2022-10-28  2:33           ` [PATCH v2 04/12] [fsi] " Al Viro
2022-10-28  2:33           ` [PATCH v2 05/12] [infiniband] READ is "data destination", not source Al Viro
2022-10-28  2:33           ` [PATCH v2 06/12] [s390] zcore: WRITE is "data source", not destination Al Viro
2022-10-28  2:33           ` [PATCH v2 07/12] [s390] memcpy_real(): " Al Viro
2022-10-28  2:33           ` [PATCH v2 08/12] [target] fix iov_iter_bvec() "direction" argument Al Viro
2022-10-28  2:33           ` [PATCH v2 09/12] [vhost] fix 'direction' argument of iov_iter_{init,bvec}() Al Viro
2022-10-28  2:33           ` [PATCH v2 10/12] [xen] fix "direction" argument of iov_iter_kvec() Al Viro
2022-10-28 12:48             ` John Stoffel
2022-10-28 12:49               ` John Stoffel
2022-10-28  2:33           ` [PATCH v2 11/12] iov_iter: saner checks for attempt to copy to/from iterator Al Viro
2022-10-28  2:33           ` [PATCH v2 12/12] use less confusing names for iov_iter direction initializers Al Viro
2022-10-28 16:41             ` Linus Torvalds
2022-10-28 17:02               ` David Howells
2022-10-28 17:09                 ` Linus Torvalds
2022-10-28 17:15               ` Al Viro
2022-10-28 18:35                 ` Linus Torvalds
2022-10-28 19:30                   ` Al Viro
2022-10-28 20:34                     ` Linus Torvalds
2022-10-30  5:01                       ` Al Viro
2022-10-30  8:12           ` [PATCH v2 01/12] get rid of unlikely() on page_copy_sane() calls Christoph Hellwig
2022-10-28 17:31         ` How to convert I/O iterators to iterators, sglists and RDMA lists David Howells
2022-11-04 18:47           ` David Howells
2022-11-01 13:51         ` Christoph Hellwig [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Y2EkeULBA3zsiarf@infradead.org \
    --to=hch@infradead.org \
    --cc=dchinner@redhat.com \
    --cc=dhowells@redhat.com \
    --cc=ira.weiny@intel.com \
    --cc=jlayton@kernel.org \
    --cc=jlayton@redhat.com \
    --cc=linux-cifs@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=nspmangalore@gmail.com \
    --cc=rohiths.msft@gmail.com \
    --cc=smfrench@gmail.com \
    --cc=torvalds@linux-foundation.org \
    --cc=viro@zeniv.linux.org.uk \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox