public inbox for linux-cifs@vger.kernel.org
 help / color / mirror / Atom feed
From: Ira Weiny <ira.weiny@intel.com>
To: David Howells <dhowells@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>,
	Al Viro <viro@zeniv.linux.org.uk>, <willy@infradead.org>,
	<dchinner@redhat.com>, Steve French <smfrench@gmail.com>,
	Shyam Prasad N <nspmangalore@gmail.com>,
	"Rohith Surabattula" <rohiths.msft@gmail.com>,
	Jeff Layton <jlayton@kernel.org>, <torvalds@linux-foundation.org>,
	<linux-cifs@vger.kernel.org>, <linux-fsdevel@vger.kernel.org>,
	<linux-kernel@vger.kernel.org>
Subject: Re: How to convert I/O iterators to iterators, sglists and RDMA lists
Date: Thu, 20 Oct 2022 20:30:34 -0700	[thread overview]
Message-ID: <Y1ISWla50g5gHax6@iweiny-desk3> (raw)
In-Reply-To: <1415915.1666274636@warthog.procyon.org.uk>

On Thu, Oct 20, 2022 at 03:03:56PM +0100, David Howells wrote:
> Christoph Hellwig <hch@infradead.org> wrote:
> 
> > >  (1) Async direct I/O.
> > > 
> > >      In the async case direct I/O, we cannot hold on to the iterator when we
> > >      return, even if the operation is still in progress (ie. we return
> > >      EIOCBQUEUED), as it is likely to be on the caller's stack.
> > > 
> > >      Also, simply copying the iterator isn't sufficient as virtual userspace
> > >      addresses cannot be trusted and we may have to pin the pages that
> > >      comprise the buffer.
> > 
> > This is very related to the discussion we are having related to pinning
> > for O_DIRECT with Ira and Al.
> 
> Do you have a link to that discussion?  I don't see anything obvious on
> fsdevel including Ira.

I think Christoph meant to say John Hubbard.

> 
> I do see a discussion involving iov_iter_pin_pages, but I don't see Ira
> involved in that.

This one?

https://lore.kernel.org/all/20220831041843.973026-5-jhubbard@nvidia.com/

I've been casually reading it but not directly involved.

Ira

> 
> > What block file systems do is to take the pages from the iter and some flags
> > on what is pinned.  We can generalize this to store all extra state in a
> > flags word, or byte the bullet and allow cloning of the iter in one form or
> > another.
> 
> Yeah, I know.  A list of pages is not an ideal solution.  It can only handle
> contiguous runs of pages, possibly with a partial page at either end.  A bvec
> iterator would be of more use as it can handle a series of partial pages.
> 
> Note also that I would need to turn the pages *back* into an iterator in order
> to commune with sendmsg() in the nether reaches of some network filesystems.
> 
> > >  (2) Crypto.
> > > 
> > >      The crypto interface takes scatterlists, not iterators, so we need to
> > >      be able to convert an iterator into a scatterlist in order to do
> > >      content encryption within netfslib.  Doing this in netfslib makes it
> > >      easier to store content-encrypted files encrypted in fscache.
> > 
> > Note that the scatterlist is generally a pretty bad interface.  We've
> > been talking for a while to have an interface that takes a page array
> > as an input and return an array of { dma_addr, len } tuples.  Thinking
> > about it taking in an iter might actually be an even better idea.
> 
> It would be nice to be able to pass an iterator to the crypto layer.  I'm not
> sure what the crypto people think of that.
> 
> > >  (3) RDMA.
> > > 
> > >      To perform RDMA, a buffer list needs to be presented as a QPE array.
> > >      Currently, cifs converts the iterator it is given to lists of pages,
> > >      then each list to a scatterlist and thence to a QPE array.  I have
> > >      code to pass the iterator down to the bottom, using an intermediate
> > >      BVEC iterator instead of a page list if I can't pass down the
> > >      original directly (eg. an XARRAY iterator on the pagecache), but I
> > >      still end up converting it to a scatterlist, which is then converted
> > >      to a QPE.  I'm trying to go directly from an iterator to a QPE array,
> > >      thus avoiding the need to allocate an sglist.
> > 
> > I'm not sure what you mean with QPE.  The fundamental low-level
> > interface in RDMA is the ib_sge.
> 
> Sorry, yes. ib_sge array.  I think it appears as QPs on the wire.
> 
> > If you feed it to RDMA READ/WRITE requests the interface for that is the
> > RDMA R/W API in drivers/infiniband/core/rw.c, which currently takes a
> > scatterlist but to which all of the above remarks on DMA interface apply.
> > For RDMA SEND that ULP has to do a dma_map_single/page to fill it, which is
> > a quite horrible layering violation and should move into the driver, but
> > that is going to a massive change to the whole RDMA subsystem, so unlikely
> > to happen anytime soon.
> 
> In cifs, as it is upstream, in RDMA transmission, the iterator is converted
> into a clutch of pages in the top, which is converted back into iterators
> (smbd_send()) and those into scatterlists (smbd_post_send_data()), thence into
> sge lists (see smbd_post_send_sgl()).
> 
> I have patches that pass an iterator (which it decants to a bvec if async) all
> the way down to the bottom layer.  Snippets are then converted to scatterlists
> and those to sge lists.  I would like to skip the scatterlist intermediate and
> convert directly to sge lists.
> 
> On the other hand, if you think the RDMA API should be taking scatterlists
> rather than sge lists, that would be fine.  Even better if I can just pass an
> iterator in directly - though neither scatterlist nor iterator has a place to
> put the RDMA local_dma_key - though I wonder if that's actually necessary for
> each sge element, or whether it could be handed through as part of the request
> as a hole.
> 
> > Neither case has anything to do with what should be in common iov_iter
> > code, all this needs to live in the RDMA subsystem as a consumer.
> 
> That's fine in principle.  However, I have some extraction code that can
> convert an iterator to another iterator, an sglist or an rdma sge list, using
> a common core of code to do all three.
> 
> I can split it up if that is preferable.
> 
> Do you have code that's ready to be used?  I can make immediate use of it.
> 
> David
> 

  reply	other threads:[~2022-10-21  3:30 UTC|newest]

Thread overview: 33+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-10-14 15:26 How to convert I/O iterators to iterators, sglists and RDMA lists David Howells
2022-10-17 13:15 ` Christoph Hellwig
2022-10-20 14:03   ` David Howells
2022-10-21  3:30     ` Ira Weiny [this message]
2022-10-24 14:51       ` Christoph Hellwig
2022-10-24 14:57     ` Christoph Hellwig
2022-10-24 19:53       ` Al Viro
2022-10-28  2:33         ` [PATCH v2 01/12] get rid of unlikely() on page_copy_sane() calls Al Viro
2022-10-28  2:33           ` [PATCH v2 02/12] csum_and_copy_to_iter(): handle ITER_DISCARD Al Viro
2022-10-28  2:33           ` [PATCH v2 03/12] [s390] copy_oldmem_kernel() - WRITE is "data source", not destination Al Viro
2022-10-28  2:33           ` [PATCH v2 04/12] [fsi] " Al Viro
2022-10-28  2:33           ` [PATCH v2 05/12] [infiniband] READ is "data destination", not source Al Viro
2022-10-28  2:33           ` [PATCH v2 06/12] [s390] zcore: WRITE is "data source", not destination Al Viro
2022-10-28  2:33           ` [PATCH v2 07/12] [s390] memcpy_real(): " Al Viro
2022-10-28  2:33           ` [PATCH v2 08/12] [target] fix iov_iter_bvec() "direction" argument Al Viro
2022-10-28  2:33           ` [PATCH v2 09/12] [vhost] fix 'direction' argument of iov_iter_{init,bvec}() Al Viro
2022-10-28  2:33           ` [PATCH v2 10/12] [xen] fix "direction" argument of iov_iter_kvec() Al Viro
2022-10-28 12:48             ` John Stoffel
2022-10-28 12:49               ` John Stoffel
2022-10-28  2:33           ` [PATCH v2 11/12] iov_iter: saner checks for attempt to copy to/from iterator Al Viro
2022-10-28  2:33           ` [PATCH v2 12/12] use less confusing names for iov_iter direction initializers Al Viro
2022-10-28 16:41             ` Linus Torvalds
2022-10-28 17:02               ` David Howells
2022-10-28 17:09                 ` Linus Torvalds
2022-10-28 17:15               ` Al Viro
2022-10-28 18:35                 ` Linus Torvalds
2022-10-28 19:30                   ` Al Viro
2022-10-28 20:34                     ` Linus Torvalds
2022-10-30  5:01                       ` Al Viro
2022-10-30  8:12           ` [PATCH v2 01/12] get rid of unlikely() on page_copy_sane() calls Christoph Hellwig
2022-10-28 17:31         ` How to convert I/O iterators to iterators, sglists and RDMA lists David Howells
2022-11-04 18:47           ` David Howells
2022-11-01 13:51         ` Christoph Hellwig

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Y1ISWla50g5gHax6@iweiny-desk3 \
    --to=ira.weiny@intel.com \
    --cc=dchinner@redhat.com \
    --cc=dhowells@redhat.com \
    --cc=hch@infradead.org \
    --cc=jlayton@kernel.org \
    --cc=linux-cifs@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=nspmangalore@gmail.com \
    --cc=rohiths.msft@gmail.com \
    --cc=smfrench@gmail.com \
    --cc=torvalds@linux-foundation.org \
    --cc=viro@zeniv.linux.org.uk \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox