Re: [PATCH v1 00/16] NFS/RDMA patches proposed for 4.1

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
To: Tom Talpey <tom@talpey.com>
Cc: Christoph Hellwig <hch@infradead.org>,
	Chuck Lever <chuck.lever@oracle.com>,
	Linux NFS Mailing List <linux-nfs@vger.kernel.org>,
	linux-rdma@vger.kernel.org, Steve French <smfrench@gmail.com>
Subject: Re: [PATCH v1 00/16] NFS/RDMA patches proposed for 4.1
Date: Wed, 6 May 2015 10:20:05 -0600	[thread overview]
Message-ID: <20150506162005.GA11331@obsidianresearch.com> (raw)
In-Reply-To: <55495D41.5090502@talpey.com>

On Tue, May 05, 2015 at 08:16:01PM -0400, Tom Talpey wrote:

> >The specific use-case of a RDMA to/from a logical linear region broken
> >up into HW pages is incredibly kernel specific, and very friendly to
> >hardware support.
> >
> >Heck, on modern systems 100% of these requirements can be solved just
> >by using the IOMMU. No need for the HCA at all. (HCA may be more
> >performant, of course)
> 
> I don't agree on "100%", because IOMMUs don't have the same protection
> attributes as RDMA adapters (local R, local W, remote R, remote W).

No, you do get protection - the IOMMU isn't the only resource, it would
still have to be combined with several pre-setup MR's that have the
proper protection attributes. You'd map the page list into the address
space that is covered by a MR that has the protection attributes
needed.

> Also they don't support handles for page lists quite like
> STags/RMRs, so they require additional (R)DMA scatter/gather. But, I
> agree with your point that they translate addresses just great.

??? the entire point of using the IOMMU in a context like this is to
linearize the page list into DMA'able address. How could you ever need
to scatter/gather when your memory is linear?

> >'post outbound rdma send/write of page region'
> 
> A bunch of writes followed by a send is a common sequence, but not
> very complex (I think).

So, I wasn't clear, I mean a general API that can post a SEND or RDMA
WRITE using a logically linear page list as the data source.

So this results in one of:
 1) A SEND with a gather list
 2) A SEND with a temporary linearized MR
 3) A series of RDMA WRITE with gather lists
 4) A RDMA WRITE with a temporary linearized MR

Picking one depends on the performance of the HCA and the various
features it supports. Even just the really simple options of #1 and #3
become a bit more complex when you want to take advantage of
transparent huge pages to reduce gather list length.

For instance, deciding when to trade off 3 vs 4 is going to be very
driver specific..

> >'prepare inbound rdma write of page region'
> 
> This is memory registration, with remote writability. That's what
> the rpcrdma_register_external() API in xprtrdma/verbs.c does. It
> takes a private rpcrdma structure, but it supports multiple memreg
> strategies and pretty much does what you expect. I'm sure someone
> could abstract it upward.

Right, most likely an implementation would just pull the NFS code into
the core, I think it is the broadest version we have?

> >'complete X'
> 
> This is trickier - invalidation has many interesting error cases.
> But, on a sunny day with the breeze at our backs, sure.

I don't mean send+invalidate, this is the 'free' for the 'alloc' the
above APIs might need (ie the temporary MR). You can't fail to free
the MR - that would be an insane API :)

> If Linux upper layers considered adopting a similar approach by
> carefully inserting RDMA operations conditionally, it can make
> the lower layer's job much more efficient. And, efficiency is speed.
> And in the end, the API throughout the stack will be simpler.

No idea for Linux. It seems to me most of the use cases we are talking
about here not actually assuming a socket, NFS-RDMA, SRP, iSER, Lustre
are all explicitly driving verbs and explicity working with pages
lists for their high speed side.

Does that mean we are already doing what you are talking about?

Jason

WARNING: multiple messages have this Message-ID (diff)

From: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
To: Tom Talpey <tom-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org>
Cc: Christoph Hellwig <hch-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>,
	Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>,
	Linux NFS Mailing List
	<linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Steve French <smfrench-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Subject: Re: [PATCH v1 00/16] NFS/RDMA patches proposed for 4.1
Date: Wed, 6 May 2015 10:20:05 -0600	[thread overview]
Message-ID: <20150506162005.GA11331@obsidianresearch.com> (raw)
In-Reply-To: <55495D41.5090502-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org>

On Tue, May 05, 2015 at 08:16:01PM -0400, Tom Talpey wrote:

> >The specific use-case of a RDMA to/from a logical linear region broken
> >up into HW pages is incredibly kernel specific, and very friendly to
> >hardware support.
> >
> >Heck, on modern systems 100% of these requirements can be solved just
> >by using the IOMMU. No need for the HCA at all. (HCA may be more
> >performant, of course)
> 
> I don't agree on "100%", because IOMMUs don't have the same protection
> attributes as RDMA adapters (local R, local W, remote R, remote W).

No, you do get protection - the IOMMU isn't the only resource, it would
still have to be combined with several pre-setup MR's that have the
proper protection attributes. You'd map the page list into the address
space that is covered by a MR that has the protection attributes
needed.

> Also they don't support handles for page lists quite like
> STags/RMRs, so they require additional (R)DMA scatter/gather. But, I
> agree with your point that they translate addresses just great.

??? the entire point of using the IOMMU in a context like this is to
linearize the page list into DMA'able address. How could you ever need
to scatter/gather when your memory is linear?

> >'post outbound rdma send/write of page region'
> 
> A bunch of writes followed by a send is a common sequence, but not
> very complex (I think).

So, I wasn't clear, I mean a general API that can post a SEND or RDMA
WRITE using a logically linear page list as the data source.

So this results in one of:
 1) A SEND with a gather list
 2) A SEND with a temporary linearized MR
 3) A series of RDMA WRITE with gather lists
 4) A RDMA WRITE with a temporary linearized MR

Picking one depends on the performance of the HCA and the various
features it supports. Even just the really simple options of #1 and #3
become a bit more complex when you want to take advantage of
transparent huge pages to reduce gather list length.

For instance, deciding when to trade off 3 vs 4 is going to be very
driver specific..

> >'prepare inbound rdma write of page region'
> 
> This is memory registration, with remote writability. That's what
> the rpcrdma_register_external() API in xprtrdma/verbs.c does. It
> takes a private rpcrdma structure, but it supports multiple memreg
> strategies and pretty much does what you expect. I'm sure someone
> could abstract it upward.

Right, most likely an implementation would just pull the NFS code into
the core, I think it is the broadest version we have?

> >'complete X'
> 
> This is trickier - invalidation has many interesting error cases.
> But, on a sunny day with the breeze at our backs, sure.

I don't mean send+invalidate, this is the 'free' for the 'alloc' the
above APIs might need (ie the temporary MR). You can't fail to free
the MR - that would be an insane API :)

> If Linux upper layers considered adopting a similar approach by
> carefully inserting RDMA operations conditionally, it can make
> the lower layer's job much more efficient. And, efficiency is speed.
> And in the end, the API throughout the stack will be simpler.

No idea for Linux. It seems to me most of the use cases we are talking
about here not actually assuming a socket, NFS-RDMA, SRP, iSER, Lustre
are all explicitly driving verbs and explicity working with pages
lists for their high speed side.

Does that mean we are already doing what you are talking about?

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

next prev parent reply	other threads:[~2015-05-06 16:20 UTC|newest]

Thread overview: 56+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-03-13 21:21 [PATCH v1 00/16] NFS/RDMA patches proposed for 4.1 Chuck Lever
2015-03-13 21:21 ` [PATCH v1 01/16] xprtrdma: Display IPv6 addresses and port numbers correctly Chuck Lever
2015-03-13 21:21 ` [PATCH v1 02/16] xprtrdma: Perform a full marshal on retransmit Chuck Lever
2015-03-13 21:21 ` [PATCH v1 03/16] xprtrdma: Add vector of ops for each memory registration strategy Chuck Lever
2015-03-13 21:21 ` [PATCH v1 04/16] xprtrdma: Add a "max_payload" op for each memreg mode Chuck Lever
2015-03-13 21:22 ` [PATCH v1 05/16] xprtrdma: Add a "register_external" " Chuck Lever
2015-03-13 21:22 ` [PATCH v1 06/16] xprtrdma: Add a "deregister_external" " Chuck Lever
2015-03-17 14:37   ` Anna Schumaker
2015-03-17 15:04     ` Chuck Lever
2015-03-13 21:22 ` [PATCH v1 07/16] xprtrdma: Add "init MRs" memreg op Chuck Lever
2015-03-13 21:22 ` [PATCH v1 08/16] xprtrdma: Add "reset " Chuck Lever
2015-03-13 21:22 ` [PATCH v1 09/16] xprtrdma: Add "destroy " Chuck Lever
2015-03-13 21:22 ` [PATCH v1 10/16] xprtrdma: Add "open" " Chuck Lever
2015-03-17 15:16   ` Anna Schumaker
2015-03-17 15:19     ` Chuck Lever
2015-03-13 21:23 ` [PATCH v1 11/16] xprtrdma: Handle non-SEND completions via a callout Chuck Lever
2015-03-13 21:23 ` [PATCH v1 12/16] xprtrdma: Acquire FMRs in rpcrdma_fmr_register_external() Chuck Lever
2015-03-13 21:23 ` [PATCH v1 13/16] xprtrdma: Acquire MRs in rpcrdma_register_external() Chuck Lever
2015-03-13 21:23 ` [PATCH v1 14/16] xprtrdma: Remove rpcrdma_ia::ri_memreg_strategy Chuck Lever
2015-03-13 21:23 ` [PATCH v1 15/16] xprtrdma: Make rpcrdma_{un}map_one() into inline functions Chuck Lever
2015-03-13 21:23 ` [PATCH v1 16/16] xprtrdma: Split rb_lock Chuck Lever
2015-05-05 15:44 ` [PATCH v1 00/16] NFS/RDMA patches proposed for 4.1 Christoph Hellwig
2015-05-05 15:44   ` Christoph Hellwig
2015-05-05 16:04   ` Chuck Lever
2015-05-05 16:04     ` Chuck Lever
2015-05-05 17:25     ` Christoph Hellwig
2015-05-05 17:25       ` Christoph Hellwig
2015-05-05 18:14       ` Tom Talpey
2015-05-05 18:14         ` Tom Talpey
2015-05-05 19:10         ` Christoph Hellwig
2015-05-05 19:10           ` Christoph Hellwig
2015-05-05 20:57           ` Tom Talpey
2015-05-05 20:57             ` Tom Talpey
2015-05-05 21:06             ` Christoph Hellwig
2015-05-05 21:06               ` Christoph Hellwig
2015-05-05 21:32               ` Tom Talpey
2015-05-05 21:32                 ` Tom Talpey
2015-05-05 22:38                 ` Jason Gunthorpe
2015-05-05 22:38                   ` Jason Gunthorpe
2015-05-06  0:16                   ` Tom Talpey
2015-05-06  0:16                     ` Tom Talpey
2015-05-06 16:20                     ` Jason Gunthorpe [this message]
2015-05-06 16:20                       ` Jason Gunthorpe
2015-05-06  7:01                   ` Bart Van Assche
2015-05-06  7:01                     ` Bart Van Assche
2015-05-06 16:38                     ` Jason Gunthorpe
2015-05-06 16:38                       ` Jason Gunthorpe
2015-05-06  7:33                 ` Christoph Hellwig
2015-05-06  7:33                   ` Christoph Hellwig
2015-05-06  7:09               ` Bart Van Assche
2015-05-06  7:09                 ` Bart Van Assche
2015-05-06  7:29                 ` Christoph Hellwig
2015-05-06  7:29                   ` Christoph Hellwig
2015-05-06 12:15       ` Sagi Grimberg
2015-05-06 12:15         ` Sagi Grimberg
  -- strict thread matches above, loose matches on Subject: below --
2015-03-13 21:26 Chuck Lever

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150506162005.GA11331@obsidianresearch.com \
    --to=jgunthorpe@obsidianresearch.com \
    --cc=chuck.lever@oracle.com \
    --cc=hch@infradead.org \
    --cc=linux-nfs@vger.kernel.org \
    --cc=linux-rdma@vger.kernel.org \
    --cc=smfrench@gmail.com \
    --cc=tom@talpey.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.