* [RFC] zero-copy extensions for rsockets
@ 2012-07-31 18:18 Hefty, Sean
From: Hefty, Sean @ 2012-07-31 18:18 UTC
To: linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)
Cc: Christoph Lameter (christoph-zt5rKe7wo/JBDgjK7y7TUQ@public.gmane.org),
Greg KH (gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org)
Before implementing this, I'm looking for feedback. The following proposal defines user-space APIs to support zero-copy. The intent is that the use of these extensions is fully compatible with existing calls, allowing applications to make selective use of them. Although I'm specifically looking at these calls for rsockets, I tried to make these generic enough that they could apply to a wider variety of technologies.
- Sean
--
/* Define option/flag to indicate asynchronous operation */
#define O_ASYNC ... /* fcntl option */
#define MSG_ASYNC ... /* send/recv flag */
/*
* ioq - fd used to report asynchronous completions.
* sockets/fd's report asynchronous events through an associated ioq
* ioq is usable with standard calls - fcntl, select, poll, read, etc.
*/
int ioq_create(int flags);
int ioq_add(int ioq, int fd, int flags);
int ioq_del(int ioq, int fd);
/* Reading from an ioq returns this structure */
struct ioq_event {
int fd;
int operation; /* IOREAD, IOWRITE, etc. */
int result; /* e.g. bytes transferred */
int errno;
void *ptr; /* context, e.g. address */
};
/* Register memory for zero-copy. */
off_t iomap(int fd, void *addr, size_t len, int prot, int flags, off_t offset);
int iounmap(int fd, off_t offset, size_t len);
/*
* Zero-copy read and write calls.
* If fd is nonblocking, then the operation must be asynchronous.
*/
size_t get(int fd, void *buf, size_t count, off_t offset, int flags);
size_t put(int fd, const void *buf, size_t count, off_t offset, int flags);
/* Technology specific operation */
int submit(int fd,
int operation, /* IOREAD, IOWRITE, vendor defined, etc. */
void *context,
void *request, /* request structure varies by operation */
size_t len, /* size of request structure */
int flags);
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: [RFC] zero-copy extensions for rsockets
@ 2012-07-31 18:32 Jason Gunthorpe
From: Jason Gunthorpe @ 2012-07-31 18:32 UTC
To: Hefty, Sean
Cc: linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
Christoph Lameter (christoph-zt5rKe7wo/JBDgjK7y7TUQ@public.gmane.org),
Greg KH (gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org)

On Tue, Jul 31, 2012 at 06:18:40PM +0000, Hefty, Sean wrote:
> Before implementing this, I'm looking for feedback. The following
> proposal defines user-space APIs to support zero-copy. The intent
> is that the use of these extensions is fully compatible with
> existing calls, allowing applications to make selective use of them.
> Although I'm specifically looking at these calls for rsockets, I
> tried to make these generic enough that they could apply to a wider
> variety of technologies.

This looks very similar to the libaio interface.

Jason
* RE: [RFC] zero-copy extensions for rsockets
@ 2012-07-31 20:33 Hefty, Sean
From: Hefty, Sean @ 2012-07-31 20:33 UTC
To: Jason Gunthorpe
Cc: linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
Christoph Lameter (christoph-zt5rKe7wo/JBDgjK7y7TUQ@public.gmane.org),
Greg KH (gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org)

> This looks very similar to the libaio interface..

I did look at aio. It may be possible to use aio context in place
of ioq, and I'm open to that. I was actually modeling ioq more
after epoll than aio. It just seemed simpler to treat an ioq as a
standard fd.

For the get/put calls, there's no requirement to use asynchronous or
nonblocking I/O. When asynchronous operations are used, restricting
each socket to a single, persistent ioq thingy simplifies the
implementation by making the mapping between an ioq and HW CQs easier
to manage.

My concern is that supporting a more flexible API, like aio, would
effectively result in losing some desirable features for handling
completions, such as kernel bypass or reducing interrupts. With
aio, I'm unsure about the impact of supporting callback
notifications and the selection of each aio context on a per request
basis.

- Sean
* Re: [RFC] zero-copy extensions for rsockets
@ 2012-07-31 21:34 Jason Gunthorpe
From: Jason Gunthorpe @ 2012-07-31 21:34 UTC
To: Hefty, Sean
Cc: linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
Christoph Lameter (christoph-zt5rKe7wo/JBDgjK7y7TUQ@public.gmane.org),
Greg KH (gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org)

On Tue, Jul 31, 2012 at 08:33:49PM +0000, Hefty, Sean wrote:
> > This looks very similar to the libaio interface..
>
> I did look at aio. It may be possible to use aio context in place
> of ioq, and I'm open to that. I was actually modeling ioq more
> after epoll than aio. It just seemed simpler to treat an ioq as a
> standard fd.

libaio is designed to be used along with an eventfd that provides the
epoll-like semantics you are talking about. Before each io_submit you
can call io_set_eventfd() on the iocb, and the aio engine will trigger
that eventfd when the IO completes. You then poll or epoll on the
eventfd.

> My concern is that supporting a more flexible API, like aio, would
> effectively result in losing some desirable feature handling
> completions, such as kernel bypass or reducing interrupts. With
> aio, I'm unsure about the impact of supporting callback
> notifications and the selection of each aio context on a per request
> basis.

I'm not sure what you are referring to here? Are you mixing up POSIX
aio with libaio? They are totally different. libaio has no callback
notification mechanism, just io_getevents.

Jason
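
[Editorial sketch, not part of the original message.] The eventfd pattern described above can be illustrated with only eventfd(2) and poll(2). This is a minimal sketch that assumes Linux; the helper names signal_completions() and wait_completions() are hypothetical, and the write() stands in for the aio engine, which in a real libaio program would signal the eventfd itself once io_set_eventfd() has been called on the iocb.

```c
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: add n to the eventfd counter. In a real libaio
 * program the kernel does this when an iocb tied to the eventfd completes.
 */
int signal_completions(int efd, uint64_t n)
{
    return write(efd, &n, sizeof(n)) == (ssize_t)sizeof(n) ? 0 : -1;
}

/*
 * Hypothetical helper: wait up to timeout_ms for the eventfd to become
 * readable, then drain it. Returns the number of completions signaled
 * since the last drain, 0 on timeout, or -1 on error.
 */
int64_t wait_completions(int efd, int timeout_ms)
{
    struct pollfd pfd = { .fd = efd, .events = POLLIN };
    uint64_t count;
    int ready = poll(&pfd, 1, timeout_ms);

    if (ready < 0)
        return -1;
    if (ready == 0)
        return 0;
    /* Reading a non-semaphore eventfd returns the counter and resets it. */
    if (read(efd, &count, sizeof(count)) != (ssize_t)sizeof(count))
        return -1;
    return (int64_t)count;
}
```

Because the eventfd is an ordinary fd, it can sit in the same poll or epoll set as sockets, which is the property the ioq proposal was after.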
* RE: [RFC] zero-copy extensions for rsockets
@ 2012-07-31 22:46 Hefty, Sean
From: Hefty, Sean @ 2012-07-31 22:46 UTC
To: Jason Gunthorpe
Cc: linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
Christoph Lameter (christoph-zt5rKe7wo/JBDgjK7y7TUQ@public.gmane.org),
Greg KH (gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org)

> libaio is designed to be used along with an eventfd that provides the
> epoll like semantics you are talking about. Each time you call
> io_submit you can call io_set_eventfd() on the iocb and the aio engine
> will trigger that eventfd when the IO completes. poll or epoll on the
> eventfd fd.

A search for io_set_eventfd() turned up several references, several
of which refer to it as "undocumented". IMO, having aio simply
return an fd, rather than an abstract data type coupled with an
undocumented function, would have been a much simpler way of
designing aio to work with epoll/select/poll. :P

> > My concern is that supporting a more flexible API, like aio, would
> > effectively result in losing some desirable feature handling
> > completions, such as kernel bypass or reducing interrupts. With
> > aio, I'm unsure about the impact of supporting callback
> > notifications and the selection of each aio context on a per request
> > basis.
>
> I'm not sure what you are referring to here? Are you mixing up POSIX
> aio with libaio?

Possibly - I find different information based on looking for 'io' vs
'aio', though the differences are usually minor.

Here are the calls I'm looking at from the man pages:

int io_setup(unsigned nr_events, aio_context_t *ctxp);
vs
int io_queue_init(int maxevents, io_context_t *ctx);

int io_submit(aio_context_t ctx_id, long nr, struct iocb **iocbpp);
or
int io_submit(io_context_t ctx, long nr, struct iocb *iocbs[]);

void io_set_callback(struct iocb *iocb, io_callback_t cb);

etc.

Maybe I'm confused about the intent of io_set_callback when
comparing it to the POSIX aio documentation, but the documentation
for io_set_callback isn't helping me here.

In any case, the aio calls associate an fd with an [a]io_context on
each read/write. Since RDMA devices associate each send or receive
queue with exactly 1 CQ, this makes it difficult to map an
[a]io_context to a set of CQs. The API that I think would work well
for this type of device is one where an aio_context/ioq thingy would
easily map to one or a small set of CQs (say, one per device), with
each socket/fd having a fixed association to an ioq for its lifetime.
This is where I see a mismatch with aio.

Separately from aio, do you see issues with iomap/iounmap/get/put?

- Sean
* Re: [RFC] zero-copy extensions for rsockets
@ 2012-07-31 23:15 Jason Gunthorpe
From: Jason Gunthorpe @ 2012-07-31 23:15 UTC
To: Hefty, Sean
Cc: linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
Christoph Lameter (christoph-zt5rKe7wo/JBDgjK7y7TUQ@public.gmane.org),
Greg KH (gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org)

On Tue, Jul 31, 2012 at 10:46:22PM +0000, Hefty, Sean wrote:
> > libaio is designed to be used along with an eventfd that provides the
> > epoll like semantics you are talking about. Each time you call
> > io_submit you can call io_set_eventfd() on the iocb and the aio engine
> > will trigger that eventfd when the IO completes. poll or epoll on the
> > eventfd fd.
>
> A search for io_set_eventfd() turned up several references, several
> of which refer to it as "undocumented". IMO, having aio simply
> return an fd, rather than an abstract data type coupled with an
> undocumented function, would have been a much simpler way of
> designing aio to work with epoll/select/poll. :P

Well, this is how it ended up: eventfd was added to the interface
after it was accepted into mainline. It is actually quite easy to use
and does have the added flexibility of mapping different completions
to different 'CQs'.

> > I'm not sure what you are referring to here? Are you mixing up POSIX
> > aio with libaio?
>
> possibly - I find different information based on looking for 'io' vs
> 'aio', though the differences are usually minor.
>
> Here are the calls I'm looking at from the man pages:
>
> int io_setup(unsigned nr_events, aio_context_t *ctxp);
> vs
> int io_queue_init(int maxevents, io_context_t *ctx);
>
> int io_submit(aio_context_t ctx_id, long nr, struct iocb **iocbpp);
> or
> int io_submit(io_context_t ctx, long nr, struct iocb *iocbs[]);
>
> void io_set_callback(struct iocb *iocb, io_callback_t cb);

Right, that is the libaio interface.

> Maybe I'm confused about the intent of io_set_callback when
> comparing it to the POSIX aio documentation, but the documentation
> for io_set_callback isn't helping me here.

io_set_callback is only used in conjunction with io_queue_run, which
itself is just a wrapper around io_getevents that calls the function
pointer stored in the data member for each completion.

io_set_callback/io_queue_run does not seem to me to be a very useful
interface; I've certainly never wanted to use it.

> The API that I think would work well for these type of devices is
> one where an aio_context/ioq thingy would easily map to one or a
> small set of CQs (say, one per device), with each socket/fd having a
> fixed association to an ioq for its lifetime. This is where I see a
> mismatch with aio.

I'm not sure that is so great. One of the benefits of the aio
interface is you have just one queue and one eventfd to manage, no
matter how many fd's you are AIOing against. Completions can happen
out of order. Requiring an app to juggle multiple ioq thingies split
on some arbitrary axis (i.e. by HCA, in particular) is very ugly from
a user perspective.

Matching IB WCs to io_context_t/iocb shouldn't be too hard, just an
encoding in the wr_id, and it similarly shouldn't be too difficult to
keep track of which CQs to poll on an io_getevents.

What I would see as much more difficult is how to match your streaming
RDMA WRITE ring algorithm used for synchronous read/write with
asynchronous read/write and direct placement. That seems pretty
complicated.

> Separately from aio, do you see issues with iomap/iounmap/get/put?

I'm not sure what semantics you are going for here? Is get/put the
same as an AIO read/write, or are they RDMA? How does it work if one
side is using read/write and the other does get/put? Are there two
things here? async read/write and the get/put RDMAish stuff?

At a minimum I think you'd want to prefix these names with rsockets_,
since they are very likely to collide with something else.

But, is this valuable? If people are going to have to do lots of
rework to support these calls would they just be better off using
something like CCI?

Jason
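
[Editorial sketch, not part of the original message.] The remark that matching work completions to an io_context_t/iocb is "just an encoding in the wr_id" could look roughly like the following. The 64-bit width of wr_id comes from the verbs API; everything else is an assumption made for illustration: the 16/48 bit split, the idea of a context index and a request slot, and the function names are all hypothetical.

```c
#include <stdint.h>

/* Assumed layout: top 16 bits name the aio context, low 48 bits the request. */
#define REQ_BITS 48u
#define REQ_MASK ((UINT64_C(1) << REQ_BITS) - 1)

/* Pack a context index and a request slot into a 64-bit work request id. */
uint64_t wr_id_encode(uint16_t ctx_index, uint64_t req_slot)
{
    return ((uint64_t)ctx_index << REQ_BITS) | (req_slot & REQ_MASK);
}

/* Recover the context index from a completed work request id. */
uint16_t wr_id_ctx(uint64_t wr_id)
{
    return (uint16_t)(wr_id >> REQ_BITS);
}

/* Recover the request slot from a completed work request id. */
uint64_t wr_id_req(uint64_t wr_id)
{
    return wr_id & REQ_MASK;
}
```

On a completion, the poller would read wr_id from the work completion, look up the context with wr_id_ctx(), and hand the request identified by wr_id_req() to that context's event queue. A request slot wider than 48 bits is silently truncated in this sketch.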
* RE: [RFC] zero-copy extensions for rsockets
@ 2012-08-01 0:15 Hefty, Sean
From: Hefty, Sean @ 2012-08-01 0:15 UTC
To: Jason Gunthorpe
Cc: linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
Christoph Lameter (christoph-zt5rKe7wo/JBDgjK7y7TUQ@public.gmane.org),
Greg KH (gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org)

> I'm not sure that is so great, one of the benefits of the aio
> interface is you have just one queue and one eventfd to manage, no
> matter how many fd's you are AIOing against. Completions can happen
> out of order. Requiring an app to juggle multiple ioq thingies split
> on some arbitrary axis (ie by HCA, in particular) is very ugly from a
> user perspective.

I'm only referring to the interface. aio allows a user to create any
number of aio contexts, with the ability to direct every read/write to
a different context. Sure, a user can use just a single queue and
eventfd, but that's not required.

I was suggesting a more restrictive interface: one where a socket is
bound to exactly one aio context, or at most two with sends and
receives defined separately. That way, it's known which CQ to poll.

> What I would see as much more difficult is how to match your streaming
> RDMA WRITE ring algorithm used for synchronous read/write with
> asynchronous read/write and direct placement. That seems pretty
> complicated.

I would expect sent data to appear in the stream in the same order
that the calls are made. Likewise, reads would complete in order.

> I'm not sure what semantics you are going for here? Is get/put the
> same as a AIO read/write, or are they RDMA? How does it work if one
> side is using read/write and the other does get/put? Are there two
> things here? async read/write and the get/put RDMAish stuff?

Mapping to RDMA:

iomap - register memory and publish address/key to remote side
iounmap - unregister memory
get - RDMA read
put - RDMA write

There are different things here. The primary goal is to add usable
zero-copy support to rsockets. iomap/put/get are intended to address
that on the receive side. (In the case of get, the initiator is also
the receiver.) Asynchronous completions are intended to address this
on the send side.

A call like put can behave similarly to write wrt a blocking or
nonblocking socket. However, get doesn't make any sense as a
nonblocking call without asynchronous completions. If asynchronous
support is added for get/put, then it makes sense to extend that
functionality to any data transfer call. But there's no requirement
on the application to use it for other calls. It probably works out
better if they don't.

> At a minimum I think you'd want to prefix these names with rsockets_,
> since they are very likely to collide with something else.

Yes - there would be a prefix.

> But, is this valuable? If people are going to have to do lots of
> rework to support these calls would they just be better off using
> something like CCI?

Personally, I've heard a lot of different developers ask specifically
for simple socket extensions to support RDMA. The entire goal of
rsockets is to minimize the changes needed for an application to use
RDMA devices. So, I agree, if the solution requires a large amount of
rework, it's not worth it. But if we can provide a small number of
calls that a user can *selectively* use throughout their application
that do avoid memory copies, then I believe there's significant value
in doing so.

The application does not need to change how it sets up connections.
Most of its communication can remain as-is, e.g. using read/write for
small messages. But it now has the ability to integrate zero-copy
calls by one side calling iomap and the other side put.

Such an option sounds substantially better than having to write to an
entirely new API, such as verbs. Plus it enables an iterative approach
to migrating to zero-copy calls.
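
[Editorial sketch, not part of the original message.] To make the "one side calls iomap, the other side put" pattern concrete, here is a toy in-process model of the offset addressing. It is an assumption-laden illustration and does not implement the proposed API: the toy_ names are hypothetical, the fd, prot, and flags arguments are dropped, both "sides" live in one process, and memcpy() stands in for the RDMA read or write, so nothing here is actually zero-copy. It also assumes that the offset returned by iomap is what the peer passes to put/get, which the thread does not spell out.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

#define TOY_MAX_REGIONS 8

struct toy_region {
    char  *addr;
    size_t len;
    off_t  offset;  /* handle the peer uses to name this region */
    int    in_use;
};

static struct toy_region toy_regions[TOY_MAX_REGIONS];
static off_t toy_next_offset;

/* "Register" [addr, addr + len); returns the region's base offset or -1. */
off_t toy_iomap(void *addr, size_t len)
{
    for (int i = 0; i < TOY_MAX_REGIONS; i++) {
        if (!toy_regions[i].in_use) {
            toy_regions[i] = (struct toy_region){ addr, len, toy_next_offset, 1 };
            toy_next_offset += (off_t)len;
            return toy_regions[i].offset;
        }
    }
    return -1;
}

/* "Unregister" the region that was mapped at exactly (offset, len). */
int toy_iounmap(off_t offset, size_t len)
{
    for (int i = 0; i < TOY_MAX_REGIONS; i++) {
        struct toy_region *r = &toy_regions[i];
        if (r->in_use && r->offset == offset && r->len == len) {
            r->in_use = 0;
            return 0;
        }
    }
    return -1;
}

/* Translate [offset, offset + count) to a local address, or NULL. */
static char *toy_resolve(off_t offset, size_t count)
{
    for (int i = 0; i < TOY_MAX_REGIONS; i++) {
        struct toy_region *r = &toy_regions[i];
        if (!r->in_use || offset < r->offset)
            continue;
        size_t delta = (size_t)(offset - r->offset);
        if (delta <= r->len && count <= r->len - delta)
            return r->addr + delta;
    }
    return NULL;
}

/* Stand-in for an RDMA write into the mapped region. */
ssize_t toy_put(const void *buf, size_t count, off_t offset)
{
    char *dst = toy_resolve(offset, count);
    if (!dst)
        return -1;
    memcpy(dst, buf, count);
    return (ssize_t)count;
}

/* Stand-in for an RDMA read from the mapped region. */
ssize_t toy_get(void *buf, size_t count, off_t offset)
{
    char *src = toy_resolve(offset, count);
    if (!src)
        return -1;
    memcpy(buf, src, count);
    return (ssize_t)count;
}
```

The toy returns ssize_t from put/get so that a failed transfer can be reported as -1; the prototypes in the RFC return size_t.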