* nfsd: managing pages under network I/O
From: Chuck Lever @ 2016-10-12 17:42 UTC
To: Linux NFS Mailing List

I'm studying the way that the ->recvfrom and ->sendto calls work
for RPC-over-RDMA.

The ->sendto path moves pages out of the svc_rqst before posting
I/O (RDMA Write and Send). Once the work is posted, ->sendto
returns, and it looks like the svc_rqst is released at that point.
The subsequent completion of the Send then releases those moved
pages.

I'm wondering if the transport can be simplified: instead of
moving pages around, ->sendto could just wait until the Write and
Send activity is complete, then return. The upper layer then
releases everything.

Another option would be for ->sendto to return a value that means
the transport will release the svc_rqst and pages.

Or, the svc_rqst could be reference counted.

Anyone have thoughts about this?

--
Chuck Lever
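(A minimal sketch of the reference-counting option, for illustration only: it assumes a new "struct kref rq_ref" field in struct svc_rqst and hypothetical svc_rqst_get()/svc_rqst_put() helpers, none of which exist in the tree.)

	/* Hypothetical sketch only: assumes svc_rqst gains a
	 * "struct kref rq_ref" that is re-initialized per request. */
	static void svc_rqst_release(struct kref *ref)
	{
		struct svc_rqst *rqstp = container_of(ref, struct svc_rqst, rq_ref);

		/* Last reference dropped: safe to put the request's pages
		 * and transport references. */
		svc_xprt_release(rqstp);
	}

	static inline void svc_rqst_get(struct svc_rqst *rqstp)
	{
		kref_get(&rqstp->rq_ref);
	}

	static inline void svc_rqst_put(struct svc_rqst *rqstp)
	{
		kref_put(&rqstp->rq_ref, svc_rqst_release);
	}

	/* The nfsd thread would hold one reference for the duration of the
	 * request; a transport posting asynchronous I/O (RDMA Write/Send)
	 * would take a second reference and drop it on completion. */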
* Re: nfsd: managing pages under network I/O
From: Christoph Hellwig @ 2016-10-13 6:35 UTC
To: Chuck Lever
Cc: Linux NFS Mailing List

On Wed, Oct 12, 2016 at 01:42:26PM -0400, Chuck Lever wrote:
> I'm studying the way that the ->recvfrom and ->sendto calls work
> for RPC-over-RDMA.
>
> The ->sendto path moves pages out of the svc_rqst before posting
> I/O (RDMA Write and Send). Once the work is posted, ->sendto
> returns, and it looks like the svc_rqst is released at that point.
> The subsequent completion of the Send then releases those moved
> pages.
>
> I'm wondering if the transport can be simplified: instead of
> moving pages around, ->sendto could just wait until the Write and
> Send activity is complete, then return. The upper layer then
> releases everything.

I'd prefer not to block for no reason at all.

> Another option would be for ->sendto to return a value that means
> the transport will release the svc_rqst and pages.

Or just let the transport always release it. We only have two
different implementations of the relevant ops anyway.
* Re: nfsd: managing pages under network I/O
From: Chuck Lever @ 2016-10-13 13:36 UTC
To: Christoph Hellwig
Cc: Linux NFS Mailing List

> On Oct 13, 2016, at 2:35 AM, Christoph Hellwig <hch@infradead.org> wrote:
>
> On Wed, Oct 12, 2016 at 01:42:26PM -0400, Chuck Lever wrote:
>> I'm studying the way that the ->recvfrom and ->sendto calls work
>> for RPC-over-RDMA.
>>
>> The ->sendto path moves pages out of the svc_rqst before posting
>> I/O (RDMA Write and Send). Once the work is posted, ->sendto
>> returns, and it looks like the svc_rqst is released at that point.
>> The subsequent completion of the Send then releases those moved
>> pages.
>>
>> I'm wondering if the transport can be simplified: instead of
>> moving pages around, ->sendto could just wait until the Write and
>> Send activity is complete, then return. The upper layer then
>> releases everything.
>
> I'd prefer not to block for no reason at all.

The reason to block is to wait for I/O to complete. Can you
elaborate: for example, are you leery of an extra context
switch?

Note that in the Read/recvfrom case, RDMA Reads are posted,
and then svc_rdma_recvfrom returns "deferred." When the Reads
complete, the completion handler enqueues the work on the
svc_xprt and svc_rdma_recvfrom is called again. Can someone
explain why svc_rdma_recvfrom doesn't just wait for the
RDMA Reads to complete?

What is the purpose of pulling the pages out of the svc_rqst
while waiting for the RDMA Reads?

>> Another option would be for ->sendto to return a value that means
>> the transport will release the svc_rqst and pages.
>
> Or just let the transport always release it. We only have two
> different implementations of the relevant ops anyway.

If each transport is responsible for releasing the svc_rqst and
pages, code is duplicated. Both transports need to do this
releasing, so it should be done in common code (the svc RPC
code).

I'd prefer not to have to touch the TCP transport in order
to improve the RDMA transport implementation. However, some
clean-up and refactoring in this area may be unavoidable.

--
Chuck Lever
* Re: nfsd: managing pages under network I/O
From: Chuck Lever @ 2016-10-13 14:32 UTC
To: Christoph Hellwig
Cc: Linux NFS Mailing List

> On Oct 13, 2016, at 9:36 AM, Chuck Lever <chuck.lever@oracle.com> wrote:
>
>> On Oct 13, 2016, at 2:35 AM, Christoph Hellwig <hch@infradead.org> wrote:
>>
>> On Wed, Oct 12, 2016 at 01:42:26PM -0400, Chuck Lever wrote:
>>> I'm studying the way that the ->recvfrom and ->sendto calls work
>>> for RPC-over-RDMA.
>>>
>>> The ->sendto path moves pages out of the svc_rqst before posting
>>> I/O (RDMA Write and Send). Once the work is posted, ->sendto
>>> returns, and it looks like the svc_rqst is released at that point.
>>> The subsequent completion of the Send then releases those moved
>>> pages.
>>>
>>> I'm wondering if the transport can be simplified: instead of
>>> moving pages around, ->sendto could just wait until the Write and
>>> Send activity is complete, then return. The upper layer then
>>> releases everything.
>>
>> I'd prefer not to block for no reason at all.
>
> The reason to block is to wait for I/O to complete. Can you
> elaborate: for example, are you leery of an extra context
> switch?
>
> Note that in the Read/recvfrom case, RDMA Reads are posted,
> and then svc_rdma_recvfrom returns "deferred." When the Reads
> complete, the completion handler enqueues the work on the
> svc_xprt and svc_rdma_recvfrom is called again. Can someone
> explain why svc_rdma_recvfrom doesn't just wait for the
> RDMA Reads to complete?
>
> What is the purpose of pulling the pages out of the svc_rqst
> while waiting for the RDMA Reads?
>
>>> Another option would be for ->sendto to return a value that means
>>> the transport will release the svc_rqst and pages.
>>
>> Or just let the transport always release it. We only have two
>> different implementations of the relevant ops anyway.
>
> If each transport is responsible for releasing the svc_rqst and
> pages, code is duplicated. Both transports need to do this
> releasing, so it should be done in common code (the svc RPC
> code).
>
> I'd prefer not to have to touch the TCP transport in order
> to improve the RDMA transport implementation. However, some
> clean-up and refactoring in this area may be unavoidable.

svc_send does this:

  938         mutex_lock(&xprt->xpt_mutex);
  943         len = xprt->xpt_ops->xpo_sendto(rqstp);
  944         mutex_unlock(&xprt->xpt_mutex);
  945         rpc_wake_up(&xprt->xpt_bc_pending);
  946         svc_xprt_release(rqstp);

Thus waiting in sendto is indeed not good: it would hold up other
replies.

But I think you're suggesting that svc_xprt_release() should be
invoked by the transport, and not by svc_send(). That could work if
there isn't a hidden dependency on the ordering of lines 944-946.

The socket transport appears to be able to complete send processing
synchronously. Invoking svc_xprt_release() inside the mutex critical
section would probably have a negative performance impact.

We could add another xpo callout here. For sockets it would invoke
svc_xprt_release, and for RDMA it would be a no-op. RDMA would then
be responsible for invoking svc_xprt_release in its send completion
handler.

--
Chuck Lever
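(A rough sketch of the proposed callout, for illustration only: the xpo_release_reply name and both implementations below are hypothetical, not code from the tree.)

	/* Hypothetical sketch: a new transport callout, called here
	 * xpo_release_reply, lets each transport decide when the
	 * svc_rqst's resources are released after xpo_sendto(). */

	/* svc_send(), common code: */
	mutex_lock(&xprt->xpt_mutex);
	len = xprt->xpt_ops->xpo_sendto(rqstp);
	mutex_unlock(&xprt->xpt_mutex);
	rpc_wake_up(&xprt->xpt_bc_pending);
	xprt->xpt_ops->xpo_release_reply(rqstp);

	/* svcsock.c: send processing has completed by the time
	 * xpo_sendto() returns, so release everything right away. */
	static void svc_sock_release_reply(struct svc_rqst *rqstp)
	{
		svc_xprt_release(rqstp);
	}

	/* svc_rdma_transport.c: the posted Write/Send WRs still reference
	 * the pages, so do nothing here; the Send completion handler
	 * would invoke svc_xprt_release() (or an equivalent) instead. */
	static void svc_rdma_release_reply(struct svc_rqst *rqstp)
	{
	}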
* Re: nfsd: managing pages under network I/O
From: J. Bruce Fields @ 2016-10-19 17:26 UTC
To: Chuck Lever
Cc: Linux NFS Mailing List

On Wed, Oct 12, 2016 at 01:42:26PM -0400, Chuck Lever wrote:
> I'm studying the way that the ->recvfrom and ->sendto calls work
> for RPC-over-RDMA.
>
> The ->sendto path moves pages out of the svc_rqst before posting
> I/O (RDMA Write and Send). Once the work is posted, ->sendto
> returns, and it looks like the svc_rqst is released at that point.
> The subsequent completion of the Send then releases those moved
> pages.
>
> I'm wondering if the transport can be simplified: instead of
> moving pages around, ->sendto could just wait until the Write and
> Send activity is complete, then return. The upper layer then
> releases everything.

I don't understand what problem you're trying to fix. Is "moving pages
around" really especially complicated or expensive?

--b.

> Another option would be for ->sendto to return a value that means
> the transport will release the svc_rqst and pages.
>
> Or, the svc_rqst could be reference counted.
>
> Anyone have thoughts about this?
>
> --
> Chuck Lever
* Re: nfsd: managing pages under network I/O
From: Chuck Lever @ 2016-10-19 18:28 UTC
To: J. Bruce Fields
Cc: Linux NFS Mailing List

> On Oct 19, 2016, at 1:26 PM, bfields@fieldses.org wrote:
>
> On Wed, Oct 12, 2016 at 01:42:26PM -0400, Chuck Lever wrote:
>> I'm studying the way that the ->recvfrom and ->sendto calls work
>> for RPC-over-RDMA.
>>
>> The ->sendto path moves pages out of the svc_rqst before posting
>> I/O (RDMA Write and Send). Once the work is posted, ->sendto
>> returns, and it looks like the svc_rqst is released at that point.
>> The subsequent completion of the Send then releases those moved
>> pages.
>>
>> I'm wondering if the transport can be simplified: instead of
>> moving pages around, ->sendto could just wait until the Write and
>> Send activity is complete, then return. The upper layer then
>> releases everything.
>
> I don't understand what problem you're trying to fix. Is "moving pages
> around" really especially complicated or expensive?

Moving pages is complicated and undocumented, and adds host CPU and
memory costs (loads and stores). There is also a whole lot of page
allocator traffic in this code, and that will be enabling and
disabling interrupts and bottom-halves with some frequency.

The less involvement by the host CPU the better. The current RDMA
transport code documents the RPC-over-RDMA protocol to some extent,
but the need for moving the pages is left to the reader's imagination.

From my code audit:

The svc_rqst is essentially a static piece of memory, and the calling
NFSD kthread will re-use it immediately when svc_sendto() returns. It
cannot be manipulated asynchronously. The recvfrom calls and the
sendto call for one RPC request can each use a different svc_rqst,
for example. So if the transport needs to use pages for longer than
the length of one svc_sendto call (for example, to wait for RDMA Write
to complete) then it has to move those pages out of the svc_rqst
before it returns. (Let me know if I've got this wrong.)

I'm falling back to looking at ways to clean up and document the
movement of these pages so it is better understood why it is done.
There also could be some opportunity to share page movement helpers
between the RDMA read and write path (and perhaps the TCP transport
too).

HTH.

--
Chuck Lever
* Re: nfsd: managing pages under network I/O
From: J. Bruce Fields @ 2016-10-19 19:16 UTC
To: Chuck Lever
Cc: Linux NFS Mailing List

On Wed, Oct 19, 2016 at 02:28:40PM -0400, Chuck Lever wrote:
>
> > On Oct 19, 2016, at 1:26 PM, bfields@fieldses.org wrote:
> >
> > On Wed, Oct 12, 2016 at 01:42:26PM -0400, Chuck Lever wrote:
> >> I'm studying the way that the ->recvfrom and ->sendto calls work
> >> for RPC-over-RDMA.
> >>
> >> The ->sendto path moves pages out of the svc_rqst before posting
> >> I/O (RDMA Write and Send). Once the work is posted, ->sendto
> >> returns, and it looks like the svc_rqst is released at that point.
> >> The subsequent completion of the Send then releases those moved
> >> pages.
> >>
> >> I'm wondering if the transport can be simplified: instead of
> >> moving pages around, ->sendto could just wait until the Write and
> >> Send activity is complete, then return. The upper layer then
> >> releases everything.
> >
> > I don't understand what problem you're trying to fix. Is "moving pages
> > around" really especially complicated or expensive?
>
> Moving pages is complicated and undocumented, and adds host CPU and
> memory costs (loads and stores). There is also a whole lot of page
> allocator traffic in this code, and that will be enabling and
> disabling interrupts and bottom-halves with some frequency.

I thought "moving pages around" here is basically just this, from
svc_rdma_sendto.c:send_reply():

	pages = rqstp->rq_next_page - rqstp->rq_respages;
	for (page_no = 0; page_no < pages; page_no++) {
		ctxt->pages[page_no+1] = rqstp->rq_respages[page_no];
		...
		rqstp->rq_respages[page_no] = NULL;
		...
	}

So we're just copying an array of page pointers from one place to
another, and zeroing out the source array.

For a short reply that could be 1 page or even none. In the worst case
(a 1MB read result) that could be 256 8-byte pointers, so 2K.

Am I missing something? Has that up-to-2K operation been a problem?

--b.

> The less involvement by the host CPU the better. The current RDMA
> transport code documents the RPC-over-RDMA protocol to some extent,
> but the need for moving the pages is left to the reader's imagination.
>
> From my code audit:
>
> The svc_rqst is essentially a static piece of memory, and the calling
> NFSD kthread will re-use it immediately when svc_sendto() returns. It
> cannot be manipulated asynchronously. The recvfrom calls and the
> sendto call for one RPC request can each use a different svc_rqst,
> for example. So if the transport needs to use pages for longer than
> the length of one svc_sendto call (for example, to wait for RDMA Write
> to complete) then it has to move those pages out of the svc_rqst
> before it returns. (Let me know if I've got this wrong.)
>
> I'm falling back to looking at ways to clean up and document the
> movement of these pages so it is better understood why it is done.
> There also could be some opportunity to share page movement helpers
> between the RDMA read and write path (and perhaps the TCP transport
> too).
>
> HTH.
>
> > --b.
> >
> >> Another option would be for ->sendto to return a value that means
> >> the transport will release the svc_rqst and pages.
> >>
> >> Or, the svc_rqst could be reference counted.
> >>
> >> Anyone have thoughts about this?
>
> --
> Chuck Lever
* Re: nfsd: managing pages under network I/O
From: Chuck Lever @ 2016-10-19 19:21 UTC
To: J. Bruce Fields
Cc: Linux NFS Mailing List

> On Oct 19, 2016, at 3:16 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
>
> On Wed, Oct 19, 2016 at 02:28:40PM -0400, Chuck Lever wrote:
>>
>>> On Oct 19, 2016, at 1:26 PM, bfields@fieldses.org wrote:
>>>
>>> On Wed, Oct 12, 2016 at 01:42:26PM -0400, Chuck Lever wrote:
>>>> I'm studying the way that the ->recvfrom and ->sendto calls work
>>>> for RPC-over-RDMA.
>>>>
>>>> The ->sendto path moves pages out of the svc_rqst before posting
>>>> I/O (RDMA Write and Send). Once the work is posted, ->sendto
>>>> returns, and it looks like the svc_rqst is released at that point.
>>>> The subsequent completion of the Send then releases those moved
>>>> pages.
>>>>
>>>> I'm wondering if the transport can be simplified: instead of
>>>> moving pages around, ->sendto could just wait until the Write and
>>>> Send activity is complete, then return. The upper layer then
>>>> releases everything.
>>>
>>> I don't understand what problem you're trying to fix. Is "moving pages
>>> around" really especially complicated or expensive?
>>
>> Moving pages is complicated and undocumented, and adds host CPU and
>> memory costs (loads and stores). There is also a whole lot of page
>> allocator traffic in this code, and that will be enabling and
>> disabling interrupts and bottom-halves with some frequency.
>
> I thought "moving pages around" here is basically just this, from
> svc_rdma_sendto.c:send_reply():
>
> 	pages = rqstp->rq_next_page - rqstp->rq_respages;
> 	for (page_no = 0; page_no < pages; page_no++) {
> 		ctxt->pages[page_no+1] = rqstp->rq_respages[page_no];
> 		...
> 		rqstp->rq_respages[page_no] = NULL;
> 		...
> 	}
>
> So we're just copying an array of page pointers from one place to
> another, and zeroing out the source array.
>
> For a short reply that could be 1 page or even none. In the worst case
> (a 1MB read result) that could be 256 8-byte pointers, so 2K.
>
> Am I missing something? Has that up-to-2K operation been a problem?

Not a problem, but not optimal either. Like I said, I can put together
some helpers so that this code is not duplicated, and the call-sites
would be a little easier to understand.

> --b.
>
>> The less involvement by the host CPU the better. The current RDMA
>> transport code documents the RPC-over-RDMA protocol to some extent,
>> but the need for moving the pages is left to the reader's imagination.
>>
>> From my code audit:
>>
>> The svc_rqst is essentially a static piece of memory, and the calling
>> NFSD kthread will re-use it immediately when svc_sendto() returns. It
>> cannot be manipulated asynchronously. The recvfrom calls and the
>> sendto call for one RPC request can each use a different svc_rqst,
>> for example. So if the transport needs to use pages for longer than
>> the length of one svc_sendto call (for example, to wait for RDMA Write
>> to complete) then it has to move those pages out of the svc_rqst
>> before it returns. (Let me know if I've got this wrong.)
>>
>> I'm falling back to looking at ways to clean up and document the
>> movement of these pages so it is better understood why it is done.
>> There also could be some opportunity to share page movement helpers
>> between the RDMA read and write path (and perhaps the TCP transport
>> too).
>>
>> HTH.
>>
>>> --b.
>>>
>>>> Another option would be for ->sendto to return a value that means
>>>> the transport will release the svc_rqst and pages.
>>>>
>>>> Or, the svc_rqst could be reference counted.
>>>>
>>>> Anyone have thoughts about this?
>>
>> --
>> Chuck Lever

--
Chuck Lever
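(A sketch of the kind of shared helper alluded to here, for illustration only: the helper itself is hypothetical, though the field names follow the send_reply() code quoted above.)

	/* Hypothetical helper: move the reply pages out of the svc_rqst
	 * into the transport-owned context so they survive until the
	 * RDMA Write/Send completions fire.  Clearing rq_respages[]
	 * keeps svc_xprt_release() from putting pages the ctxt now owns. */
	static int svc_rdma_move_respages(struct svc_rqst *rqstp,
					  struct svc_rdma_op_ctxt *ctxt)
	{
		int pages = rqstp->rq_next_page - rqstp->rq_respages;
		int page_no;

		for (page_no = 0; page_no < pages; page_no++) {
			ctxt->pages[page_no + 1] = rqstp->rq_respages[page_no];
			rqstp->rq_respages[page_no] = NULL;
		}
		ctxt->count = pages + 1;	/* slot 0 already holds the header page */
		return pages;
	}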
* Re: nfsd: managing pages under network I/O
From: J. Bruce Fields @ 2016-10-19 19:26 UTC
To: Chuck Lever
Cc: Linux NFS Mailing List

On Wed, Oct 19, 2016 at 03:21:02PM -0400, Chuck Lever wrote:
> > On Oct 19, 2016, at 3:16 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> > I thought "moving pages around" here is basically just this, from
> > svc_rdma_sendto.c:send_reply():
> >
> > 	pages = rqstp->rq_next_page - rqstp->rq_respages;
> > 	for (page_no = 0; page_no < pages; page_no++) {
> > 		ctxt->pages[page_no+1] = rqstp->rq_respages[page_no];
> > 		...
> > 		rqstp->rq_respages[page_no] = NULL;
> > 		...
> > 	}
> >
> > So we're just copying an array of page pointers from one place to
> > another, and zeroing out the source array.
> >
> > For a short reply that could be 1 page or even none. In the worst case
> > (a 1MB read result) that could be 256 8-byte pointers, so 2K.
> >
> > Am I missing something? Has that up-to-2K operation been a problem?
>
> Not a problem, but not optimal either. Like I said, I can put together
> some helpers so that this code is not duplicated, and the call-sites
> would be a little easier to understand.

Sure, I've no objection if you find some cleanup that makes sense.

--b.