* Re: [PATCH 0/11] svcrdma: WR context management bug fixes and cleanup [not found] <12120836962076-git-send-email-tom@opengridcomputing.com> @ 2008-05-29 22:10 ` J. Bruce Fields 2008-05-29 22:25 ` Tom Tucker [not found] ` <12120836962324-git-send-email-tom@opengridcomputing.com> 1 sibling, 1 reply; 20+ messages in thread From: J. Bruce Fields @ 2008-05-29 22:10 UTC (permalink / raw) To: Tom Tucker; +Cc: linux-nfs On Thu, May 29, 2008 at 12:54:45PM -0500, Tom Tucker wrote: > Bruce: > > This set of patches fixes bugs related to DMA mapping and unmapping and > in the process redesigns and cleans up the way that WR contexts are managed. > > In addition to the bug fixes, there are three major changes: > > - The data structure that maps client side memory to server side > memory has been separated from the WR context. This makes for cleaner > type checking and a smaller memory footprint for the RDMA transport. > > - The mapping and unmapping of DMA is performed independently from the > allocation and deallocation of the WR context. This is to allow the > unmapping to occur early on recv where the unmapping of the DMA needs > to be done prior to looking at the data and therefore before > deallocating the WR context. > > - The WR context cache now uses a kmem_cache and is shared across > all transport instances. This allows contexts to be shared between > mounts and idle mounts don't consume context memory. > > This patchset is intended for 2.6.27 and is based on 2.6.26-rc4. These patches > are also available at the following git url: OK, thanks. Bug me if I haven't commented or applied these by next week. Is there anything else remaining for 2.6.26 (as opposed to .27)? --b. > > git://git.linux-nfs.org/projects/tomtucker/xprt-switch-2.6.git > > [PATCH 01/11] svcrdma: Add a type for keeping NFS request mapping > > include/linux/sunrpc/svc_rdma.h | 27 +++++++++++++++++++++++++++ > net/sunrpc/xprtrdma/svc_rdma.c | 23 +++++++++++++++++++++++ > net/sunrpc/xprtrdma/svc_rdma_transport.c | 26 ++++++++++++++++++++++++++ > 3 files changed, 76 insertions(+), 0 deletions(-) > > [PATCH 02/11] svcrdma: Use rpc reply map for RDMA_WRITE processing > > net/sunrpc/xprtrdma/svc_rdma_sendto.c | 122 ++++++++++++++---------------- > net/sunrpc/xprtrdma/svc_rdma_transport.c | 5 +- > 2 files changed, 62 insertions(+), 65 deletions(-) > > [PATCH 03/11] svcrdma: Use reply and chunk map for RDMA_READ processing > > include/linux/sunrpc/svc_rdma.h | 1 + > net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 83 ++++++++++++++----------------- > 2 files changed, 39 insertions(+), 45 deletions(-) > > [PATCH 04/11] svcrdma: Move the DMA unmap logic to the CQ handler > > net/sunrpc/xprtrdma/svc_rdma_transport.c | 20 ++++++++++++++------ > 1 files changed, 14 insertions(+), 6 deletions(-) > > [PATCH 05/11] svcrdma: Add dma map count and WARN_ON > > include/linux/sunrpc/svc_rdma.h | 1 + > net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 1 + > net/sunrpc/xprtrdma/svc_rdma_sendto.c | 3 +++ > net/sunrpc/xprtrdma/svc_rdma_transport.c | 5 +++++ > 4 files changed, 10 insertions(+), 0 deletions(-) > > [PATCH 06/11] svcrdma: Remove unneeded spin locks from __svc_rdma_free > > net/sunrpc/xprtrdma/svc_rdma_transport.c | 4 ---- > 1 files changed, 0 insertions(+), 4 deletions(-) > > [PATCH 07/11] svcrdma: Remove unused wait q from svcrdma_xprt structure > > include/linux/sunrpc/svc_rdma.h | 1 - > 1 files changed, 0 insertions(+), 1 deletions(-) > > [PATCH 08/11] svcrdma: Limit ORD based on client's advertised IRD > > net/sunrpc/xprtrdma/svc_rdma_transport.c | 16 ++++++++++++---- > 1 files changed, 12 insertions(+), 4 deletions(-) > > [PATCH 09/11] svcrdma: Add flush_scheduled_work to module exit function > > net/sunrpc/xprtrdma/svc_rdma.c | 1 + > 1 files changed, 1 insertions(+), 0 deletions(-) > > [PATCH 10/11] svcrdma: Create a kmem cache for the WR contexts > > net/sunrpc/xprtrdma/svc_rdma.c | 24 ++++++++++++++++++++---- > 1 files changed, 20 insertions(+), 4 deletions(-) > > [PATCH 11/11] svcrdma: Change WR context get/put to use the kmem cache > > include/linux/sunrpc/svc_rdma.h | 6 -- > net/sunrpc/xprtrdma/svc_rdma_transport.c | 121 +++--------------------------- > 2 files changed, 12 insertions(+), 115 deletions(-) > > Signed-off-by: Tom Tucker <tom@opengridcomputing.com> > ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/11] svcrdma: WR context management bug fixes and cleanup 2008-05-29 22:10 ` [PATCH 0/11] svcrdma: WR context management bug fixes and cleanup J. Bruce Fields @ 2008-05-29 22:25 ` Tom Tucker [not found] ` <1212099937.22478.3.camel-SMNkleLxa3ZimH42XvhXlA@public.gmane.org> 0 siblings, 1 reply; 20+ messages in thread From: Tom Tucker @ 2008-05-29 22:25 UTC (permalink / raw) To: J. Bruce Fields; +Cc: linux-nfs, Roland Dreier, swise On Thu, 2008-05-29 at 18:10 -0400, J. Bruce Fields wrote: > On Thu, May 29, 2008 at 12:54:45PM -0500, Tom Tucker wrote: > > Bruce: > > > > This set of patches fixes bugs related to DMA mapping and unmapping and > > in the process redesigns and cleans up the way that WR contexts are managed. > > > > In addition to the bug fixes, there are three major changes: > > > > - The data structure that maps client side memory to server side > > memory has been separated from the WR context. This makes for cleaner > > type checking and a smaller memory footprint for the RDMA transport. > > > > - The mapping and unmapping of DMA is performed independently from the > > allocation and deallocation of the WR context. This is to allow the > > unmapping to occur early on recv where the unmapping of the DMA needs > > to be done prior to looking at the data and therefore before > > deallocating the WR context. > > > > - The WR context cache now uses a kmem_cache and is shared across > > all transport instances. This allows contexts to be shared between > > mounts and idle mounts don't consume context memory. > > > > This patchset is intended for 2.6.27 and is based on 2.6.26-rc4. These patches > > are also available at the following git url: > > OK, thanks. Bug me if I haven't commented or applied these by next > week. > > Is there anything else remaining for 2.6.26 (as opposed to .27)? I'm inclined to say "no". But I'd like to hear if anyone out there is dead in the water due to a bug in the 2.6.26 version. Tom > > --b. > > > > > git://git.linux-nfs.org/projects/tomtucker/xprt-switch-2.6.git > > > > [PATCH 01/11] svcrdma: Add a type for keeping NFS request mapping > > > > include/linux/sunrpc/svc_rdma.h | 27 +++++++++++++++++++++++++++ > > net/sunrpc/xprtrdma/svc_rdma.c | 23 +++++++++++++++++++++++ > > net/sunrpc/xprtrdma/svc_rdma_transport.c | 26 ++++++++++++++++++++++++++ > > 3 files changed, 76 insertions(+), 0 deletions(-) > > > > [PATCH 02/11] svcrdma: Use rpc reply map for RDMA_WRITE processing > > > > net/sunrpc/xprtrdma/svc_rdma_sendto.c | 122 ++++++++++++++---------------- > > net/sunrpc/xprtrdma/svc_rdma_transport.c | 5 +- > > 2 files changed, 62 insertions(+), 65 deletions(-) > > > > [PATCH 03/11] svcrdma: Use reply and chunk map for RDMA_READ processing > > > > include/linux/sunrpc/svc_rdma.h | 1 + > > net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 83 ++++++++++++++----------------- > > 2 files changed, 39 insertions(+), 45 deletions(-) > > > > [PATCH 04/11] svcrdma: Move the DMA unmap logic to the CQ handler > > > > net/sunrpc/xprtrdma/svc_rdma_transport.c | 20 ++++++++++++++------ > > 1 files changed, 14 insertions(+), 6 deletions(-) > > > > [PATCH 05/11] svcrdma: Add dma map count and WARN_ON > > > > include/linux/sunrpc/svc_rdma.h | 1 + > > net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 1 + > > net/sunrpc/xprtrdma/svc_rdma_sendto.c | 3 +++ > > net/sunrpc/xprtrdma/svc_rdma_transport.c | 5 +++++ > > 4 files changed, 10 insertions(+), 0 deletions(-) > > > > [PATCH 06/11] svcrdma: Remove unneeded spin locks from __svc_rdma_free > > > > net/sunrpc/xprtrdma/svc_rdma_transport.c | 4 ---- > > 1 files changed, 0 insertions(+), 4 deletions(-) > > > > [PATCH 07/11] svcrdma: Remove unused wait q from svcrdma_xprt structure > > > > include/linux/sunrpc/svc_rdma.h | 1 - > > 1 files changed, 0 insertions(+), 1 deletions(-) > > > > [PATCH 08/11] svcrdma: Limit ORD based on client's advertised IRD > > > > net/sunrpc/xprtrdma/svc_rdma_transport.c | 16 ++++++++++++---- > > 1 files changed, 12 insertions(+), 4 deletions(-) > > > > [PATCH 09/11] svcrdma: Add flush_scheduled_work to module exit function > > > > net/sunrpc/xprtrdma/svc_rdma.c | 1 + > > 1 files changed, 1 insertions(+), 0 deletions(-) > > > > [PATCH 10/11] svcrdma: Create a kmem cache for the WR contexts > > > > net/sunrpc/xprtrdma/svc_rdma.c | 24 ++++++++++++++++++++---- > > 1 files changed, 20 insertions(+), 4 deletions(-) > > > > [PATCH 11/11] svcrdma: Change WR context get/put to use the kmem cache > > > > include/linux/sunrpc/svc_rdma.h | 6 -- > > net/sunrpc/xprtrdma/svc_rdma_transport.c | 121 +++--------------------------- > > 2 files changed, 12 insertions(+), 115 deletions(-) > > > > Signed-off-by: Tom Tucker <tom@opengridcomputing.com> > > ^ permalink raw reply [flat|nested] 20+ messages in thread
[parent not found: <1212099937.22478.3.camel-SMNkleLxa3ZimH42XvhXlA@public.gmane.org>]
* Re: [PATCH 0/11] svcrdma: WR context management bug fixes and cleanup [not found] ` <1212099937.22478.3.camel-SMNkleLxa3ZimH42XvhXlA@public.gmane.org> @ 2008-05-29 22:26 ` Roland Dreier 0 siblings, 0 replies; 20+ messages in thread From: Roland Dreier @ 2008-05-29 22:26 UTC (permalink / raw) To: Tom Tucker; +Cc: J. Bruce Fields, linux-nfs, swise > I'm inclined to say "no". But I'd like to hear if anyone out there is > dead in the water due to a bug in the 2.6.26 version. To be honest I haven't had a chance to do much NFS/RDMA testing with the current 2.6.26 tree. But last I looked there were no major regressions at least (still open issues I think, but definitely better than 2.6.25) ^ permalink raw reply [flat|nested] 20+ messages in thread
[parent not found: <12120836962324-git-send-email-tom@opengridcomputing.com>]
* Re: [PATCH 01/11] svcrdma: Add a type for keeping NFS RPC mapping [not found] ` <12120836962324-git-send-email-tom@opengridcomputing.com> @ 2008-06-16 19:48 ` J. Bruce Fields 2008-06-21 16:31 ` Tom Tucker [not found] ` <12120836963727-git-send-email-tom@opengridcomputing.com> 1 sibling, 1 reply; 20+ messages in thread From: J. Bruce Fields @ 2008-06-16 19:48 UTC (permalink / raw) To: Tom Tucker; +Cc: linux-nfs On Thu, May 29, 2008 at 12:54:46PM -0500, Tom Tucker wrote: > Create a new data structure to hold the remote client address space > to local server address space mapping. > > Signed-off-by: Tom Tucker <tom@opengridcomputing.com> > > --- > include/linux/sunrpc/svc_rdma.h | 27 +++++++++++++++++++++++++++ > net/sunrpc/xprtrdma/svc_rdma.c | 23 +++++++++++++++++++++++ > net/sunrpc/xprtrdma/svc_rdma_transport.c | 26 ++++++++++++++++++++++++++ > 3 files changed, 76 insertions(+), 0 deletions(-) > > diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h > index 05eb466..bd8749c 100644 > --- a/include/linux/sunrpc/svc_rdma.h > +++ b/include/linux/sunrpc/svc_rdma.h > @@ -86,6 +86,31 @@ struct svc_rdma_op_ctxt { > struct page *pages[RPCSVC_MAXPAGES]; > }; > > +/* > + * NFS_ requests are mapped on the client side by the chunk lists in > + * the RPCRDMA header. During the fetching of the RPC from the client > + * and the writing of the reply to the client, the memory in the > + * client and the memory in the server must be mapped as contiguous > + * vaddr/len for access by the hardware. These data strucures keep > + * these mappings. > + * > + * For an RDMA_WRITE, the 'sge' maps the RPC REPLY. For RDMA_READ, the > + * 'sge' in the svc_rdma_req_map maps the server side RPC reply and the > + * 'ch' field maps the read-list of the RPCRDMA header to the 'sge' > + * mapping of the reply. > + */ > +struct svc_rdma_chunk_sge { > + int start; /* sge no for this chunk */ > + int count; /* sge count for this chunk */ > +}; > +struct svc_rdma_req_map { > + unsigned long count; > + union { > + struct kvec sge[RPCSVC_MAXPAGES]; > + struct svc_rdma_chunk_sge ch[RPCSVC_MAXPAGES]; > + }; > +}; > + > #define RDMACTXT_F_LAST_CTXT 2 > > struct svcxprt_rdma { > @@ -173,6 +198,8 @@ extern int svc_rdma_post_recv(struct svcxprt_rdma *); > extern int svc_rdma_create_listen(struct svc_serv *, int, struct sockaddr *); > extern struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *); > extern void svc_rdma_put_context(struct svc_rdma_op_ctxt *, int); > +extern struct svc_rdma_req_map *svc_rdma_get_req_map(void); > +extern void svc_rdma_put_req_map(struct svc_rdma_req_map *); > extern void svc_sq_reap(struct svcxprt_rdma *); > extern void svc_rq_reap(struct svcxprt_rdma *); > extern struct svc_xprt_class svc_rdma_class; > diff --git a/net/sunrpc/xprtrdma/svc_rdma.c b/net/sunrpc/xprtrdma/svc_rdma.c > index 88c0ca2..545ea72 100644 > --- a/net/sunrpc/xprtrdma/svc_rdma.c > +++ b/net/sunrpc/xprtrdma/svc_rdma.c > @@ -69,6 +69,9 @@ atomic_t rdma_stat_rq_prod; > atomic_t rdma_stat_sq_poll; > atomic_t rdma_stat_sq_prod; > > +/* Temporary NFS request map cache */ > +struct kmem_cache *svc_rdma_map_cachep = NULL; No need to initialize globals to NULL. > + > /* > * This function implements reading and resetting an atomic_t stat > * variable through read/write to a proc file. Any write to the file > @@ -241,6 +244,8 @@ void svc_rdma_cleanup(void) > svcrdma_table_header = NULL; > } > svc_unreg_xprt_class(&svc_rdma_class); > + if (svc_rdma_map_cachep) > + kmem_cache_destroy(svc_rdma_map_cachep); By design this can't be NULL (and the code presumably would have oopsed much earlier if it were mistakenly allowed to be) so I'm inclined not to bother checking.... > } > > int svc_rdma_init(void) > @@ -255,9 +260,27 @@ int svc_rdma_init(void) > svcrdma_table_header = > register_sysctl_table(svcrdma_root_table); > > + /* Create the temporary map cache */ > + svc_rdma_map_cachep = kmem_cache_create("svc_rdma_map_cache", > + sizeof(struct svc_rdma_req_map), > + 0, > + SLAB_HWCACHE_ALIGN, > + NULL); > + if (!svc_rdma_map_cachep) { > + printk(KERN_INFO "Could not allocate map cache.\n"); > + goto err; > + } > + > /* Register RDMA with the SVC transport switch */ > svc_reg_xprt_class(&svc_rdma_class); > return 0; > + > + err: > + if (svcrdma_table_header) { > + unregister_sysctl_table(svcrdma_table_header); unregister_sysctl_table() already handles the NULL case, so you could skip the if (svcrdma_table_header). --b. > + svcrdma_table_header = NULL; > + } > + return -ENOMEM; > } > MODULE_AUTHOR("Tom Tucker <tom@opengridcomputing.com>"); > MODULE_DESCRIPTION("SVC RDMA Transport"); > diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c > index e132509..ae90758 100644 > --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c > +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c > @@ -173,6 +173,32 @@ void svc_rdma_put_context(struct svc_rdma_op_ctxt *ctxt, int free_pages) > atomic_dec(&xprt->sc_ctxt_used); > } > > +/* Temporary NFS request map cache. Created in svc_rdma.c */ > +extern struct kmem_cache *svc_rdma_map_cachep; > + > +/* > + * Temporary NFS req mappings are shared across all transport > + * instances. These are short lived and should be bounded by the number > + * of concurrent server threads * depth of the SQ. > + */ > +struct svc_rdma_req_map *svc_rdma_get_req_map(void) > +{ > + struct svc_rdma_req_map *map; > + while (1) { > + map = kmem_cache_alloc(svc_rdma_map_cachep, GFP_KERNEL); > + if (map) > + break; > + schedule_timeout_uninterruptible(msecs_to_jiffies(500)); > + } > + map->count = 0; > + return map; > +} > + > +void svc_rdma_put_req_map(struct svc_rdma_req_map *map) > +{ > + kmem_cache_free(svc_rdma_map_cachep, map); > +} > + > /* ib_cq event handler */ > static void cq_event_handler(struct ib_event *event, void *context) > { ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 01/11] svcrdma: Add a type for keeping NFS RPC mapping 2008-06-16 19:48 ` [PATCH 01/11] svcrdma: Add a type for keeping NFS RPC mapping J. Bruce Fields @ 2008-06-21 16:31 ` Tom Tucker 2008-06-23 18:27 ` J. Bruce Fields 0 siblings, 1 reply; 20+ messages in thread From: Tom Tucker @ 2008-06-21 16:31 UTC (permalink / raw) To: J. Bruce Fields; +Cc: linux-nfs J. Bruce Fields wrote: > On Thu, May 29, 2008 at 12:54:46PM -0500, Tom Tucker wrote: > >> Create a new data structure to hold the remote client address space >> to local server address space mapping. >> >> Signed-off-by: Tom Tucker <tom@opengridcomputing.com> >> >> --- >> include/linux/sunrpc/svc_rdma.h | 27 +++++++++++++++++++++++++++ >> net/sunrpc/xprtrdma/svc_rdma.c | 23 +++++++++++++++++++++++ >> net/sunrpc/xprtrdma/svc_rdma_transport.c | 26 ++++++++++++++++++++++++++ >> 3 files changed, 76 insertions(+), 0 deletions(-) >> >> diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h >> index 05eb466..bd8749c 100644 >> --- a/include/linux/sunrpc/svc_rdma.h >> +++ b/include/linux/sunrpc/svc_rdma.h >> @@ -86,6 +86,31 @@ struct svc_rdma_op_ctxt { >> struct page *pages[RPCSVC_MAXPAGES]; >> }; >> >> +/* >> + * NFS_ requests are mapped on the client side by the chunk lists in >> + * the RPCRDMA header. During the fetching of the RPC from the client >> + * and the writing of the reply to the client, the memory in the >> + * client and the memory in the server must be mapped as contiguous >> + * vaddr/len for access by the hardware. These data strucures keep >> + * these mappings. >> + * >> + * For an RDMA_WRITE, the 'sge' maps the RPC REPLY. For RDMA_READ, the >> + * 'sge' in the svc_rdma_req_map maps the server side RPC reply and the >> + * 'ch' field maps the read-list of the RPCRDMA header to the 'sge' >> + * mapping of the reply. >> + */ >> +struct svc_rdma_chunk_sge { >> + int start; /* sge no for this chunk */ >> + int count; /* sge count for this chunk */ >> +}; >> +struct svc_rdma_req_map { >> + unsigned long count; >> + union { >> + struct kvec sge[RPCSVC_MAXPAGES]; >> + struct svc_rdma_chunk_sge ch[RPCSVC_MAXPAGES]; >> + }; >> +}; >> + >> #define RDMACTXT_F_LAST_CTXT 2 >> >> struct svcxprt_rdma { >> @@ -173,6 +198,8 @@ extern int svc_rdma_post_recv(struct svcxprt_rdma *); >> extern int svc_rdma_create_listen(struct svc_serv *, int, struct sockaddr *); >> extern struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *); >> extern void svc_rdma_put_context(struct svc_rdma_op_ctxt *, int); >> +extern struct svc_rdma_req_map *svc_rdma_get_req_map(void); >> +extern void svc_rdma_put_req_map(struct svc_rdma_req_map *); >> extern void svc_sq_reap(struct svcxprt_rdma *); >> extern void svc_rq_reap(struct svcxprt_rdma *); >> extern struct svc_xprt_class svc_rdma_class; >> diff --git a/net/sunrpc/xprtrdma/svc_rdma.c b/net/sunrpc/xprtrdma/svc_rdma.c >> index 88c0ca2..545ea72 100644 >> --- a/net/sunrpc/xprtrdma/svc_rdma.c >> +++ b/net/sunrpc/xprtrdma/svc_rdma.c >> @@ -69,6 +69,9 @@ atomic_t rdma_stat_rq_prod; >> atomic_t rdma_stat_sq_poll; >> atomic_t rdma_stat_sq_prod; >> >> +/* Temporary NFS request map cache */ >> +struct kmem_cache *svc_rdma_map_cachep = NULL; >> > > No need to initialize globals to NULL. > > I thought only static objects were initialized per the C standard. Or are you saying that this particular global doesn't need to be initialized because of the way it is used? >> + >> /* >> * This function implements reading and resetting an atomic_t stat >> * variable through read/write to a proc file. Any write to the file >> @@ -241,6 +244,8 @@ void svc_rdma_cleanup(void) >> svcrdma_table_header = NULL; >> } >> svc_unreg_xprt_class(&svc_rdma_class); >> + if (svc_rdma_map_cachep) >> + kmem_cache_destroy(svc_rdma_map_cachep); >> > > By design this can't be NULL (and the code presumably would have oopsed > much earlier if it were mistakenly allowed to be) so I'm inclined not to > bother checking.... > > Ok. >> } >> >> int svc_rdma_init(void) >> @@ -255,9 +260,27 @@ int svc_rdma_init(void) >> svcrdma_table_header = >> register_sysctl_table(svcrdma_root_table); >> >> + /* Create the temporary map cache */ >> + svc_rdma_map_cachep = kmem_cache_create("svc_rdma_map_cache", >> + sizeof(struct svc_rdma_req_map), >> + 0, >> + SLAB_HWCACHE_ALIGN, >> + NULL); >> + if (!svc_rdma_map_cachep) { >> + printk(KERN_INFO "Could not allocate map cache.\n"); >> + goto err; >> + } >> + >> /* Register RDMA with the SVC transport switch */ >> svc_reg_xprt_class(&svc_rdma_class); >> return 0; >> + >> + err: >> + if (svcrdma_table_header) { >> + unregister_sysctl_table(svcrdma_table_header); >> > > unregister_sysctl_table() already handles the NULL case, so you could > skip the if (svcrdma_table_header). > Ok. > --b. > > >> + svcrdma_table_header = NULL; >> + } >> + return -ENOMEM; >> } >> MODULE_AUTHOR("Tom Tucker <tom@opengridcomputing.com>"); >> MODULE_DESCRIPTION("SVC RDMA Transport"); >> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c >> index e132509..ae90758 100644 >> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c >> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c >> @@ -173,6 +173,32 @@ void svc_rdma_put_context(struct svc_rdma_op_ctxt *ctxt, int free_pages) >> atomic_dec(&xprt->sc_ctxt_used); >> } >> >> +/* Temporary NFS request map cache. Created in svc_rdma.c */ >> +extern struct kmem_cache *svc_rdma_map_cachep; >> + >> +/* >> + * Temporary NFS req mappings are shared across all transport >> + * instances. These are short lived and should be bounded by the number >> + * of concurrent server threads * depth of the SQ. >> + */ >> +struct svc_rdma_req_map *svc_rdma_get_req_map(void) >> +{ >> + struct svc_rdma_req_map *map; >> + while (1) { >> + map = kmem_cache_alloc(svc_rdma_map_cachep, GFP_KERNEL); >> + if (map) >> + break; >> + schedule_timeout_uninterruptible(msecs_to_jiffies(500)); >> + } >> + map->count = 0; >> + return map; >> +} >> + >> +void svc_rdma_put_req_map(struct svc_rdma_req_map *map) >> +{ >> + kmem_cache_free(svc_rdma_map_cachep, map); >> +} >> + >> /* ib_cq event handler */ >> static void cq_event_handler(struct ib_event *event, void *context) >> { >> ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 01/11] svcrdma: Add a type for keeping NFS RPC mapping 2008-06-21 16:31 ` Tom Tucker @ 2008-06-23 18:27 ` J. Bruce Fields 2008-06-24 2:58 ` Tom Tucker 0 siblings, 1 reply; 20+ messages in thread From: J. Bruce Fields @ 2008-06-23 18:27 UTC (permalink / raw) To: Tom Tucker; +Cc: linux-nfs On Sat, Jun 21, 2008 at 11:31:43AM -0500, Tom Tucker wrote: > J. Bruce Fields wrote: >> On Thu, May 29, 2008 at 12:54:46PM -0500, Tom Tucker wrote: >> >>> Create a new data structure to hold the remote client address space >>> to local server address space mapping. >>> >>> Signed-off-by: Tom Tucker <tom@opengridcomputing.com> >>> >>> --- >>> include/linux/sunrpc/svc_rdma.h | 27 +++++++++++++++++++++++++++ >>> net/sunrpc/xprtrdma/svc_rdma.c | 23 +++++++++++++++++++++++ >>> net/sunrpc/xprtrdma/svc_rdma_transport.c | 26 ++++++++++++++++++++++++++ >>> 3 files changed, 76 insertions(+), 0 deletions(-) >>> >>> diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h >>> index 05eb466..bd8749c 100644 >>> --- a/include/linux/sunrpc/svc_rdma.h >>> +++ b/include/linux/sunrpc/svc_rdma.h >>> @@ -86,6 +86,31 @@ struct svc_rdma_op_ctxt { >>> struct page *pages[RPCSVC_MAXPAGES]; >>> }; >>> +/* >>> + * NFS_ requests are mapped on the client side by the chunk lists in >>> + * the RPCRDMA header. During the fetching of the RPC from the client >>> + * and the writing of the reply to the client, the memory in the >>> + * client and the memory in the server must be mapped as contiguous >>> + * vaddr/len for access by the hardware. These data strucures keep >>> + * these mappings. >>> + * >>> + * For an RDMA_WRITE, the 'sge' maps the RPC REPLY. For RDMA_READ, the >>> + * 'sge' in the svc_rdma_req_map maps the server side RPC reply and the >>> + * 'ch' field maps the read-list of the RPCRDMA header to the 'sge' >>> + * mapping of the reply. >>> + */ >>> +struct svc_rdma_chunk_sge { >>> + int start; /* sge no for this chunk */ >>> + int count; /* sge count for this chunk */ >>> +}; >>> +struct svc_rdma_req_map { >>> + unsigned long count; >>> + union { >>> + struct kvec sge[RPCSVC_MAXPAGES]; >>> + struct svc_rdma_chunk_sge ch[RPCSVC_MAXPAGES]; >>> + }; >>> +}; >>> + >>> #define RDMACTXT_F_LAST_CTXT 2 >>> struct svcxprt_rdma { >>> @@ -173,6 +198,8 @@ extern int svc_rdma_post_recv(struct svcxprt_rdma *); >>> extern int svc_rdma_create_listen(struct svc_serv *, int, struct sockaddr *); >>> extern struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *); >>> extern void svc_rdma_put_context(struct svc_rdma_op_ctxt *, int); >>> +extern struct svc_rdma_req_map *svc_rdma_get_req_map(void); >>> +extern void svc_rdma_put_req_map(struct svc_rdma_req_map *); >>> extern void svc_sq_reap(struct svcxprt_rdma *); >>> extern void svc_rq_reap(struct svcxprt_rdma *); >>> extern struct svc_xprt_class svc_rdma_class; >>> diff --git a/net/sunrpc/xprtrdma/svc_rdma.c b/net/sunrpc/xprtrdma/svc_rdma.c >>> index 88c0ca2..545ea72 100644 >>> --- a/net/sunrpc/xprtrdma/svc_rdma.c >>> +++ b/net/sunrpc/xprtrdma/svc_rdma.c >>> @@ -69,6 +69,9 @@ atomic_t rdma_stat_rq_prod; >>> atomic_t rdma_stat_sq_poll; >>> atomic_t rdma_stat_sq_prod; >>> +/* Temporary NFS request map cache */ >>> +struct kmem_cache *svc_rdma_map_cachep = NULL; >>> >> >> No need to initialize globals to NULL. >> >> > I thought only static objects were initialized per the C standard. Or > are you saying that this particular > global doesn't need to be initialized because of the way it is used? I don't know if the initialization is mandated by the standards or whether it's just gcc behavior, but I know I get a complaint every time somebody finds one of those.... That may partly just be a preference for conciseness, but I think it may also allow gcc to put the thing in a different section and save some space in the on-disk kernel--I don't know. --b. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 01/11] svcrdma: Add a type for keeping NFS RPC mapping 2008-06-23 18:27 ` J. Bruce Fields @ 2008-06-24 2:58 ` Tom Tucker 2008-06-24 19:58 ` J. Bruce Fields 0 siblings, 1 reply; 20+ messages in thread From: Tom Tucker @ 2008-06-24 2:58 UTC (permalink / raw) To: J. Bruce Fields; +Cc: linux-nfs J. Bruce Fields wrote: > On Sat, Jun 21, 2008 at 11:31:43AM -0500, Tom Tucker wrote: > >> J. Bruce Fields wrote: >> >>> On Thu, May 29, 2008 at 12:54:46PM -0500, Tom Tucker wrote: >>> >>> >>>> Create a new data structure to hold the remote client address space >>>> to local server address space mapping. >>>> >>>> Signed-off-by: Tom Tucker <tom@opengridcomputing.com> >>>> >>>> --- >>>> include/linux/sunrpc/svc_rdma.h | 27 +++++++++++++++++++++++++++ >>>> net/sunrpc/xprtrdma/svc_rdma.c | 23 +++++++++++++++++++++++ >>>> net/sunrpc/xprtrdma/svc_rdma_transport.c | 26 ++++++++++++++++++++++++++ >>>> 3 files changed, 76 insertions(+), 0 deletions(-) >>>> >>>> diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h >>>> index 05eb466..bd8749c 100644 >>>> --- a/include/linux/sunrpc/svc_rdma.h >>>> +++ b/include/linux/sunrpc/svc_rdma.h >>>> @@ -86,6 +86,31 @@ struct svc_rdma_op_ctxt { >>>> struct page *pages[RPCSVC_MAXPAGES]; >>>> }; >>>> +/* >>>> + * NFS_ requests are mapped on the client side by the chunk lists in >>>> + * the RPCRDMA header. During the fetching of the RPC from the client >>>> + * and the writing of the reply to the client, the memory in the >>>> + * client and the memory in the server must be mapped as contiguous >>>> + * vaddr/len for access by the hardware. These data strucures keep >>>> + * these mappings. >>>> + * >>>> + * For an RDMA_WRITE, the 'sge' maps the RPC REPLY. For RDMA_READ, the >>>> + * 'sge' in the svc_rdma_req_map maps the server side RPC reply and the >>>> + * 'ch' field maps the read-list of the RPCRDMA header to the 'sge' >>>> + * mapping of the reply. >>>> + */ >>>> +struct svc_rdma_chunk_sge { >>>> + int start; /* sge no for this chunk */ >>>> + int count; /* sge count for this chunk */ >>>> +}; >>>> +struct svc_rdma_req_map { >>>> + unsigned long count; >>>> + union { >>>> + struct kvec sge[RPCSVC_MAXPAGES]; >>>> + struct svc_rdma_chunk_sge ch[RPCSVC_MAXPAGES]; >>>> + }; >>>> +}; >>>> + >>>> #define RDMACTXT_F_LAST_CTXT 2 >>>> struct svcxprt_rdma { >>>> @@ -173,6 +198,8 @@ extern int svc_rdma_post_recv(struct svcxprt_rdma *); >>>> extern int svc_rdma_create_listen(struct svc_serv *, int, struct sockaddr *); >>>> extern struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *); >>>> extern void svc_rdma_put_context(struct svc_rdma_op_ctxt *, int); >>>> +extern struct svc_rdma_req_map *svc_rdma_get_req_map(void); >>>> +extern void svc_rdma_put_req_map(struct svc_rdma_req_map *); >>>> extern void svc_sq_reap(struct svcxprt_rdma *); >>>> extern void svc_rq_reap(struct svcxprt_rdma *); >>>> extern struct svc_xprt_class svc_rdma_class; >>>> diff --git a/net/sunrpc/xprtrdma/svc_rdma.c b/net/sunrpc/xprtrdma/svc_rdma.c >>>> index 88c0ca2..545ea72 100644 >>>> --- a/net/sunrpc/xprtrdma/svc_rdma.c >>>> +++ b/net/sunrpc/xprtrdma/svc_rdma.c >>>> @@ -69,6 +69,9 @@ atomic_t rdma_stat_rq_prod; >>>> atomic_t rdma_stat_sq_poll; >>>> atomic_t rdma_stat_sq_prod; >>>> +/* Temporary NFS request map cache */ >>>> +struct kmem_cache *svc_rdma_map_cachep = NULL; >>>> >>>> >>> No need to initialize globals to NULL. >>> >>> >>> >> I thought only static objects were initialized per the C standard. Or >> are you saying that this particular >> global doesn't need to be initialized because of the way it is used? >> > > I don't know if the initialization is mandated by the standards or > whether it's just gcc behavior, but I know I get a complaint every time > somebody finds one of those.... > > In this case, the initialization is unnecessary, so it can be safely dumped. > That may partly just be a preference for conciseness, but I think it may > also allow gcc to put the thing in a different section and save some > space in the on-disk kernel--I don't know. > > Right -- un-initialized data goes in a different section, the .bss section in particular. Since the .bss section is not "initialized", there are no values (zeroes in this case) sitting in the object file ready to be mapped into whatever location becomes .bss. By contrast, the .data section contains initialized data and the initial values are sitting in the object file. So my question here is a little subtler: A. Do we discard _all_ zero initializations of non-static globals because we can safely assume that .bss is initialzed (not a fan of this), or B. Don't be an idiot and initialize objects unnecessarily because it bloats the kernel object file? An idiot voting for B, Tom > --b. > -- > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 01/11] svcrdma: Add a type for keeping NFS RPC mapping 2008-06-24 2:58 ` Tom Tucker @ 2008-06-24 19:58 ` J. Bruce Fields 2008-06-24 20:31 ` Benny Halevy 2008-06-24 20:38 ` Trond Myklebust 0 siblings, 2 replies; 20+ messages in thread From: J. Bruce Fields @ 2008-06-24 19:58 UTC (permalink / raw) To: Tom Tucker; +Cc: linux-nfs On Mon, Jun 23, 2008 at 09:58:21PM -0500, Tom Tucker wrote: > J. Bruce Fields wrote: >> On Sat, Jun 21, 2008 at 11:31:43AM -0500, Tom Tucker wrote: >> >>> J. Bruce Fields wrote: >>> >>>> On Thu, May 29, 2008 at 12:54:46PM -0500, Tom Tucker wrote: >>>>> diff --git a/net/sunrpc/xprtrdma/svc_rdma.c b/net/sunrpc/xprtrdma/svc_rdma.c >>>>> index 88c0ca2..545ea72 100644 >>>>> --- a/net/sunrpc/xprtrdma/svc_rdma.c >>>>> +++ b/net/sunrpc/xprtrdma/svc_rdma.c >>>>> @@ -69,6 +69,9 @@ atomic_t rdma_stat_rq_prod; >>>>> atomic_t rdma_stat_sq_poll; >>>>> atomic_t rdma_stat_sq_prod; >>>>> +/* Temporary NFS request map cache */ >>>>> +struct kmem_cache *svc_rdma_map_cachep = NULL; >>>>> >>>> No need to initialize globals to NULL. >>>> >>>> >>> I thought only static objects were initialized per the C standard. Or >>> are you saying that this particular >>> global doesn't need to be initialized because of the way it is used? >>> >> >> I don't know if the initialization is mandated by the standards or >> whether it's just gcc behavior, but I know I get a complaint every time >> somebody finds one of those.... >> >> > > In this case, the initialization is unnecessary, so it can be safely dumped. > >> That may partly just be a preference for conciseness, but I think it may >> also allow gcc to put the thing in a different section and save some >> space in the on-disk kernel--I don't know. >> >> > Right -- un-initialized data goes in a different section, the .bss > section in particular. Since the .bss section is not "initialized", > there are no values (zeroes in this case) sitting in the object file > ready to be mapped into whatever location becomes .bss. By contrast, the > .data section contains initialized data and the initial values are > sitting in the object file. > > So my question here is a little subtler: > > A. Do we discard _all_ zero initializations of non-static globals > because we can safely assume that .bss is initialzed (not a fan of > this), or > > B. Don't be an idiot and initialize objects unnecessarily because it > bloats the kernel object file? > > An idiot voting for B, My understanding is that it's A--e.g. checkpatch.pl has a dumb regex that just checks for these assignments and whines about them however they're used. Google doesn't find me any more detailed discussion of the policy. --b. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 01/11] svcrdma: Add a type for keeping NFS RPC mapping 2008-06-24 19:58 ` J. Bruce Fields @ 2008-06-24 20:31 ` Benny Halevy 2008-06-24 20:38 ` Trond Myklebust 1 sibling, 0 replies; 20+ messages in thread From: Benny Halevy @ 2008-06-24 20:31 UTC (permalink / raw) To: J. Bruce Fields, Tom Tucker; +Cc: linux-nfs J. Bruce Fields wrote: > On Mon, Jun 23, 2008 at 09:58:21PM -0500, Tom Tucker wrote: >> J. Bruce Fields wrote: >>> On Sat, Jun 21, 2008 at 11:31:43AM -0500, Tom Tucker wrote: >>> >>>> J. Bruce Fields wrote: >>>> >>>>> On Thu, May 29, 2008 at 12:54:46PM -0500, Tom Tucker wrote: >>>>>> diff --git a/net/sunrpc/xprtrdma/svc_rdma.c b/net/sunrpc/xprtrdma/svc_rdma.c >>>>>> index 88c0ca2..545ea72 100644 >>>>>> --- a/net/sunrpc/xprtrdma/svc_rdma.c >>>>>> +++ b/net/sunrpc/xprtrdma/svc_rdma.c >>>>>> @@ -69,6 +69,9 @@ atomic_t rdma_stat_rq_prod; >>>>>> atomic_t rdma_stat_sq_poll; >>>>>> atomic_t rdma_stat_sq_prod; >>>>>> +/* Temporary NFS request map cache */ >>>>>> +struct kmem_cache *svc_rdma_map_cachep = NULL; >>>>>> >>>>> No need to initialize globals to NULL. >>>>> >>>>> >>>> I thought only static objects were initialized per the C standard. Or >>>> are you saying that this particular >>>> global doesn't need to be initialized because of the way it is used? >>>> >>> I don't know if the initialization is mandated by the standards or >>> whether it's just gcc behavior, but I know I get a complaint every time >>> somebody finds one of those.... The implicit initialization to zero was defined back then by K&R: http://www.cs.utah.edu/~phister/K_n_R/chapter4.html#s4.9 "In the absence of explicit initialization, external and static variables are guaranteed to be initialized to zero" AFAIK this convention was kept by the C standards ever since. >>> >>> >> In this case, the initialization is unnecessary, so it can be safely dumped. >> >>> That may partly just be a preference for conciseness, but I think it may >>> also allow gcc to put the thing in a different section and save some >>> space in the on-disk kernel--I don't know. >>> >>> >> Right -- un-initialized data goes in a different section, the .bss >> section in particular. Since the .bss section is not "initialized", >> there are no values (zeroes in this case) sitting in the object file >> ready to be mapped into whatever location becomes .bss. By contrast, the >> .data section contains initialized data and the initial values are >> sitting in the object file. >> >> So my question here is a little subtler: >> >> A. Do we discard _all_ zero initializations of non-static globals >> because we can safely assume that .bss is initialzed (not a fan of >> this), or Also static locals... What matters is that the variable has static allocation, its scope is irrelevant to the initialization issue. Benny >> >> B. Don't be an idiot and initialize objects unnecessarily because it >> bloats the kernel object file? >> >> An idiot voting for B, > > My understanding is that it's A--e.g. checkpatch.pl has a dumb regex > that just checks for these assignments and whines about them however > they're used. > > Google doesn't find me any more detailed discussion of the policy. > > --b. > -- > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 01/11] svcrdma: Add a type for keeping NFS RPC mapping 2008-06-24 19:58 ` J. Bruce Fields 2008-06-24 20:31 ` Benny Halevy @ 2008-06-24 20:38 ` Trond Myklebust 1 sibling, 0 replies; 20+ messages in thread From: Trond Myklebust @ 2008-06-24 20:38 UTC (permalink / raw) To: J. Bruce Fields; +Cc: Tom Tucker, linux-nfs On Tue, 2008-06-24 at 15:58 -0400, J. Bruce Fields wrote: > My understanding is that it's A--e.g. checkpatch.pl has a dumb regex > that just checks for these assignments and whines about them however > they're used. > > Google doesn't find me any more detailed discussion of the policy. We had that discussion about 10 years ago. What followed was a general purge of all zero initialisations of global and static variables in the kernel, and the official policy is that initialising those variables is wrong, and should _not_ be done. The reason is that BSS is automatically zeroed out, and so adding an initialiser causes unnecessary code bloat and slows down the boot process. Trond ^ permalink raw reply [flat|nested] 20+ messages in thread
[parent not found: <12120836963727-git-send-email-tom@opengridcomputing.com>]
* Re: [PATCH 02/11] svcrdma: Use RPC reply map for RDMA_WRITE processing [not found] ` <12120836963727-git-send-email-tom@opengridcomputing.com> @ 2008-06-16 21:04 ` J. Bruce Fields 2008-06-21 16:26 ` Tom Tucker 2008-06-21 16:51 ` Tom Tucker [not found] ` <1212083697950-git-send-email-tom@opengridcomputing.com> 1 sibling, 2 replies; 20+ messages in thread From: J. Bruce Fields @ 2008-06-16 21:04 UTC (permalink / raw) To: Tom Tucker; +Cc: linux-nfs On Thu, May 29, 2008 at 12:54:47PM -0500, Tom Tucker wrote: > Use the new svc_rdma_req_map data type for mapping the client side memory > to the server side memory. Move the DMA mapping to the context pointed to > by each WR individually so that it is unmapped after the WR completes. > > Signed-off-by: Tom Tucker <tom@opengridcomputing.com> > > --- > net/sunrpc/xprtrdma/svc_rdma_sendto.c | 122 ++++++++++++++---------------- > net/sunrpc/xprtrdma/svc_rdma_transport.c | 5 +- > 2 files changed, 62 insertions(+), 65 deletions(-) > > diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c > index fb82b1b..7cd65b7 100644 > --- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c > +++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c > @@ -64,10 +64,9 @@ > * SGE[sge_count-1] data from xdr->tail. > * > */ > -static struct ib_sge *xdr_to_sge(struct svcxprt_rdma *xprt, > - struct xdr_buf *xdr, > - struct ib_sge *sge, > - int *sge_count) > +static void xdr_to_sge(struct svcxprt_rdma *xprt, > + struct xdr_buf *xdr, > + struct svc_rdma_req_map *vec) > { > /* Max we need is the length of the XDR / pagesize + one for > * head + one for tail + one for RPCRDMA header > @@ -84,14 +83,10 @@ static struct ib_sge *xdr_to_sge(struct svcxprt_rdma *xprt, > sge_no = 1; Should the arrays in the svc_rdma_req_map actually be size RPC_SVC_MAXPAGES+1 to allow for the RPCRDMA header? Also, the only caller of xdr_to_sge() appears to be svc_rdma_sendto(), which is called from svc_sendto() immediately after setting: xb = &rqstp->rq_res; xb->len = xb->head[0].iov_len + xb->page_len + xb->tail[0].iov_len; So I think xdr_to_sge() is doing a bunch of unnecessary work to handle the case where xdr->len could be less than that sum? > > /* Head SGE */ > - sge[sge_no].addr = ib_dma_map_single(xprt->sc_cm_id->device, > - xdr->head[0].iov_base, > - xdr->head[0].iov_len, > - DMA_TO_DEVICE); > + vec->sge[sge_no].iov_base = xdr->head[0].iov_base; > sge_bytes = min_t(u32, byte_count, xdr->head[0].iov_len); > byte_count -= sge_bytes; > - sge[sge_no].length = sge_bytes; > - sge[sge_no].lkey = xprt->sc_phys_mr->lkey; > + vec->sge[sge_no].iov_len = sge_bytes; > sge_no++; > > /* pages SGE */ > @@ -99,16 +94,13 @@ static struct ib_sge *xdr_to_sge(struct svcxprt_rdma *xprt, > page_bytes = xdr->page_len; > page_off = xdr->page_base; > while (byte_count && page_bytes) { > + vec->sge[sge_no].iov_base = > + page_address(xdr->pages[page_no]) + page_off; > sge_bytes = min_t(u32, byte_count, (PAGE_SIZE-page_off)); > - sge[sge_no].addr = > - ib_dma_map_page(xprt->sc_cm_id->device, > - xdr->pages[page_no], page_off, > - sge_bytes, DMA_TO_DEVICE); > sge_bytes = min(sge_bytes, page_bytes); > byte_count -= sge_bytes; > page_bytes -= sge_bytes; > - sge[sge_no].length = sge_bytes; > - sge[sge_no].lkey = xprt->sc_phys_mr->lkey; > + vec->sge[sge_no].iov_len = sge_bytes; > > sge_no++; > page_no++; > @@ -117,23 +109,17 @@ static struct ib_sge *xdr_to_sge(struct svcxprt_rdma *xprt, > > /* Tail SGE */ > if (byte_count && xdr->tail[0].iov_len) { > - sge[sge_no].addr = > - ib_dma_map_single(xprt->sc_cm_id->device, > - xdr->tail[0].iov_base, > - xdr->tail[0].iov_len, > - DMA_TO_DEVICE); > + vec->sge[sge_no].iov_base = xdr->tail[0].iov_base; > sge_bytes = min_t(u32, byte_count, xdr->tail[0].iov_len); > byte_count -= sge_bytes; > - sge[sge_no].length = sge_bytes; > - sge[sge_no].lkey = xprt->sc_phys_mr->lkey; > + vec->sge[sge_no].iov_len = sge_bytes; > sge_no++; > } > > BUG_ON(sge_no > sge_max); > BUG_ON(byte_count != 0); > > - *sge_count = sge_no; > - return sge; > + vec->count = sge_no; > } > > > @@ -143,9 +129,8 @@ static struct ib_sge *xdr_to_sge(struct svcxprt_rdma *xprt, > static int send_write(struct svcxprt_rdma *xprt, struct svc_rqst *rqstp, > u32 rmr, u64 to, > u32 xdr_off, int write_len, > - struct ib_sge *xdr_sge, int sge_count) > + struct svc_rdma_req_map *vec) > { > - struct svc_rdma_op_ctxt *tmp_sge_ctxt; > struct ib_send_wr write_wr; > struct ib_sge *sge; > int xdr_sge_no; > @@ -156,23 +141,22 @@ static int send_write(struct svcxprt_rdma *xprt, struct svc_rqst *rqstp, > struct svc_rdma_op_ctxt *ctxt; > int ret = 0; > > - BUG_ON(sge_count > RPCSVC_MAXPAGES); > + BUG_ON(vec->count > RPCSVC_MAXPAGES); > dprintk("svcrdma: RDMA_WRITE rmr=%x, to=%llx, xdr_off=%d, " > - "write_len=%d, xdr_sge=%p, sge_count=%d\n", > + "write_len=%d, vec->sge=%p, vec->count=%lu\n", > rmr, (unsigned long long)to, xdr_off, > - write_len, xdr_sge, sge_count); > + write_len, vec->sge, vec->count); > > ctxt = svc_rdma_get_context(xprt); > - ctxt->count = 0; > - tmp_sge_ctxt = svc_rdma_get_context(xprt); > - sge = tmp_sge_ctxt->sge; > + ctxt->direction = DMA_TO_DEVICE; > + sge = ctxt->sge; > > /* Find the SGE associated with xdr_off */ > - for (bc = xdr_off, xdr_sge_no = 1; bc && xdr_sge_no < sge_count; > + for (bc = xdr_off, xdr_sge_no = 1; bc && xdr_sge_no < vec->count; > xdr_sge_no++) { > - if (xdr_sge[xdr_sge_no].length > bc) > + if (vec->sge[xdr_sge_no].iov_len > bc) > break; > - bc -= xdr_sge[xdr_sge_no].length; > + bc -= vec->sge[xdr_sge_no].iov_len; > } > > sge_off = bc; > @@ -180,21 +164,27 @@ static int send_write(struct svcxprt_rdma *xprt, struct svc_rqst *rqstp, > sge_no = 0; > > /* Copy the remaining SGE */ > - while (bc != 0 && xdr_sge_no < sge_count) { > - sge[sge_no].addr = xdr_sge[xdr_sge_no].addr + sge_off; > - sge[sge_no].lkey = xdr_sge[xdr_sge_no].lkey; > + while (bc != 0 && xdr_sge_no < vec->count) { > + sge[sge_no].lkey = xprt->sc_phys_mr->lkey; > sge_bytes = min((size_t)bc, > - (size_t)(xdr_sge[xdr_sge_no].length-sge_off)); > + (size_t)(vec->sge[xdr_sge_no].iov_len-sge_off)); > sge[sge_no].length = sge_bytes; > - > + sge[sge_no].addr = > + ib_dma_map_single(xprt->sc_cm_id->device, > + (void *) > + vec->sge[xdr_sge_no].iov_base + sge_off, > + sge_bytes, DMA_TO_DEVICE); > + if (dma_mapping_error(sge[sge_no].addr)) > + return -EINVAL; We leak a svc_rdma_op_ctxt on error here. --b. > sge_off = 0; > sge_no++; > + ctxt->count++; > xdr_sge_no++; > bc -= sge_bytes; > } > > BUG_ON(bc != 0); > - BUG_ON(xdr_sge_no > sge_count); > + BUG_ON(xdr_sge_no > vec->count); > > /* Prepare WRITE WR */ > memset(&write_wr, 0, sizeof write_wr); > @@ -210,11 +200,10 @@ static int send_write(struct svcxprt_rdma *xprt, struct svc_rqst *rqstp, > /* Post It */ > atomic_inc(&rdma_stat_write); > if (svc_rdma_send(xprt, &write_wr)) { > - svc_rdma_put_context(ctxt, 1); > + svc_rdma_put_context(ctxt, 0); > /* Fatal error, close transport */ > ret = -EIO; > } > - svc_rdma_put_context(tmp_sge_ctxt, 0); > return ret; > } > > @@ -222,8 +211,7 @@ static int send_write_chunks(struct svcxprt_rdma *xprt, > struct rpcrdma_msg *rdma_argp, > struct rpcrdma_msg *rdma_resp, > struct svc_rqst *rqstp, > - struct ib_sge *sge, > - int sge_count) > + struct svc_rdma_req_map *vec) > { > u32 xfer_len = rqstp->rq_res.page_len + rqstp->rq_res.tail[0].iov_len; > int write_len; > @@ -269,8 +257,7 @@ static int send_write_chunks(struct svcxprt_rdma *xprt, > rs_offset + chunk_off, > xdr_off, > this_write, > - sge, > - sge_count); > + vec); > if (ret) { > dprintk("svcrdma: RDMA_WRITE failed, ret=%d\n", > ret); > @@ -292,8 +279,7 @@ static int send_reply_chunks(struct svcxprt_rdma *xprt, > struct rpcrdma_msg *rdma_argp, > struct rpcrdma_msg *rdma_resp, > struct svc_rqst *rqstp, > - struct ib_sge *sge, > - int sge_count) > + struct svc_rdma_req_map *vec) > { > u32 xfer_len = rqstp->rq_res.len; > int write_len; > @@ -341,8 +327,7 @@ static int send_reply_chunks(struct svcxprt_rdma *xprt, > rs_offset + chunk_off, > xdr_off, > this_write, > - sge, > - sge_count); > + vec); > if (ret) { > dprintk("svcrdma: RDMA_WRITE failed, ret=%d\n", > ret); > @@ -380,7 +365,7 @@ static int send_reply(struct svcxprt_rdma *rdma, > struct page *page, > struct rpcrdma_msg *rdma_resp, > struct svc_rdma_op_ctxt *ctxt, > - int sge_count, > + struct svc_rdma_req_map *vec, > int byte_count) > { > struct ib_send_wr send_wr; > @@ -413,10 +398,15 @@ static int send_reply(struct svcxprt_rdma *rdma, > ctxt->sge[0].lkey = rdma->sc_phys_mr->lkey; > > /* Determine how many of our SGE are to be transmitted */ > - for (sge_no = 1; byte_count && sge_no < sge_count; sge_no++) { > - sge_bytes = min((size_t)ctxt->sge[sge_no].length, > - (size_t)byte_count); > + for (sge_no = 1; byte_count && sge_no < vec->count; sge_no++) { > + sge_bytes = min_t(size_t, vec->sge[sge_no].iov_len, byte_count); > byte_count -= sge_bytes; > + ctxt->sge[sge_no].addr = > + ib_dma_map_single(rdma->sc_cm_id->device, > + vec->sge[sge_no].iov_base, > + sge_bytes, DMA_TO_DEVICE); > + ctxt->sge[sge_no].length = sge_bytes; > + ctxt->sge[sge_no].lkey = rdma->sc_phys_mr->lkey; > } > BUG_ON(byte_count != 0); > > @@ -428,8 +418,10 @@ static int send_reply(struct svcxprt_rdma *rdma, > ctxt->pages[page_no+1] = rqstp->rq_respages[page_no]; > ctxt->count++; > rqstp->rq_respages[page_no] = NULL; > + /* If there are more pages than SGE, terminate SGE list */ > + if (page_no+1 >= sge_no) > + ctxt->sge[page_no+1].length = 0; > } > - > BUG_ON(sge_no > rdma->sc_max_sge); > memset(&send_wr, 0, sizeof send_wr); > ctxt->wr_op = IB_WR_SEND; > @@ -473,20 +465,20 @@ int svc_rdma_sendto(struct svc_rqst *rqstp) > enum rpcrdma_proc reply_type; > int ret; > int inline_bytes; > - struct ib_sge *sge; > - int sge_count = 0; > struct page *res_page; > struct svc_rdma_op_ctxt *ctxt; > + struct svc_rdma_req_map *vec; > > dprintk("svcrdma: sending response for rqstp=%p\n", rqstp); > > /* Get the RDMA request header. */ > rdma_argp = xdr_start(&rqstp->rq_arg); > > - /* Build an SGE for the XDR */ > + /* Build an req vec for the XDR */ > ctxt = svc_rdma_get_context(rdma); > ctxt->direction = DMA_TO_DEVICE; > - sge = xdr_to_sge(rdma, &rqstp->rq_res, ctxt->sge, &sge_count); > + vec = svc_rdma_get_req_map(); > + xdr_to_sge(rdma, &rqstp->rq_res, vec); > > inline_bytes = rqstp->rq_res.len; > > @@ -503,7 +495,7 @@ int svc_rdma_sendto(struct svc_rqst *rqstp) > > /* Send any write-chunk data and build resp write-list */ > ret = send_write_chunks(rdma, rdma_argp, rdma_resp, > - rqstp, sge, sge_count); > + rqstp, vec); > if (ret < 0) { > printk(KERN_ERR "svcrdma: failed to send write chunks, rc=%d\n", > ret); > @@ -513,7 +505,7 @@ int svc_rdma_sendto(struct svc_rqst *rqstp) > > /* Send any reply-list data and update resp reply-list */ > ret = send_reply_chunks(rdma, rdma_argp, rdma_resp, > - rqstp, sge, sge_count); > + rqstp, vec); > if (ret < 0) { > printk(KERN_ERR "svcrdma: failed to send reply chunks, rc=%d\n", > ret); > @@ -521,11 +513,13 @@ int svc_rdma_sendto(struct svc_rqst *rqstp) > } > inline_bytes -= ret; > > - ret = send_reply(rdma, rqstp, res_page, rdma_resp, ctxt, sge_count, > + ret = send_reply(rdma, rqstp, res_page, rdma_resp, ctxt, vec, > inline_bytes); > + svc_rdma_put_req_map(vec); > dprintk("svcrdma: send_reply returns %d\n", ret); > return ret; > error: > + svc_rdma_put_req_map(vec); > svc_rdma_put_context(ctxt, 0); > put_page(res_page); > return ret; > diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c > index ae90758..fc86338 100644 > --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c > +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c > @@ -387,10 +387,13 @@ static void sq_cq_reap(struct svcxprt_rdma *xprt) > > switch (ctxt->wr_op) { > case IB_WR_SEND: > - case IB_WR_RDMA_WRITE: > svc_rdma_put_context(ctxt, 1); > break; > > + case IB_WR_RDMA_WRITE: > + svc_rdma_put_context(ctxt, 0); > + break; > + > case IB_WR_RDMA_READ: > if (test_bit(RDMACTXT_F_LAST_CTXT, &ctxt->flags)) { > struct svc_rdma_op_ctxt *read_hdr = ctxt->read_hdr; ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 02/11] svcrdma: Use RPC reply map for RDMA_WRITE processing 2008-06-16 21:04 ` [PATCH 02/11] svcrdma: Use RPC reply map for RDMA_WRITE processing J. Bruce Fields @ 2008-06-21 16:26 ` Tom Tucker 2008-06-23 18:21 ` J. Bruce Fields 2008-06-21 16:51 ` Tom Tucker 1 sibling, 1 reply; 20+ messages in thread From: Tom Tucker @ 2008-06-21 16:26 UTC (permalink / raw) To: J. Bruce Fields; +Cc: linux-nfs J. Bruce Fields wrote: > On Thu, May 29, 2008 at 12:54:47PM -0500, Tom Tucker wrote: > >> Use the new svc_rdma_req_map data type for mapping the client side memory >> to the server side memory. Move the DMA mapping to the context pointed to >> by each WR individually so that it is unmapped after the WR completes. >> >> Signed-off-by: Tom Tucker <tom@opengridcomputing.com> >> >> --- >> net/sunrpc/xprtrdma/svc_rdma_sendto.c | 122 ++++++++++++++---------------- >> net/sunrpc/xprtrdma/svc_rdma_transport.c | 5 +- >> 2 files changed, 62 insertions(+), 65 deletions(-) >> >> diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c >> index fb82b1b..7cd65b7 100644 >> --- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c >> +++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c >> @@ -64,10 +64,9 @@ >> * SGE[sge_count-1] data from xdr->tail. >> * >> */ >> -static struct ib_sge *xdr_to_sge(struct svcxprt_rdma *xprt, >> - struct xdr_buf *xdr, >> - struct ib_sge *sge, >> - int *sge_count) >> +static void xdr_to_sge(struct svcxprt_rdma *xprt, >> + struct xdr_buf *xdr, >> + struct svc_rdma_req_map *vec) >> { >> /* Max we need is the length of the XDR / pagesize + one for >> * head + one for tail + one for RPCRDMA header >> @@ -84,14 +83,10 @@ static struct ib_sge *xdr_to_sge(struct svcxprt_rdma *xprt, >> sge_no = 1; >> > > Should the arrays in the svc_rdma_req_map actually be size > RPC_SVC_MAXPAGES+1 to allow for the RPCRDMA header? > Yes, I think this is a bug. I'm going to clamp max data in a different way in an upcomeing patch. So this will change in a subsequent patch. > Also, the only caller of xdr_to_sge() appears to be svc_rdma_sendto(), > which is called from svc_sendto() immediately after setting: > > xb = &rqstp->rq_res; > xb->len = xb->head[0].iov_len + > xb->page_len + > xb->tail[0].iov_len; > > So I think xdr_to_sge() is doing a bunch of unnecessary work to handle > the case where xdr->len could be less than that sum? > > Ok... check me below please. >> >> /* Head SGE */ >> - sge[sge_no].addr = ib_dma_map_single(xprt->sc_cm_id->device, >> - xdr->head[0].iov_base, >> - xdr->head[0].iov_len, >> - DMA_TO_DEVICE); >> + vec->sge[sge_no].iov_base = xdr->head[0].iov_base; >> sge_bytes = min_t(u32, byte_count, xdr->head[0].iov_len); >> This doesn't need to be min_t. It could be xdr->head[0].iov_len. >> byte_count -= sge_bytes; >> - sge[sge_no].length = sge_bytes; >> - sge[sge_no].lkey = xprt->sc_phys_mr->lkey; >> + vec->sge[sge_no].iov_len = sge_bytes; >> sge_no++; >> >> /* pages SGE */ >> @@ -99,16 +94,13 @@ static struct ib_sge *xdr_to_sge(struct svcxprt_rdma *xprt, >> page_bytes = xdr->page_len; >> page_off = xdr->page_base; >> while (byte_count && page_bytes) { >> + vec->sge[sge_no].iov_base = >> + page_address(xdr->pages[page_no]) + page_off; >> sge_bytes = min_t(u32, byte_count, (PAGE_SIZE-page_off)); >> This is still needed if the xdr terminates with a portion of a page and there is not tail. >> - sge[sge_no].addr = >> - ib_dma_map_page(xprt->sc_cm_id->device, >> - xdr->pages[page_no], page_off, >> - sge_bytes, DMA_TO_DEVICE); >> sge_bytes = min(sge_bytes, page_bytes); >> byte_count -= sge_bytes; >> page_bytes -= sge_bytes; >> - sge[sge_no].length = sge_bytes; >> - sge[sge_no].lkey = xprt->sc_phys_mr->lkey; >> + vec->sge[sge_no].iov_len = sge_bytes; >> >> sge_no++; >> page_no++; >> @@ -117,23 +109,17 @@ static struct ib_sge *xdr_to_sge(struct svcxprt_rdma *xprt, >> >> /* Tail SGE */ >> if (byte_count && xdr->tail[0].iov_len) { >> This is defensive. It could just be byte_count. If byte_count != xdr->tail[0].iov_len, the BUG_ON below will let us know. >> - sge[sge_no].addr = >> - ib_dma_map_single(xprt->sc_cm_id->device, >> - xdr->tail[0].iov_base, >> - xdr->tail[0].iov_len, >> - DMA_TO_DEVICE); >> + vec->sge[sge_no].iov_base = xdr->tail[0].iov_base; >> sge_bytes = min_t(u32, byte_count, xdr->tail[0].iov_len); >> The min_t isn't needed. It could just be byte_count. >> byte_count -= sge_bytes; >> - sge[sge_no].length = sge_bytes; >> - sge[sge_no].lkey = xprt->sc_phys_mr->lkey; >> + vec->sge[sge_no].iov_len = sge_bytes; >> sge_no++; >> } >> >> BUG_ON(sge_no > sge_max); >> BUG_ON(byte_count != 0); >> >> - *sge_count = sge_no; >> - return sge; >> + vec->count = sge_no; >> } >> >> >> @@ -143,9 +129,8 @@ static struct ib_sge *xdr_to_sge(struct svcxprt_rdma *xprt, >> static int send_write(struct svcxprt_rdma *xprt, struct svc_rqst *rqstp, >> u32 rmr, u64 to, >> u32 xdr_off, int write_len, >> - struct ib_sge *xdr_sge, int sge_count) >> + struct svc_rdma_req_map *vec) >> { >> - struct svc_rdma_op_ctxt *tmp_sge_ctxt; >> struct ib_send_wr write_wr; >> struct ib_sge *sge; >> int xdr_sge_no; >> @@ -156,23 +141,22 @@ static int send_write(struct svcxprt_rdma *xprt, struct svc_rqst *rqstp, >> struct svc_rdma_op_ctxt *ctxt; >> int ret = 0; >> >> - BUG_ON(sge_count > RPCSVC_MAXPAGES); >> + BUG_ON(vec->count > RPCSVC_MAXPAGES); >> dprintk("svcrdma: RDMA_WRITE rmr=%x, to=%llx, xdr_off=%d, " >> - "write_len=%d, xdr_sge=%p, sge_count=%d\n", >> + "write_len=%d, vec->sge=%p, vec->count=%lu\n", >> rmr, (unsigned long long)to, xdr_off, >> - write_len, xdr_sge, sge_count); >> + write_len, vec->sge, vec->count); >> >> ctxt = svc_rdma_get_context(xprt); >> - ctxt->count = 0; >> - tmp_sge_ctxt = svc_rdma_get_context(xprt); >> - sge = tmp_sge_ctxt->sge; >> + ctxt->direction = DMA_TO_DEVICE; >> + sge = ctxt->sge; >> >> /* Find the SGE associated with xdr_off */ >> - for (bc = xdr_off, xdr_sge_no = 1; bc && xdr_sge_no < sge_count; >> + for (bc = xdr_off, xdr_sge_no = 1; bc && xdr_sge_no < vec->count; >> xdr_sge_no++) { >> - if (xdr_sge[xdr_sge_no].length > bc) >> + if (vec->sge[xdr_sge_no].iov_len > bc) >> break; >> - bc -= xdr_sge[xdr_sge_no].length; >> + bc -= vec->sge[xdr_sge_no].iov_len; >> } >> >> sge_off = bc; >> @@ -180,21 +164,27 @@ static int send_write(struct svcxprt_rdma *xprt, struct svc_rqst *rqstp, >> sge_no = 0; >> >> /* Copy the remaining SGE */ >> - while (bc != 0 && xdr_sge_no < sge_count) { >> - sge[sge_no].addr = xdr_sge[xdr_sge_no].addr + sge_off; >> - sge[sge_no].lkey = xdr_sge[xdr_sge_no].lkey; >> + while (bc != 0 && xdr_sge_no < vec->count) { >> + sge[sge_no].lkey = xprt->sc_phys_mr->lkey; >> sge_bytes = min((size_t)bc, >> - (size_t)(xdr_sge[xdr_sge_no].length-sge_off)); >> + (size_t)(vec->sge[xdr_sge_no].iov_len-sge_off)); >> sge[sge_no].length = sge_bytes; >> - >> + sge[sge_no].addr = >> + ib_dma_map_single(xprt->sc_cm_id->device, >> + (void *) >> + vec->sge[xdr_sge_no].iov_base + sge_off, >> + sge_bytes, DMA_TO_DEVICE); >> + if (dma_mapping_error(sge[sge_no].addr)) >> + return -EINVAL; >> > > We leak a svc_rdma_op_ctxt on error here. > Ok. > --b. > > >> sge_off = 0; >> sge_no++; >> + ctxt->count++; >> xdr_sge_no++; >> bc -= sge_bytes; >> } >> >> BUG_ON(bc != 0); >> - BUG_ON(xdr_sge_no > sge_count); >> + BUG_ON(xdr_sge_no > vec->count); >> >> /* Prepare WRITE WR */ >> memset(&write_wr, 0, sizeof write_wr); >> @@ -210,11 +200,10 @@ static int send_write(struct svcxprt_rdma *xprt, struct svc_rqst *rqstp, >> /* Post It */ >> atomic_inc(&rdma_stat_write); >> if (svc_rdma_send(xprt, &write_wr)) { >> - svc_rdma_put_context(ctxt, 1); >> + svc_rdma_put_context(ctxt, 0); >> /* Fatal error, close transport */ >> ret = -EIO; >> } >> - svc_rdma_put_context(tmp_sge_ctxt, 0); >> return ret; >> } >> >> @@ -222,8 +211,7 @@ static int send_write_chunks(struct svcxprt_rdma *xprt, >> struct rpcrdma_msg *rdma_argp, >> struct rpcrdma_msg *rdma_resp, >> struct svc_rqst *rqstp, >> - struct ib_sge *sge, >> - int sge_count) >> + struct svc_rdma_req_map *vec) >> { >> u32 xfer_len = rqstp->rq_res.page_len + rqstp->rq_res.tail[0].iov_len; >> int write_len; >> @@ -269,8 +257,7 @@ static int send_write_chunks(struct svcxprt_rdma *xprt, >> rs_offset + chunk_off, >> xdr_off, >> this_write, >> - sge, >> - sge_count); >> + vec); >> if (ret) { >> dprintk("svcrdma: RDMA_WRITE failed, ret=%d\n", >> ret); >> @@ -292,8 +279,7 @@ static int send_reply_chunks(struct svcxprt_rdma *xprt, >> struct rpcrdma_msg *rdma_argp, >> struct rpcrdma_msg *rdma_resp, >> struct svc_rqst *rqstp, >> - struct ib_sge *sge, >> - int sge_count) >> + struct svc_rdma_req_map *vec) >> { >> u32 xfer_len = rqstp->rq_res.len; >> int write_len; >> @@ -341,8 +327,7 @@ static int send_reply_chunks(struct svcxprt_rdma *xprt, >> rs_offset + chunk_off, >> xdr_off, >> this_write, >> - sge, >> - sge_count); >> + vec); >> if (ret) { >> dprintk("svcrdma: RDMA_WRITE failed, ret=%d\n", >> ret); >> @@ -380,7 +365,7 @@ static int send_reply(struct svcxprt_rdma *rdma, >> struct page *page, >> struct rpcrdma_msg *rdma_resp, >> struct svc_rdma_op_ctxt *ctxt, >> - int sge_count, >> + struct svc_rdma_req_map *vec, >> int byte_count) >> { >> struct ib_send_wr send_wr; >> @@ -413,10 +398,15 @@ static int send_reply(struct svcxprt_rdma *rdma, >> ctxt->sge[0].lkey = rdma->sc_phys_mr->lkey; >> >> /* Determine how many of our SGE are to be transmitted */ >> - for (sge_no = 1; byte_count && sge_no < sge_count; sge_no++) { >> - sge_bytes = min((size_t)ctxt->sge[sge_no].length, >> - (size_t)byte_count); >> + for (sge_no = 1; byte_count && sge_no < vec->count; sge_no++) { >> + sge_bytes = min_t(size_t, vec->sge[sge_no].iov_len, byte_count); >> byte_count -= sge_bytes; >> + ctxt->sge[sge_no].addr = >> + ib_dma_map_single(rdma->sc_cm_id->device, >> + vec->sge[sge_no].iov_base, >> + sge_bytes, DMA_TO_DEVICE); >> + ctxt->sge[sge_no].length = sge_bytes; >> + ctxt->sge[sge_no].lkey = rdma->sc_phys_mr->lkey; >> } >> BUG_ON(byte_count != 0); >> >> @@ -428,8 +418,10 @@ static int send_reply(struct svcxprt_rdma *rdma, >> ctxt->pages[page_no+1] = rqstp->rq_respages[page_no]; >> ctxt->count++; >> rqstp->rq_respages[page_no] = NULL; >> + /* If there are more pages than SGE, terminate SGE list */ >> + if (page_no+1 >= sge_no) >> + ctxt->sge[page_no+1].length = 0; >> } >> - >> BUG_ON(sge_no > rdma->sc_max_sge); >> memset(&send_wr, 0, sizeof send_wr); >> ctxt->wr_op = IB_WR_SEND; >> @@ -473,20 +465,20 @@ int svc_rdma_sendto(struct svc_rqst *rqstp) >> enum rpcrdma_proc reply_type; >> int ret; >> int inline_bytes; >> - struct ib_sge *sge; >> - int sge_count = 0; >> struct page *res_page; >> struct svc_rdma_op_ctxt *ctxt; >> + struct svc_rdma_req_map *vec; >> >> dprintk("svcrdma: sending response for rqstp=%p\n", rqstp); >> >> /* Get the RDMA request header. */ >> rdma_argp = xdr_start(&rqstp->rq_arg); >> >> - /* Build an SGE for the XDR */ >> + /* Build an req vec for the XDR */ >> ctxt = svc_rdma_get_context(rdma); >> ctxt->direction = DMA_TO_DEVICE; >> - sge = xdr_to_sge(rdma, &rqstp->rq_res, ctxt->sge, &sge_count); >> + vec = svc_rdma_get_req_map(); >> + xdr_to_sge(rdma, &rqstp->rq_res, vec); >> >> inline_bytes = rqstp->rq_res.len; >> >> @@ -503,7 +495,7 @@ int svc_rdma_sendto(struct svc_rqst *rqstp) >> >> /* Send any write-chunk data and build resp write-list */ >> ret = send_write_chunks(rdma, rdma_argp, rdma_resp, >> - rqstp, sge, sge_count); >> + rqstp, vec); >> if (ret < 0) { >> printk(KERN_ERR "svcrdma: failed to send write chunks, rc=%d\n", >> ret); >> @@ -513,7 +505,7 @@ int svc_rdma_sendto(struct svc_rqst *rqstp) >> >> /* Send any reply-list data and update resp reply-list */ >> ret = send_reply_chunks(rdma, rdma_argp, rdma_resp, >> - rqstp, sge, sge_count); >> + rqstp, vec); >> if (ret < 0) { >> printk(KERN_ERR "svcrdma: failed to send reply chunks, rc=%d\n", >> ret); >> @@ -521,11 +513,13 @@ int svc_rdma_sendto(struct svc_rqst *rqstp) >> } >> inline_bytes -= ret; >> >> - ret = send_reply(rdma, rqstp, res_page, rdma_resp, ctxt, sge_count, >> + ret = send_reply(rdma, rqstp, res_page, rdma_resp, ctxt, vec, >> inline_bytes); >> + svc_rdma_put_req_map(vec); >> dprintk("svcrdma: send_reply returns %d\n", ret); >> return ret; >> error: >> + svc_rdma_put_req_map(vec); >> svc_rdma_put_context(ctxt, 0); >> put_page(res_page); >> return ret; >> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c >> index ae90758..fc86338 100644 >> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c >> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c >> @@ -387,10 +387,13 @@ static void sq_cq_reap(struct svcxprt_rdma *xprt) >> >> switch (ctxt->wr_op) { >> case IB_WR_SEND: >> - case IB_WR_RDMA_WRITE: >> svc_rdma_put_context(ctxt, 1); >> break; >> >> + case IB_WR_RDMA_WRITE: >> + svc_rdma_put_context(ctxt, 0); >> + break; >> + >> case IB_WR_RDMA_READ: >> if (test_bit(RDMACTXT_F_LAST_CTXT, &ctxt->flags)) { >> struct svc_rdma_op_ctxt *read_hdr = ctxt->read_hdr; >> > -- > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 02/11] svcrdma: Use RPC reply map for RDMA_WRITE processing 2008-06-21 16:26 ` Tom Tucker @ 2008-06-23 18:21 ` J. Bruce Fields 2008-06-24 2:29 ` Tom Tucker 0 siblings, 1 reply; 20+ messages in thread From: J. Bruce Fields @ 2008-06-23 18:21 UTC (permalink / raw) To: Tom Tucker; +Cc: linux-nfs On Sat, Jun 21, 2008 at 11:26:52AM -0500, Tom Tucker wrote: > J. Bruce Fields wrote: >> Also, the only caller of xdr_to_sge() appears to be svc_rdma_sendto(), >> which is called from svc_sendto() immediately after setting: >> >> xb = &rqstp->rq_res; >> xb->len = xb->head[0].iov_len + xb->page_len + >> xb->tail[0].iov_len; >> >> So I think xdr_to_sge() is doing a bunch of unnecessary work to handle >> the case where xdr->len could be less than that sum? >> >> > Ok... check me below please. >>> /* Head SGE */ >>> - sge[sge_no].addr = ib_dma_map_single(xprt->sc_cm_id->device, >>> - xdr->head[0].iov_base, >>> - xdr->head[0].iov_len, >>> - DMA_TO_DEVICE); >>> + vec->sge[sge_no].iov_base = xdr->head[0].iov_base; >>> sge_bytes = min_t(u32, byte_count, xdr->head[0].iov_len); >>> > > This doesn't need to be min_t. It could be xdr->head[0].iov_len. Yes. The variable byte_count shouldn't be needed at all, so I'd start by eliminating all references to byte_count. > >>> byte_count -= sge_bytes; >>> - sge[sge_no].length = sge_bytes; >>> - sge[sge_no].lkey = xprt->sc_phys_mr->lkey; >>> + vec->sge[sge_no].iov_len = sge_bytes; >>> sge_no++; >>> /* pages SGE */ >>> @@ -99,16 +94,13 @@ static struct ib_sge *xdr_to_sge(struct svcxprt_rdma *xprt, >>> page_bytes = xdr->page_len; >>> page_off = xdr->page_base; >>> while (byte_count && page_bytes) { >>> + vec->sge[sge_no].iov_base = >>> + page_address(xdr->pages[page_no]) + page_off; >>> sge_bytes = min_t(u32, byte_count, (PAGE_SIZE-page_off)); >>> > This is still needed if the xdr terminates with a portion of a page and > there is not tail. Yes, but the condition could just be (page_bytes). >>> - sge[sge_no].addr = >>> - ib_dma_map_page(xprt->sc_cm_id->device, >>> - xdr->pages[page_no], page_off, >>> - sge_bytes, DMA_TO_DEVICE); >>> sge_bytes = min(sge_bytes, page_bytes); >>> byte_count -= sge_bytes; >>> page_bytes -= sge_bytes; >>> - sge[sge_no].length = sge_bytes; >>> - sge[sge_no].lkey = xprt->sc_phys_mr->lkey; >>> + vec->sge[sge_no].iov_len = sge_bytes; >>> sge_no++; >>> page_no++; >>> @@ -117,23 +109,17 @@ static struct ib_sge *xdr_to_sge(struct svcxprt_rdma *xprt, >>> /* Tail SGE */ >>> if (byte_count && xdr->tail[0].iov_len) { >>> > This is defensive. It could just be byte_count. I agree, though I'd just make it (xdr->tail[0].iov_len) instead of just (byte_count). > If byte_count != xdr->tail[0].iov_len, the BUG_ON below will let us know. >>> - sge[sge_no].addr = >>> - ib_dma_map_single(xprt->sc_cm_id->device, >>> - xdr->tail[0].iov_base, >>> - xdr->tail[0].iov_len, >>> - DMA_TO_DEVICE); >>> + vec->sge[sge_no].iov_base = xdr->tail[0].iov_base; >>> sge_bytes = min_t(u32, byte_count, xdr->tail[0].iov_len); >>> > The min_t isn't needed. It could just be byte_count. Right, but could also just be tail length, as above. --b. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 02/11] svcrdma: Use RPC reply map for RDMA_WRITE processing 2008-06-23 18:21 ` J. Bruce Fields @ 2008-06-24 2:29 ` Tom Tucker 0 siblings, 0 replies; 20+ messages in thread From: Tom Tucker @ 2008-06-24 2:29 UTC (permalink / raw) To: J. Bruce Fields; +Cc: linux-nfs J. Bruce Fields wrote: > On Sat, Jun 21, 2008 at 11:26:52AM -0500, Tom Tucker wrote: > >> J. Bruce Fields wrote: >> >>> Also, the only caller of xdr_to_sge() appears to be svc_rdma_sendto(), >>> which is called from svc_sendto() immediately after setting: >>> >>> xb = &rqstp->rq_res; >>> xb->len = xb->head[0].iov_len + xb->page_len + >>> xb->tail[0].iov_len; >>> >>> So I think xdr_to_sge() is doing a bunch of unnecessary work to handle >>> the case where xdr->len could be less than that sum? >>> >>> >>> >> Ok... check me below please. >> >>>> /* Head SGE */ >>>> - sge[sge_no].addr = ib_dma_map_single(xprt->sc_cm_id->device, >>>> - xdr->head[0].iov_base, >>>> - xdr->head[0].iov_len, >>>> - DMA_TO_DEVICE); >>>> + vec->sge[sge_no].iov_base = xdr->head[0].iov_base; >>>> sge_bytes = min_t(u32, byte_count, xdr->head[0].iov_len); >>>> >>>> >> This doesn't need to be min_t. It could be xdr->head[0].iov_len. >> > > Yes. The variable byte_count shouldn't be needed at all, so I'd start > by eliminating all references to byte_count. > > >>>> byte_count -= sge_bytes; >>>> - sge[sge_no].length = sge_bytes; >>>> - sge[sge_no].lkey = xprt->sc_phys_mr->lkey; >>>> + vec->sge[sge_no].iov_len = sge_bytes; >>>> sge_no++; >>>> /* pages SGE */ >>>> @@ -99,16 +94,13 @@ static struct ib_sge *xdr_to_sge(struct svcxprt_rdma *xprt, >>>> page_bytes = xdr->page_len; >>>> page_off = xdr->page_base; >>>> while (byte_count && page_bytes) { >>>> + vec->sge[sge_no].iov_base = >>>> + page_address(xdr->pages[page_no]) + page_off; >>>> sge_bytes = min_t(u32, byte_count, (PAGE_SIZE-page_off)); >>>> >>>> >> This is still needed if the xdr terminates with a portion of a page and >> there is not tail. >> > > Yes, but the condition could just be (page_bytes). > > Right. >>>> - sge[sge_no].addr = >>>> - ib_dma_map_page(xprt->sc_cm_id->device, >>>> - xdr->pages[page_no], page_off, >>>> - sge_bytes, DMA_TO_DEVICE); >>>> sge_bytes = min(sge_bytes, page_bytes); >>>> byte_count -= sge_bytes; >>>> page_bytes -= sge_bytes; >>>> - sge[sge_no].length = sge_bytes; >>>> - sge[sge_no].lkey = xprt->sc_phys_mr->lkey; >>>> + vec->sge[sge_no].iov_len = sge_bytes; >>>> sge_no++; >>>> page_no++; >>>> @@ -117,23 +109,17 @@ static struct ib_sge *xdr_to_sge(struct svcxprt_rdma *xprt, >>>> /* Tail SGE */ >>>> if (byte_count && xdr->tail[0].iov_len) { >>>> >>>> >> This is defensive. It could just be byte_count. >> > > I agree, though I'd just make it (xdr->tail[0].iov_len) instead of just > (byte_count). > > Agreed. >> If byte_count != xdr->tail[0].iov_len, the BUG_ON below will let us know. >> >>>> - sge[sge_no].addr = >>>> - ib_dma_map_single(xprt->sc_cm_id->device, >>>> - xdr->tail[0].iov_base, >>>> - xdr->tail[0].iov_len, >>>> - DMA_TO_DEVICE); >>>> + vec->sge[sge_no].iov_base = xdr->tail[0].iov_base; >>>> sge_bytes = min_t(u32, byte_count, xdr->tail[0].iov_len); >>>> >>>> >> The min_t isn't needed. It could just be byte_count. >> > > Right, but could also just be tail length, as above. > > Ok, page # confirmed... > --b. > -- > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 02/11] svcrdma: Use RPC reply map for RDMA_WRITE processing 2008-06-16 21:04 ` [PATCH 02/11] svcrdma: Use RPC reply map for RDMA_WRITE processing J. Bruce Fields 2008-06-21 16:26 ` Tom Tucker @ 2008-06-21 16:51 ` Tom Tucker 2008-06-23 18:51 ` J. Bruce Fields 1 sibling, 1 reply; 20+ messages in thread From: Tom Tucker @ 2008-06-21 16:51 UTC (permalink / raw) To: J. Bruce Fields; +Cc: linux-nfs J. Bruce Fields wrote: > On Thu, May 29, 2008 at 12:54:47PM -0500, Tom Tucker wrote: > >> Use the new svc_rdma_req_map data type for mapping the client side memory >> to the server side memory. Move the DMA mapping to the context pointed to >> by each WR individually so that it is unmapped after the WR completes. >> >> Signed-off-by: Tom Tucker <tom@opengridcomputing.com> >> >> --- >> net/sunrpc/xprtrdma/svc_rdma_sendto.c | 122 ++++++++++++++---------------- >> net/sunrpc/xprtrdma/svc_rdma_transport.c | 5 +- >> 2 files changed, 62 insertions(+), 65 deletions(-) >> >> diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c >> index fb82b1b..7cd65b7 100644 >> --- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c >> +++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c >> @@ -64,10 +64,9 @@ >> * SGE[sge_count-1] data from xdr->tail. >> * >> */ >> -static struct ib_sge *xdr_to_sge(struct svcxprt_rdma *xprt, >> - struct xdr_buf *xdr, >> - struct ib_sge *sge, >> - int *sge_count) >> +static void xdr_to_sge(struct svcxprt_rdma *xprt, >> + struct xdr_buf *xdr, >> + struct svc_rdma_req_map *vec) >> { >> /* Max we need is the length of the XDR / pagesize + one for >> * head + one for tail + one for RPCRDMA header >> @@ -84,14 +83,10 @@ static struct ib_sge *xdr_to_sge(struct svcxprt_rdma *xprt, >> sge_no = 1; >> > > Should the arrays in the svc_rdma_req_map actually be size > RPC_SVC_MAXPAGES+1 to allow for the RPCRDMA header? > > Actually, upon further inspection, this is not a bug. The RPCSVC_MAXPAGES is used to size the rqstp page array. This array includes a page for the request hdr, reply hdr and one extra page for alignment. For our use case, the reply hdr page covers the RDMA header. However, I think this payload length should be clamped in a different way. It seems silly to allocate these things to handle big-data mounts that are rarely used in practice. I think it's better to have a transport max that is more reasonable by default and a module parameter that is configurable if you're looking to do big-data mounts. Thoughts? > Also, the only caller of xdr_to_sge() appears to be svc_rdma_sendto(), > which is called from svc_sendto() immediately after setting: > > xb = &rqstp->rq_res; > xb->len = xb->head[0].iov_len + > xb->page_len + > xb->tail[0].iov_len; > > So I think xdr_to_sge() is doing a bunch of unnecessary work to handle > the case where xdr->len could be less than that sum? > > >> >> /* Head SGE */ >> - sge[sge_no].addr = ib_dma_map_single(xprt->sc_cm_id->device, >> - xdr->head[0].iov_base, >> - xdr->head[0].iov_len, >> - DMA_TO_DEVICE); >> + vec->sge[sge_no].iov_base = xdr->head[0].iov_base; >> sge_bytes = min_t(u32, byte_count, xdr->head[0].iov_len); >> byte_count -= sge_bytes; >> - sge[sge_no].length = sge_bytes; >> - sge[sge_no].lkey = xprt->sc_phys_mr->lkey; >> + vec->sge[sge_no].iov_len = sge_bytes; >> sge_no++; >> >> /* pages SGE */ >> @@ -99,16 +94,13 @@ static struct ib_sge *xdr_to_sge(struct svcxprt_rdma *xprt, >> page_bytes = xdr->page_len; >> page_off = xdr->page_base; >> while (byte_count && page_bytes) { >> + vec->sge[sge_no].iov_base = >> + page_address(xdr->pages[page_no]) + page_off; >> sge_bytes = min_t(u32, byte_count, (PAGE_SIZE-page_off)); >> - sge[sge_no].addr = >> - ib_dma_map_page(xprt->sc_cm_id->device, >> - xdr->pages[page_no], page_off, >> - sge_bytes, DMA_TO_DEVICE); >> sge_bytes = min(sge_bytes, page_bytes); >> byte_count -= sge_bytes; >> page_bytes -= sge_bytes; >> - sge[sge_no].length = sge_bytes; >> - sge[sge_no].lkey = xprt->sc_phys_mr->lkey; >> + vec->sge[sge_no].iov_len = sge_bytes; >> >> sge_no++; >> page_no++; >> @@ -117,23 +109,17 @@ static struct ib_sge *xdr_to_sge(struct svcxprt_rdma *xprt, >> >> /* Tail SGE */ >> if (byte_count && xdr->tail[0].iov_len) { >> - sge[sge_no].addr = >> - ib_dma_map_single(xprt->sc_cm_id->device, >> - xdr->tail[0].iov_base, >> - xdr->tail[0].iov_len, >> - DMA_TO_DEVICE); >> + vec->sge[sge_no].iov_base = xdr->tail[0].iov_base; >> sge_bytes = min_t(u32, byte_count, xdr->tail[0].iov_len); >> byte_count -= sge_bytes; >> - sge[sge_no].length = sge_bytes; >> - sge[sge_no].lkey = xprt->sc_phys_mr->lkey; >> + vec->sge[sge_no].iov_len = sge_bytes; >> sge_no++; >> } >> >> BUG_ON(sge_no > sge_max); >> BUG_ON(byte_count != 0); >> >> - *sge_count = sge_no; >> - return sge; >> + vec->count = sge_no; >> } >> >> >> @@ -143,9 +129,8 @@ static struct ib_sge *xdr_to_sge(struct svcxprt_rdma *xprt, >> static int send_write(struct svcxprt_rdma *xprt, struct svc_rqst *rqstp, >> u32 rmr, u64 to, >> u32 xdr_off, int write_len, >> - struct ib_sge *xdr_sge, int sge_count) >> + struct svc_rdma_req_map *vec) >> { >> - struct svc_rdma_op_ctxt *tmp_sge_ctxt; >> struct ib_send_wr write_wr; >> struct ib_sge *sge; >> int xdr_sge_no; >> @@ -156,23 +141,22 @@ static int send_write(struct svcxprt_rdma *xprt, struct svc_rqst *rqstp, >> struct svc_rdma_op_ctxt *ctxt; >> int ret = 0; >> >> - BUG_ON(sge_count > RPCSVC_MAXPAGES); >> + BUG_ON(vec->count > RPCSVC_MAXPAGES); >> dprintk("svcrdma: RDMA_WRITE rmr=%x, to=%llx, xdr_off=%d, " >> - "write_len=%d, xdr_sge=%p, sge_count=%d\n", >> + "write_len=%d, vec->sge=%p, vec->count=%lu\n", >> rmr, (unsigned long long)to, xdr_off, >> - write_len, xdr_sge, sge_count); >> + write_len, vec->sge, vec->count); >> >> ctxt = svc_rdma_get_context(xprt); >> - ctxt->count = 0; >> - tmp_sge_ctxt = svc_rdma_get_context(xprt); >> - sge = tmp_sge_ctxt->sge; >> + ctxt->direction = DMA_TO_DEVICE; >> + sge = ctxt->sge; >> >> /* Find the SGE associated with xdr_off */ >> - for (bc = xdr_off, xdr_sge_no = 1; bc && xdr_sge_no < sge_count; >> + for (bc = xdr_off, xdr_sge_no = 1; bc && xdr_sge_no < vec->count; >> xdr_sge_no++) { >> - if (xdr_sge[xdr_sge_no].length > bc) >> + if (vec->sge[xdr_sge_no].iov_len > bc) >> break; >> - bc -= xdr_sge[xdr_sge_no].length; >> + bc -= vec->sge[xdr_sge_no].iov_len; >> } >> >> sge_off = bc; >> @@ -180,21 +164,27 @@ static int send_write(struct svcxprt_rdma *xprt, struct svc_rqst *rqstp, >> sge_no = 0; >> >> /* Copy the remaining SGE */ >> - while (bc != 0 && xdr_sge_no < sge_count) { >> - sge[sge_no].addr = xdr_sge[xdr_sge_no].addr + sge_off; >> - sge[sge_no].lkey = xdr_sge[xdr_sge_no].lkey; >> + while (bc != 0 && xdr_sge_no < vec->count) { >> + sge[sge_no].lkey = xprt->sc_phys_mr->lkey; >> sge_bytes = min((size_t)bc, >> - (size_t)(xdr_sge[xdr_sge_no].length-sge_off)); >> + (size_t)(vec->sge[xdr_sge_no].iov_len-sge_off)); >> sge[sge_no].length = sge_bytes; >> - >> + sge[sge_no].addr = >> + ib_dma_map_single(xprt->sc_cm_id->device, >> + (void *) >> + vec->sge[xdr_sge_no].iov_base + sge_off, >> + sge_bytes, DMA_TO_DEVICE); >> + if (dma_mapping_error(sge[sge_no].addr)) >> + return -EINVAL; >> > > We leak a svc_rdma_op_ctxt on error here. > > --b. > > >> sge_off = 0; >> sge_no++; >> + ctxt->count++; >> xdr_sge_no++; >> bc -= sge_bytes; >> } >> >> BUG_ON(bc != 0); >> - BUG_ON(xdr_sge_no > sge_count); >> + BUG_ON(xdr_sge_no > vec->count); >> >> /* Prepare WRITE WR */ >> memset(&write_wr, 0, sizeof write_wr); >> @@ -210,11 +200,10 @@ static int send_write(struct svcxprt_rdma *xprt, struct svc_rqst *rqstp, >> /* Post It */ >> atomic_inc(&rdma_stat_write); >> if (svc_rdma_send(xprt, &write_wr)) { >> - svc_rdma_put_context(ctxt, 1); >> + svc_rdma_put_context(ctxt, 0); >> /* Fatal error, close transport */ >> ret = -EIO; >> } >> - svc_rdma_put_context(tmp_sge_ctxt, 0); >> return ret; >> } >> >> @@ -222,8 +211,7 @@ static int send_write_chunks(struct svcxprt_rdma *xprt, >> struct rpcrdma_msg *rdma_argp, >> struct rpcrdma_msg *rdma_resp, >> struct svc_rqst *rqstp, >> - struct ib_sge *sge, >> - int sge_count) >> + struct svc_rdma_req_map *vec) >> { >> u32 xfer_len = rqstp->rq_res.page_len + rqstp->rq_res.tail[0].iov_len; >> int write_len; >> @@ -269,8 +257,7 @@ static int send_write_chunks(struct svcxprt_rdma *xprt, >> rs_offset + chunk_off, >> xdr_off, >> this_write, >> - sge, >> - sge_count); >> + vec); >> if (ret) { >> dprintk("svcrdma: RDMA_WRITE failed, ret=%d\n", >> ret); >> @@ -292,8 +279,7 @@ static int send_reply_chunks(struct svcxprt_rdma *xprt, >> struct rpcrdma_msg *rdma_argp, >> struct rpcrdma_msg *rdma_resp, >> struct svc_rqst *rqstp, >> - struct ib_sge *sge, >> - int sge_count) >> + struct svc_rdma_req_map *vec) >> { >> u32 xfer_len = rqstp->rq_res.len; >> int write_len; >> @@ -341,8 +327,7 @@ static int send_reply_chunks(struct svcxprt_rdma *xprt, >> rs_offset + chunk_off, >> xdr_off, >> this_write, >> - sge, >> - sge_count); >> + vec); >> if (ret) { >> dprintk("svcrdma: RDMA_WRITE failed, ret=%d\n", >> ret); >> @@ -380,7 +365,7 @@ static int send_reply(struct svcxprt_rdma *rdma, >> struct page *page, >> struct rpcrdma_msg *rdma_resp, >> struct svc_rdma_op_ctxt *ctxt, >> - int sge_count, >> + struct svc_rdma_req_map *vec, >> int byte_count) >> { >> struct ib_send_wr send_wr; >> @@ -413,10 +398,15 @@ static int send_reply(struct svcxprt_rdma *rdma, >> ctxt->sge[0].lkey = rdma->sc_phys_mr->lkey; >> >> /* Determine how many of our SGE are to be transmitted */ >> - for (sge_no = 1; byte_count && sge_no < sge_count; sge_no++) { >> - sge_bytes = min((size_t)ctxt->sge[sge_no].length, >> - (size_t)byte_count); >> + for (sge_no = 1; byte_count && sge_no < vec->count; sge_no++) { >> + sge_bytes = min_t(size_t, vec->sge[sge_no].iov_len, byte_count); >> byte_count -= sge_bytes; >> + ctxt->sge[sge_no].addr = >> + ib_dma_map_single(rdma->sc_cm_id->device, >> + vec->sge[sge_no].iov_base, >> + sge_bytes, DMA_TO_DEVICE); >> + ctxt->sge[sge_no].length = sge_bytes; >> + ctxt->sge[sge_no].lkey = rdma->sc_phys_mr->lkey; >> } >> BUG_ON(byte_count != 0); >> >> @@ -428,8 +418,10 @@ static int send_reply(struct svcxprt_rdma *rdma, >> ctxt->pages[page_no+1] = rqstp->rq_respages[page_no]; >> ctxt->count++; >> rqstp->rq_respages[page_no] = NULL; >> + /* If there are more pages than SGE, terminate SGE list */ >> + if (page_no+1 >= sge_no) >> + ctxt->sge[page_no+1].length = 0; >> } >> - >> BUG_ON(sge_no > rdma->sc_max_sge); >> memset(&send_wr, 0, sizeof send_wr); >> ctxt->wr_op = IB_WR_SEND; >> @@ -473,20 +465,20 @@ int svc_rdma_sendto(struct svc_rqst *rqstp) >> enum rpcrdma_proc reply_type; >> int ret; >> int inline_bytes; >> - struct ib_sge *sge; >> - int sge_count = 0; >> struct page *res_page; >> struct svc_rdma_op_ctxt *ctxt; >> + struct svc_rdma_req_map *vec; >> >> dprintk("svcrdma: sending response for rqstp=%p\n", rqstp); >> >> /* Get the RDMA request header. */ >> rdma_argp = xdr_start(&rqstp->rq_arg); >> >> - /* Build an SGE for the XDR */ >> + /* Build an req vec for the XDR */ >> ctxt = svc_rdma_get_context(rdma); >> ctxt->direction = DMA_TO_DEVICE; >> - sge = xdr_to_sge(rdma, &rqstp->rq_res, ctxt->sge, &sge_count); >> + vec = svc_rdma_get_req_map(); >> + xdr_to_sge(rdma, &rqstp->rq_res, vec); >> >> inline_bytes = rqstp->rq_res.len; >> >> @@ -503,7 +495,7 @@ int svc_rdma_sendto(struct svc_rqst *rqstp) >> >> /* Send any write-chunk data and build resp write-list */ >> ret = send_write_chunks(rdma, rdma_argp, rdma_resp, >> - rqstp, sge, sge_count); >> + rqstp, vec); >> if (ret < 0) { >> printk(KERN_ERR "svcrdma: failed to send write chunks, rc=%d\n", >> ret); >> @@ -513,7 +505,7 @@ int svc_rdma_sendto(struct svc_rqst *rqstp) >> >> /* Send any reply-list data and update resp reply-list */ >> ret = send_reply_chunks(rdma, rdma_argp, rdma_resp, >> - rqstp, sge, sge_count); >> + rqstp, vec); >> if (ret < 0) { >> printk(KERN_ERR "svcrdma: failed to send reply chunks, rc=%d\n", >> ret); >> @@ -521,11 +513,13 @@ int svc_rdma_sendto(struct svc_rqst *rqstp) >> } >> inline_bytes -= ret; >> >> - ret = send_reply(rdma, rqstp, res_page, rdma_resp, ctxt, sge_count, >> + ret = send_reply(rdma, rqstp, res_page, rdma_resp, ctxt, vec, >> inline_bytes); >> + svc_rdma_put_req_map(vec); >> dprintk("svcrdma: send_reply returns %d\n", ret); >> return ret; >> error: >> + svc_rdma_put_req_map(vec); >> svc_rdma_put_context(ctxt, 0); >> put_page(res_page); >> return ret; >> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c >> index ae90758..fc86338 100644 >> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c >> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c >> @@ -387,10 +387,13 @@ static void sq_cq_reap(struct svcxprt_rdma *xprt) >> >> switch (ctxt->wr_op) { >> case IB_WR_SEND: >> - case IB_WR_RDMA_WRITE: >> svc_rdma_put_context(ctxt, 1); >> break; >> >> + case IB_WR_RDMA_WRITE: >> + svc_rdma_put_context(ctxt, 0); >> + break; >> + >> case IB_WR_RDMA_READ: >> if (test_bit(RDMACTXT_F_LAST_CTXT, &ctxt->flags)) { >> struct svc_rdma_op_ctxt *read_hdr = ctxt->read_hdr; >> > -- > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 02/11] svcrdma: Use RPC reply map for RDMA_WRITE processing 2008-06-21 16:51 ` Tom Tucker @ 2008-06-23 18:51 ` J. Bruce Fields 2008-06-24 3:02 ` Tom Tucker 0 siblings, 1 reply; 20+ messages in thread From: J. Bruce Fields @ 2008-06-23 18:51 UTC (permalink / raw) To: Tom Tucker; +Cc: linux-nfs On Sat, Jun 21, 2008 at 11:51:18AM -0500, Tom Tucker wrote: > J. Bruce Fields wrote: >> On Thu, May 29, 2008 at 12:54:47PM -0500, Tom Tucker wrote: >> >>> Use the new svc_rdma_req_map data type for mapping the client side memory >>> to the server side memory. Move the DMA mapping to the context pointed to >>> by each WR individually so that it is unmapped after the WR completes. >>> >>> Signed-off-by: Tom Tucker <tom@opengridcomputing.com> >>> >>> --- >>> net/sunrpc/xprtrdma/svc_rdma_sendto.c | 122 ++++++++++++++---------------- >>> net/sunrpc/xprtrdma/svc_rdma_transport.c | 5 +- >>> 2 files changed, 62 insertions(+), 65 deletions(-) >>> >>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c >>> index fb82b1b..7cd65b7 100644 >>> --- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c >>> +++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c >>> @@ -64,10 +64,9 @@ >>> * SGE[sge_count-1] data from xdr->tail. >>> * >>> */ >>> -static struct ib_sge *xdr_to_sge(struct svcxprt_rdma *xprt, >>> - struct xdr_buf *xdr, >>> - struct ib_sge *sge, >>> - int *sge_count) >>> +static void xdr_to_sge(struct svcxprt_rdma *xprt, >>> + struct xdr_buf *xdr, >>> + struct svc_rdma_req_map *vec) >>> { >>> /* Max we need is the length of the XDR / pagesize + one for >>> * head + one for tail + one for RPCRDMA header >>> @@ -84,14 +83,10 @@ static struct ib_sge *xdr_to_sge(struct svcxprt_rdma *xprt, >>> sge_no = 1; >>> >> >> Should the arrays in the svc_rdma_req_map actually be size >> RPC_SVC_MAXPAGES+1 to allow for the RPCRDMA header? >> >> > Actually, upon further inspection, this is not a bug. The > RPCSVC_MAXPAGES is used to size the rqstp > page array. This array includes a page for the request hdr, reply hdr > and one extra page for alignment. For > our use case, the reply hdr page covers the RDMA header. OK, that makes sense. (Let's just fix the comment above so I don't forget and ask the same question again next time.) > However, I think this payload length should be clamped in a different > way. It seems silly to allocate these > things to handle big-data mounts that are rarely used in practice. I > think it's better to have a transport max > that is more reasonable by default and a module parameter that is > configurable if you're looking to do big-data > mounts. > > Thoughts? This is decided in fs/nfsd/nfssvc.c:nfsd_create_serv(), based on the machine's ram size, and can be overridden with the nfsd filesystem's maxblksize parameter. I'm open to suggestions, but my feeling is that it's hard to educate people about how to use additional sysctl's, and that it's reasonable to waste some memory to get good out-of-the-box performance on streaming IO. See also recent patches from Olga and Dean for similar discussion (about WAN tcp performance in their case, since the size of these buffers affects the size of the tcp window we advertise). --b. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 02/11] svcrdma: Use RPC reply map for RDMA_WRITE processing 2008-06-23 18:51 ` J. Bruce Fields @ 2008-06-24 3:02 ` Tom Tucker 0 siblings, 0 replies; 20+ messages in thread From: Tom Tucker @ 2008-06-24 3:02 UTC (permalink / raw) To: J. Bruce Fields; +Cc: linux-nfs J. Bruce Fields wrote: > On Sat, Jun 21, 2008 at 11:51:18AM -0500, Tom Tucker wrote: > >> J. Bruce Fields wrote: >> >>> On Thu, May 29, 2008 at 12:54:47PM -0500, Tom Tucker wrote: >>> >>> >>>> Use the new svc_rdma_req_map data type for mapping the client side memory >>>> to the server side memory. Move the DMA mapping to the context pointed to >>>> by each WR individually so that it is unmapped after the WR completes. >>>> >>>> Signed-off-by: Tom Tucker <tom@opengridcomputing.com> >>>> >>>> --- >>>> net/sunrpc/xprtrdma/svc_rdma_sendto.c | 122 ++++++++++++++---------------- >>>> net/sunrpc/xprtrdma/svc_rdma_transport.c | 5 +- >>>> 2 files changed, 62 insertions(+), 65 deletions(-) >>>> >>>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c >>>> index fb82b1b..7cd65b7 100644 >>>> --- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c >>>> +++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c >>>> @@ -64,10 +64,9 @@ >>>> * SGE[sge_count-1] data from xdr->tail. >>>> * >>>> */ >>>> -static struct ib_sge *xdr_to_sge(struct svcxprt_rdma *xprt, >>>> - struct xdr_buf *xdr, >>>> - struct ib_sge *sge, >>>> - int *sge_count) >>>> +static void xdr_to_sge(struct svcxprt_rdma *xprt, >>>> + struct xdr_buf *xdr, >>>> + struct svc_rdma_req_map *vec) >>>> { >>>> /* Max we need is the length of the XDR / pagesize + one for >>>> * head + one for tail + one for RPCRDMA header >>>> @@ -84,14 +83,10 @@ static struct ib_sge *xdr_to_sge(struct svcxprt_rdma *xprt, >>>> sge_no = 1; >>>> >>>> >>> Should the arrays in the svc_rdma_req_map actually be size >>> RPC_SVC_MAXPAGES+1 to allow for the RPCRDMA header? >>> >>> >>> >> Actually, upon further inspection, this is not a bug. The >> RPCSVC_MAXPAGES is used to size the rqstp >> page array. This array includes a page for the request hdr, reply hdr >> and one extra page for alignment. For >> our use case, the reply hdr page covers the RDMA header. >> > > OK, that makes sense. (Let's just fix the comment above so I don't > forget and ask the same question again next time.) > > Ok, done. >> However, I think this payload length should be clamped in a different >> way. It seems silly to allocate these >> things to handle big-data mounts that are rarely used in practice. I >> think it's better to have a transport max >> that is more reasonable by default and a module parameter that is >> configurable if you're looking to do big-data >> mounts. >> >> Thoughts? >> > > This is decided in fs/nfsd/nfssvc.c:nfsd_create_serv(), based on the > machine's ram size, and can be overridden with the nfsd filesystem's > maxblksize parameter. > > I'm open to suggestions, but my feeling is that it's hard to educate > people about how to use additional sysctl's, and that it's reasonable to > waste some memory to get good out-of-the-box performance on streaming > IO. > > See also recent patches from Olga and Dean for similar discussion > (about WAN tcp performance in their case, since the size of these > buffers affects the size of the tcp window we advertise). > > --b. > ^ permalink raw reply [flat|nested] 20+ messages in thread
[parent not found: <1212083697950-git-send-email-tom@opengridcomputing.com>]
[parent not found: <1212083697236-git-send-email-tom@opengridcomputing.com>]
[parent not found: <12120836973390-git-send-email-tom@opengridcomputing.com>]
[parent not found: <12120836973638-git-send-email-tom@opengridcomputing.com>]
[parent not found: <12120836973072-git-send-email-tom@opengridcomputing.com>]
[parent not found: <12120836972503-git-send-email-tom@opengridcomputing.com>]
[parent not found: <12120836973166-git-send-email-tom@opengridcomputing.com>]
[parent not found: <12120836972648-git-send-email-tom@opengridcomputing.com>]
* Re: [PATCH 10/11] svcrdma: Create a kmem cache for the WR contexts [not found] ` <12120836972648-git-send-email-tom@opengridcomputing.com> @ 2008-06-16 21:24 ` J. Bruce Fields 2008-06-21 17:08 ` Tom Tucker 0 siblings, 1 reply; 20+ messages in thread From: J. Bruce Fields @ 2008-06-16 21:24 UTC (permalink / raw) To: Tom Tucker; +Cc: linux-nfs On Thu, May 29, 2008 at 12:54:55PM -0500, Tom Tucker wrote: > Create a kmem cache to hold WR contexts. Next we will convert > the WR context get and put services to use this kmem cache. > > Signed-off-by: Tom Tucker <tom@opengridcomputing.com> > > --- > net/sunrpc/xprtrdma/svc_rdma.c | 24 ++++++++++++++++++++---- > 1 files changed, 20 insertions(+), 4 deletions(-) > > diff --git a/net/sunrpc/xprtrdma/svc_rdma.c b/net/sunrpc/xprtrdma/svc_rdma.c > index 1959a9d..8a7d34b 100644 > --- a/net/sunrpc/xprtrdma/svc_rdma.c > +++ b/net/sunrpc/xprtrdma/svc_rdma.c > @@ -69,8 +69,9 @@ atomic_t rdma_stat_rq_prod; > atomic_t rdma_stat_sq_poll; > atomic_t rdma_stat_sq_prod; > > -/* Temporary NFS request map cache */ > +/* Temporary NFS request map and context caches */ > struct kmem_cache *svc_rdma_map_cachep = NULL; > +struct kmem_cache *svc_rdma_ctxt_cachep = NULL; No need to initialize this one either. > > /* > * This function implements reading and resetting an atomic_t stat > @@ -247,6 +248,8 @@ void svc_rdma_cleanup(void) > svc_unreg_xprt_class(&svc_rdma_class); > if (svc_rdma_map_cachep) > kmem_cache_destroy(svc_rdma_map_cachep); > + if (svc_rdma_ctxt_cachep) > + kmem_cache_destroy(svc_rdma_ctxt_cachep); And the conditional is probably unnecessary again. --b. > } > > int svc_rdma_init(void) > @@ -269,14 +272,27 @@ int svc_rdma_init(void) > NULL); > if (!svc_rdma_map_cachep) { > printk(KERN_INFO "Could not allocate map cache.\n"); > - goto err; > + goto err0; > + } > + > + /* Create the temporary context cache */ > + svc_rdma_ctxt_cachep = > + kmem_cache_create("svc_rdma_ctxt_cache", > + sizeof(struct svc_rdma_op_ctxt), > + 0, > + SLAB_HWCACHE_ALIGN, > + NULL); > + if (!svc_rdma_ctxt_cachep) { > + printk(KERN_INFO "Could not allocate WR ctxt cache.\n"); > + goto err1; > } > > /* Register RDMA with the SVC transport switch */ > svc_reg_xprt_class(&svc_rdma_class); > return 0; > - > - err: > + err1: > + kmem_cache_destroy(svc_rdma_map_cachep); > + err0: > if (svcrdma_table_header) { > unregister_sysctl_table(svcrdma_table_header); > svcrdma_table_header = NULL; ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 10/11] svcrdma: Create a kmem cache for the WR contexts 2008-06-16 21:24 ` [PATCH 10/11] svcrdma: Create a kmem cache for the WR contexts J. Bruce Fields @ 2008-06-21 17:08 ` Tom Tucker 0 siblings, 0 replies; 20+ messages in thread From: Tom Tucker @ 2008-06-21 17:08 UTC (permalink / raw) To: J. Bruce Fields; +Cc: linux-nfs J. Bruce Fields wrote: > On Thu, May 29, 2008 at 12:54:55PM -0500, Tom Tucker wrote: > >> Create a kmem cache to hold WR contexts. Next we will convert >> the WR context get and put services to use this kmem cache. >> >> Signed-off-by: Tom Tucker <tom@opengridcomputing.com> >> >> --- >> net/sunrpc/xprtrdma/svc_rdma.c | 24 ++++++++++++++++++++---- >> 1 files changed, 20 insertions(+), 4 deletions(-) >> >> diff --git a/net/sunrpc/xprtrdma/svc_rdma.c b/net/sunrpc/xprtrdma/svc_rdma.c >> index 1959a9d..8a7d34b 100644 >> --- a/net/sunrpc/xprtrdma/svc_rdma.c >> +++ b/net/sunrpc/xprtrdma/svc_rdma.c >> @@ -69,8 +69,9 @@ atomic_t rdma_stat_rq_prod; >> atomic_t rdma_stat_sq_poll; >> atomic_t rdma_stat_sq_prod; >> >> -/* Temporary NFS request map cache */ >> +/* Temporary NFS request map and context caches */ >> struct kmem_cache *svc_rdma_map_cachep = NULL; >> +struct kmem_cache *svc_rdma_ctxt_cachep = NULL; >> > > No need to initialize this one either. > > Given the way it's used, agreed. >> >> /* >> * This function implements reading and resetting an atomic_t stat >> @@ -247,6 +248,8 @@ void svc_rdma_cleanup(void) >> svc_unreg_xprt_class(&svc_rdma_class); >> if (svc_rdma_map_cachep) >> kmem_cache_destroy(svc_rdma_map_cachep); >> + if (svc_rdma_ctxt_cachep) >> + kmem_cache_destroy(svc_rdma_ctxt_cachep); >> > > And the conditional is probably unnecessary again. > > ok. > --b. > > >> } >> >> int svc_rdma_init(void) >> @@ -269,14 +272,27 @@ int svc_rdma_init(void) >> NULL); >> if (!svc_rdma_map_cachep) { >> printk(KERN_INFO "Could not allocate map cache.\n"); >> - goto err; >> + goto err0; >> + } >> + >> + /* Create the temporary context cache */ >> + svc_rdma_ctxt_cachep = >> + kmem_cache_create("svc_rdma_ctxt_cache", >> + sizeof(struct svc_rdma_op_ctxt), >> + 0, >> + SLAB_HWCACHE_ALIGN, >> + NULL); >> + if (!svc_rdma_ctxt_cachep) { >> + printk(KERN_INFO "Could not allocate WR ctxt cache.\n"); >> + goto err1; >> } >> >> /* Register RDMA with the SVC transport switch */ >> svc_reg_xprt_class(&svc_rdma_class); >> return 0; >> - >> - err: >> + err1: >> + kmem_cache_destroy(svc_rdma_map_cachep); >> + err0: >> if (svcrdma_table_header) { >> unregister_sysctl_table(svcrdma_table_header); >> svcrdma_table_header = NULL; >> ^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH 00/11] svcrdma: WR context management bug fixes and cleanup @ 2008-07-03 2:27 Tom Tucker 2008-07-03 2:27 ` [PATCH 01/11] svcrdma: Add a type for keeping NFS RPC mapping Tom Tucker 0 siblings, 1 reply; 20+ messages in thread From: Tom Tucker @ 2008-07-03 2:27 UTC (permalink / raw) To: bfields; +Cc: linux-nfs, Tom Tucker Bruce: This is version 2 of the WR context cleanup patchset. It includes modifications per your review. It has been tested with iozone and Connectathon over IB and iWARP on x86_64. The x86_32 platform has only been compile tested for warnings. This set of patches fixes bugs related to DMA mapping and unmapping and in the process redesigns and cleans up the way that WR contexts are managed. In addition to the bug fixes, there are three major changes: - The data structure that maps client side memory to server side memory has been separated from the WR context. This makes for cleaner type checking and a smaller memory footprint for the RDMA transport. - The mapping and unmapping of DMA is performed independently from the allocation and deallocation of the WR context. This is to allow the unmapping to occur early on recv where the unmapping of the DMA needs to be done prior to looking at the data and therefore before deallocating the WR context. - The WR context cache now uses a kmem_cache and is shared across all transport instances. This allows contexts to be shared between mounts and idle mounts don't consume context memory. This patchset is intended for 2.6.27 and is based on 2.6.26-rc8. These patches are also available at the following git url: git://git.linux-nfs.org/projects/tomtucker/xprt-switch-2.6.git [PATCH 01/11] svcrdma: Add a type for keeping NFS RPC mapping include/linux/sunrpc/svc_rdma.h | 27 +++++++++++++++++++++++++++ net/sunrpc/xprtrdma/svc_rdma.c | 19 +++++++++++++++++++ net/sunrpc/xprtrdma/svc_rdma_transport.c | 26 ++++++++++++++++++++++++++ 3 files changed, 72 insertions(+), 0 deletions(-) [PATCH 02/11] svcrdma: Use RPC reply map for RDMA_WRITE processing net/sunrpc/xprtrdma/svc_rdma_sendto.c | 163 ++++++++++++++---------------- net/sunrpc/xprtrdma/svc_rdma_transport.c | 5 +- 2 files changed, 80 insertions(+), 88 deletions(-) [PATCH 03/11] svcrdma: Use reply and chunk map for RDMA_READ processing include/linux/sunrpc/svc_rdma.h | 1 + net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 83 ++++++++++++++----------------- 2 files changed, 39 insertions(+), 45 deletions(-) [PATCH 04/11] svcrdma: Move the DMA unmap logic to the CQ handler net/sunrpc/xprtrdma/svc_rdma_transport.c | 20 ++++++++++++++------ 1 files changed, 14 insertions(+), 6 deletions(-) [PATCH 05/11] svcrdma: Add dma map count and WARN_ON include/linux/sunrpc/svc_rdma.h | 1 + net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 1 + net/sunrpc/xprtrdma/svc_rdma_sendto.c | 3 +++ net/sunrpc/xprtrdma/svc_rdma_transport.c | 5 +++++ 4 files changed, 10 insertions(+), 0 deletions(-) [PATCH 06/11] svcrdma: Remove unneeded spin locks from __svc_rdma_free net/sunrpc/xprtrdma/svc_rdma_transport.c | 4 ---- 1 files changed, 0 insertions(+), 4 deletions(-) [PATCH 07/11] svcrdma: Remove unused wait q from svcrdma_xprt structure include/linux/sunrpc/svc_rdma.h | 1 - 1 files changed, 0 insertions(+), 1 deletions(-) [PATCH 08/11] svcrdma: Limit ORD based on client's advertised IRD net/sunrpc/xprtrdma/svc_rdma_transport.c | 16 ++++++++++++---- 1 files changed, 12 insertions(+), 4 deletions(-) [PATCH 09/11] svcrdma: Add flush_scheduled_work to module exit function net/sunrpc/xprtrdma/svc_rdma.c | 1 + 1 files changed, 1 insertions(+), 0 deletions(-) [PATCH 10/11] svcrdma: Create a kmem cache for the WR contexts net/sunrpc/xprtrdma/svc_rdma.c | 23 +++++++++++++++++++---- 1 files changed, 19 insertions(+), 4 deletions(-) [PATCH 11/11] svcrdma: Change WR context get/put to use the kmem cache include/linux/sunrpc/svc_rdma.h | 6 -- net/sunrpc/xprtrdma/svc_rdma_transport.c | 121 +++--------------------------- 2 files changed, 12 insertions(+), 115 deletions(-) Signed-off-by: Tom Tucker <tom@opengridcomputing.com> ^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH 01/11] svcrdma: Add a type for keeping NFS RPC mapping 2008-07-03 2:27 [PATCH 00/11] svcrdma: WR context management bug fixes and cleanup Tom Tucker @ 2008-07-03 2:27 ` Tom Tucker 2008-07-03 2:27 ` [PATCH 02/11] svcrdma: Use RPC reply map for RDMA_WRITE processing Tom Tucker 0 siblings, 1 reply; 20+ messages in thread From: Tom Tucker @ 2008-07-03 2:27 UTC (permalink / raw) To: bfields; +Cc: linux-nfs, Tom Tucker Create a new data structure to hold the remote client address space to local server address space mapping. Signed-off-by: Tom Tucker <tom@opengridcomputing.com> --- include/linux/sunrpc/svc_rdma.h | 27 +++++++++++++++++++++++++++ net/sunrpc/xprtrdma/svc_rdma.c | 19 +++++++++++++++++++ net/sunrpc/xprtrdma/svc_rdma_transport.c | 26 ++++++++++++++++++++++++++ 3 files changed, 72 insertions(+), 0 deletions(-) diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h index 05eb466..bd8749c 100644 --- a/include/linux/sunrpc/svc_rdma.h +++ b/include/linux/sunrpc/svc_rdma.h @@ -86,6 +86,31 @@ struct svc_rdma_op_ctxt { struct page *pages[RPCSVC_MAXPAGES]; }; +/* + * NFS_ requests are mapped on the client side by the chunk lists in + * the RPCRDMA header. During the fetching of the RPC from the client + * and the writing of the reply to the client, the memory in the + * client and the memory in the server must be mapped as contiguous + * vaddr/len for access by the hardware. These data strucures keep + * these mappings. + * + * For an RDMA_WRITE, the 'sge' maps the RPC REPLY. For RDMA_READ, the + * 'sge' in the svc_rdma_req_map maps the server side RPC reply and the + * 'ch' field maps the read-list of the RPCRDMA header to the 'sge' + * mapping of the reply. + */ +struct svc_rdma_chunk_sge { + int start; /* sge no for this chunk */ + int count; /* sge count for this chunk */ +}; +struct svc_rdma_req_map { + unsigned long count; + union { + struct kvec sge[RPCSVC_MAXPAGES]; + struct svc_rdma_chunk_sge ch[RPCSVC_MAXPAGES]; + }; +}; + #define RDMACTXT_F_LAST_CTXT 2 struct svcxprt_rdma { @@ -173,6 +198,8 @@ extern int svc_rdma_post_recv(struct svcxprt_rdma *); extern int svc_rdma_create_listen(struct svc_serv *, int, struct sockaddr *); extern struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *); extern void svc_rdma_put_context(struct svc_rdma_op_ctxt *, int); +extern struct svc_rdma_req_map *svc_rdma_get_req_map(void); +extern void svc_rdma_put_req_map(struct svc_rdma_req_map *); extern void svc_sq_reap(struct svcxprt_rdma *); extern void svc_rq_reap(struct svcxprt_rdma *); extern struct svc_xprt_class svc_rdma_class; diff --git a/net/sunrpc/xprtrdma/svc_rdma.c b/net/sunrpc/xprtrdma/svc_rdma.c index 88c0ca2..171f205 100644 --- a/net/sunrpc/xprtrdma/svc_rdma.c +++ b/net/sunrpc/xprtrdma/svc_rdma.c @@ -69,6 +69,9 @@ atomic_t rdma_stat_rq_prod; atomic_t rdma_stat_sq_poll; atomic_t rdma_stat_sq_prod; +/* Temporary NFS request map cache */ +struct kmem_cache *svc_rdma_map_cachep; + /* * This function implements reading and resetting an atomic_t stat * variable through read/write to a proc file. Any write to the file @@ -241,6 +244,7 @@ void svc_rdma_cleanup(void) svcrdma_table_header = NULL; } svc_unreg_xprt_class(&svc_rdma_class); + kmem_cache_destroy(svc_rdma_map_cachep); } int svc_rdma_init(void) @@ -255,9 +259,24 @@ int svc_rdma_init(void) svcrdma_table_header = register_sysctl_table(svcrdma_root_table); + /* Create the temporary map cache */ + svc_rdma_map_cachep = kmem_cache_create("svc_rdma_map_cache", + sizeof(struct svc_rdma_req_map), + 0, + SLAB_HWCACHE_ALIGN, + NULL); + if (!svc_rdma_map_cachep) { + printk(KERN_INFO "Could not allocate map cache.\n"); + goto err; + } + /* Register RDMA with the SVC transport switch */ svc_reg_xprt_class(&svc_rdma_class); return 0; + + err: + unregister_sysctl_table(svcrdma_table_header); + return -ENOMEM; } MODULE_AUTHOR("Tom Tucker <tom@opengridcomputing.com>"); MODULE_DESCRIPTION("SVC RDMA Transport"); diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c index e132509..ae90758 100644 --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c @@ -173,6 +173,32 @@ void svc_rdma_put_context(struct svc_rdma_op_ctxt *ctxt, int free_pages) atomic_dec(&xprt->sc_ctxt_used); } +/* Temporary NFS request map cache. Created in svc_rdma.c */ +extern struct kmem_cache *svc_rdma_map_cachep; + +/* + * Temporary NFS req mappings are shared across all transport + * instances. These are short lived and should be bounded by the number + * of concurrent server threads * depth of the SQ. + */ +struct svc_rdma_req_map *svc_rdma_get_req_map(void) +{ + struct svc_rdma_req_map *map; + while (1) { + map = kmem_cache_alloc(svc_rdma_map_cachep, GFP_KERNEL); + if (map) + break; + schedule_timeout_uninterruptible(msecs_to_jiffies(500)); + } + map->count = 0; + return map; +} + +void svc_rdma_put_req_map(struct svc_rdma_req_map *map) +{ + kmem_cache_free(svc_rdma_map_cachep, map); +} + /* ib_cq event handler */ static void cq_event_handler(struct ib_event *event, void *context) { ^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH 02/11] svcrdma: Use RPC reply map for RDMA_WRITE processing 2008-07-03 2:27 ` [PATCH 01/11] svcrdma: Add a type for keeping NFS RPC mapping Tom Tucker @ 2008-07-03 2:27 ` Tom Tucker 2008-07-03 2:27 ` [PATCH 03/11] svcrdma: Use reply and chunk map for RDMA_READ processing Tom Tucker 0 siblings, 1 reply; 20+ messages in thread From: Tom Tucker @ 2008-07-03 2:27 UTC (permalink / raw) To: bfields; +Cc: linux-nfs, Tom Tucker Use the new svc_rdma_req_map data type for mapping the client side memory to the server side memory. Move the DMA mapping to the context pointed to by each WR individually so that it is unmapped after the WR completes. Signed-off-by: Tom Tucker <tom@opengridcomputing.com> --- net/sunrpc/xprtrdma/svc_rdma_sendto.c | 163 ++++++++++++++---------------- net/sunrpc/xprtrdma/svc_rdma_transport.c | 5 +- 2 files changed, 80 insertions(+), 88 deletions(-) diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c index fb82b1b..bdc11a3 100644 --- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c +++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c @@ -63,52 +63,44 @@ * SGE[2..sge_count-2] data from xdr->pages[] * SGE[sge_count-1] data from xdr->tail. * + * The max SGE we need is the length of the XDR / pagesize + one for + * head + one for tail + one for RPCRDMA header. Since RPCSVC_MAXPAGES + * reserves a page for both the request and the reply header, and this + * array is only concerned with the reply we are assured that we have + * on extra page for the RPCRMDA header. */ -static struct ib_sge *xdr_to_sge(struct svcxprt_rdma *xprt, - struct xdr_buf *xdr, - struct ib_sge *sge, - int *sge_count) +static void xdr_to_sge(struct svcxprt_rdma *xprt, + struct xdr_buf *xdr, + struct svc_rdma_req_map *vec) { - /* Max we need is the length of the XDR / pagesize + one for - * head + one for tail + one for RPCRDMA header - */ int sge_max = (xdr->len+PAGE_SIZE-1) / PAGE_SIZE + 3; int sge_no; - u32 byte_count = xdr->len; u32 sge_bytes; u32 page_bytes; - int page_off; + u32 page_off; int page_no; + BUG_ON(xdr->len != + (xdr->head[0].iov_len + xdr->page_len + xdr->tail[0].iov_len)); + /* Skip the first sge, this is for the RPCRDMA header */ sge_no = 1; /* Head SGE */ - sge[sge_no].addr = ib_dma_map_single(xprt->sc_cm_id->device, - xdr->head[0].iov_base, - xdr->head[0].iov_len, - DMA_TO_DEVICE); - sge_bytes = min_t(u32, byte_count, xdr->head[0].iov_len); - byte_count -= sge_bytes; - sge[sge_no].length = sge_bytes; - sge[sge_no].lkey = xprt->sc_phys_mr->lkey; + vec->sge[sge_no].iov_base = xdr->head[0].iov_base; + vec->sge[sge_no].iov_len = xdr->head[0].iov_len; sge_no++; /* pages SGE */ page_no = 0; page_bytes = xdr->page_len; page_off = xdr->page_base; - while (byte_count && page_bytes) { - sge_bytes = min_t(u32, byte_count, (PAGE_SIZE-page_off)); - sge[sge_no].addr = - ib_dma_map_page(xprt->sc_cm_id->device, - xdr->pages[page_no], page_off, - sge_bytes, DMA_TO_DEVICE); - sge_bytes = min(sge_bytes, page_bytes); - byte_count -= sge_bytes; + while (page_bytes) { + vec->sge[sge_no].iov_base = + page_address(xdr->pages[page_no]) + page_off; + sge_bytes = min_t(u32, page_bytes, (PAGE_SIZE - page_off)); page_bytes -= sge_bytes; - sge[sge_no].length = sge_bytes; - sge[sge_no].lkey = xprt->sc_phys_mr->lkey; + vec->sge[sge_no].iov_len = sge_bytes; sge_no++; page_no++; @@ -116,36 +108,24 @@ static struct ib_sge *xdr_to_sge(struct svcxprt_rdma *xprt, } /* Tail SGE */ - if (byte_count && xdr->tail[0].iov_len) { - sge[sge_no].addr = - ib_dma_map_single(xprt->sc_cm_id->device, - xdr->tail[0].iov_base, - xdr->tail[0].iov_len, - DMA_TO_DEVICE); - sge_bytes = min_t(u32, byte_count, xdr->tail[0].iov_len); - byte_count -= sge_bytes; - sge[sge_no].length = sge_bytes; - sge[sge_no].lkey = xprt->sc_phys_mr->lkey; + if (xdr->tail[0].iov_len) { + vec->sge[sge_no].iov_base = xdr->tail[0].iov_base; + vec->sge[sge_no].iov_len = xdr->tail[0].iov_len; sge_no++; } BUG_ON(sge_no > sge_max); - BUG_ON(byte_count != 0); - - *sge_count = sge_no; - return sge; + vec->count = sge_no; } - /* Assumptions: * - The specified write_len can be represented in sc_max_sge * PAGE_SIZE */ static int send_write(struct svcxprt_rdma *xprt, struct svc_rqst *rqstp, u32 rmr, u64 to, u32 xdr_off, int write_len, - struct ib_sge *xdr_sge, int sge_count) + struct svc_rdma_req_map *vec) { - struct svc_rdma_op_ctxt *tmp_sge_ctxt; struct ib_send_wr write_wr; struct ib_sge *sge; int xdr_sge_no; @@ -154,25 +134,23 @@ static int send_write(struct svcxprt_rdma *xprt, struct svc_rqst *rqstp, int sge_off; int bc; struct svc_rdma_op_ctxt *ctxt; - int ret = 0; - BUG_ON(sge_count > RPCSVC_MAXPAGES); + BUG_ON(vec->count > RPCSVC_MAXPAGES); dprintk("svcrdma: RDMA_WRITE rmr=%x, to=%llx, xdr_off=%d, " - "write_len=%d, xdr_sge=%p, sge_count=%d\n", + "write_len=%d, vec->sge=%p, vec->count=%lu\n", rmr, (unsigned long long)to, xdr_off, - write_len, xdr_sge, sge_count); + write_len, vec->sge, vec->count); ctxt = svc_rdma_get_context(xprt); - ctxt->count = 0; - tmp_sge_ctxt = svc_rdma_get_context(xprt); - sge = tmp_sge_ctxt->sge; + ctxt->direction = DMA_TO_DEVICE; + sge = ctxt->sge; /* Find the SGE associated with xdr_off */ - for (bc = xdr_off, xdr_sge_no = 1; bc && xdr_sge_no < sge_count; + for (bc = xdr_off, xdr_sge_no = 1; bc && xdr_sge_no < vec->count; xdr_sge_no++) { - if (xdr_sge[xdr_sge_no].length > bc) + if (vec->sge[xdr_sge_no].iov_len > bc) break; - bc -= xdr_sge[xdr_sge_no].length; + bc -= vec->sge[xdr_sge_no].iov_len; } sge_off = bc; @@ -180,21 +158,27 @@ static int send_write(struct svcxprt_rdma *xprt, struct svc_rqst *rqstp, sge_no = 0; /* Copy the remaining SGE */ - while (bc != 0 && xdr_sge_no < sge_count) { - sge[sge_no].addr = xdr_sge[xdr_sge_no].addr + sge_off; - sge[sge_no].lkey = xdr_sge[xdr_sge_no].lkey; + while (bc != 0 && xdr_sge_no < vec->count) { + sge[sge_no].lkey = xprt->sc_phys_mr->lkey; sge_bytes = min((size_t)bc, - (size_t)(xdr_sge[xdr_sge_no].length-sge_off)); + (size_t)(vec->sge[xdr_sge_no].iov_len-sge_off)); sge[sge_no].length = sge_bytes; - + sge[sge_no].addr = + ib_dma_map_single(xprt->sc_cm_id->device, + (void *) + vec->sge[xdr_sge_no].iov_base + sge_off, + sge_bytes, DMA_TO_DEVICE); + if (dma_mapping_error(sge[sge_no].addr)) + goto err; sge_off = 0; sge_no++; + ctxt->count++; xdr_sge_no++; bc -= sge_bytes; } BUG_ON(bc != 0); - BUG_ON(xdr_sge_no > sge_count); + BUG_ON(xdr_sge_no > vec->count); /* Prepare WRITE WR */ memset(&write_wr, 0, sizeof write_wr); @@ -209,21 +193,20 @@ static int send_write(struct svcxprt_rdma *xprt, struct svc_rqst *rqstp, /* Post It */ atomic_inc(&rdma_stat_write); - if (svc_rdma_send(xprt, &write_wr)) { - svc_rdma_put_context(ctxt, 1); - /* Fatal error, close transport */ - ret = -EIO; - } - svc_rdma_put_context(tmp_sge_ctxt, 0); - return ret; + if (svc_rdma_send(xprt, &write_wr)) + goto err; + return 0; + err: + svc_rdma_put_context(ctxt, 0); + /* Fatal error, close transport */ + return -EIO; } static int send_write_chunks(struct svcxprt_rdma *xprt, struct rpcrdma_msg *rdma_argp, struct rpcrdma_msg *rdma_resp, struct svc_rqst *rqstp, - struct ib_sge *sge, - int sge_count) + struct svc_rdma_req_map *vec) { u32 xfer_len = rqstp->rq_res.page_len + rqstp->rq_res.tail[0].iov_len; int write_len; @@ -269,8 +252,7 @@ static int send_write_chunks(struct svcxprt_rdma *xprt, rs_offset + chunk_off, xdr_off, this_write, - sge, - sge_count); + vec); if (ret) { dprintk("svcrdma: RDMA_WRITE failed, ret=%d\n", ret); @@ -292,8 +274,7 @@ static int send_reply_chunks(struct svcxprt_rdma *xprt, struct rpcrdma_msg *rdma_argp, struct rpcrdma_msg *rdma_resp, struct svc_rqst *rqstp, - struct ib_sge *sge, - int sge_count) + struct svc_rdma_req_map *vec) { u32 xfer_len = rqstp->rq_res.len; int write_len; @@ -341,8 +322,7 @@ static int send_reply_chunks(struct svcxprt_rdma *xprt, rs_offset + chunk_off, xdr_off, this_write, - sge, - sge_count); + vec); if (ret) { dprintk("svcrdma: RDMA_WRITE failed, ret=%d\n", ret); @@ -380,7 +360,7 @@ static int send_reply(struct svcxprt_rdma *rdma, struct page *page, struct rpcrdma_msg *rdma_resp, struct svc_rdma_op_ctxt *ctxt, - int sge_count, + struct svc_rdma_req_map *vec, int byte_count) { struct ib_send_wr send_wr; @@ -413,10 +393,15 @@ static int send_reply(struct svcxprt_rdma *rdma, ctxt->sge[0].lkey = rdma->sc_phys_mr->lkey; /* Determine how many of our SGE are to be transmitted */ - for (sge_no = 1; byte_count && sge_no < sge_count; sge_no++) { - sge_bytes = min((size_t)ctxt->sge[sge_no].length, - (size_t)byte_count); + for (sge_no = 1; byte_count && sge_no < vec->count; sge_no++) { + sge_bytes = min_t(size_t, vec->sge[sge_no].iov_len, byte_count); byte_count -= sge_bytes; + ctxt->sge[sge_no].addr = + ib_dma_map_single(rdma->sc_cm_id->device, + vec->sge[sge_no].iov_base, + sge_bytes, DMA_TO_DEVICE); + ctxt->sge[sge_no].length = sge_bytes; + ctxt->sge[sge_no].lkey = rdma->sc_phys_mr->lkey; } BUG_ON(byte_count != 0); @@ -428,8 +413,10 @@ static int send_reply(struct svcxprt_rdma *rdma, ctxt->pages[page_no+1] = rqstp->rq_respages[page_no]; ctxt->count++; rqstp->rq_respages[page_no] = NULL; + /* If there are more pages than SGE, terminate SGE list */ + if (page_no+1 >= sge_no) + ctxt->sge[page_no+1].length = 0; } - BUG_ON(sge_no > rdma->sc_max_sge); memset(&send_wr, 0, sizeof send_wr); ctxt->wr_op = IB_WR_SEND; @@ -473,20 +460,20 @@ int svc_rdma_sendto(struct svc_rqst *rqstp) enum rpcrdma_proc reply_type; int ret; int inline_bytes; - struct ib_sge *sge; - int sge_count = 0; struct page *res_page; struct svc_rdma_op_ctxt *ctxt; + struct svc_rdma_req_map *vec; dprintk("svcrdma: sending response for rqstp=%p\n", rqstp); /* Get the RDMA request header. */ rdma_argp = xdr_start(&rqstp->rq_arg); - /* Build an SGE for the XDR */ + /* Build an req vec for the XDR */ ctxt = svc_rdma_get_context(rdma); ctxt->direction = DMA_TO_DEVICE; - sge = xdr_to_sge(rdma, &rqstp->rq_res, ctxt->sge, &sge_count); + vec = svc_rdma_get_req_map(); + xdr_to_sge(rdma, &rqstp->rq_res, vec); inline_bytes = rqstp->rq_res.len; @@ -503,7 +490,7 @@ int svc_rdma_sendto(struct svc_rqst *rqstp) /* Send any write-chunk data and build resp write-list */ ret = send_write_chunks(rdma, rdma_argp, rdma_resp, - rqstp, sge, sge_count); + rqstp, vec); if (ret < 0) { printk(KERN_ERR "svcrdma: failed to send write chunks, rc=%d\n", ret); @@ -513,7 +500,7 @@ int svc_rdma_sendto(struct svc_rqst *rqstp) /* Send any reply-list data and update resp reply-list */ ret = send_reply_chunks(rdma, rdma_argp, rdma_resp, - rqstp, sge, sge_count); + rqstp, vec); if (ret < 0) { printk(KERN_ERR "svcrdma: failed to send reply chunks, rc=%d\n", ret); @@ -521,11 +508,13 @@ int svc_rdma_sendto(struct svc_rqst *rqstp) } inline_bytes -= ret; - ret = send_reply(rdma, rqstp, res_page, rdma_resp, ctxt, sge_count, + ret = send_reply(rdma, rqstp, res_page, rdma_resp, ctxt, vec, inline_bytes); + svc_rdma_put_req_map(vec); dprintk("svcrdma: send_reply returns %d\n", ret); return ret; error: + svc_rdma_put_req_map(vec); svc_rdma_put_context(ctxt, 0); put_page(res_page); return ret; diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c index ae90758..fc86338 100644 --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c @@ -387,10 +387,13 @@ static void sq_cq_reap(struct svcxprt_rdma *xprt) switch (ctxt->wr_op) { case IB_WR_SEND: - case IB_WR_RDMA_WRITE: svc_rdma_put_context(ctxt, 1); break; + case IB_WR_RDMA_WRITE: + svc_rdma_put_context(ctxt, 0); + break; + case IB_WR_RDMA_READ: if (test_bit(RDMACTXT_F_LAST_CTXT, &ctxt->flags)) { struct svc_rdma_op_ctxt *read_hdr = ctxt->read_hdr; ^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH 03/11] svcrdma: Use reply and chunk map for RDMA_READ processing 2008-07-03 2:27 ` [PATCH 02/11] svcrdma: Use RPC reply map for RDMA_WRITE processing Tom Tucker @ 2008-07-03 2:27 ` Tom Tucker 2008-07-03 2:27 ` [PATCH 04/11] svcrdma: Move the DMA unmap logic to the CQ handler Tom Tucker 0 siblings, 1 reply; 20+ messages in thread From: Tom Tucker @ 2008-07-03 2:27 UTC (permalink / raw) To: bfields; +Cc: linux-nfs, Tom Tucker Modify the RDMA_READ processing to use the reply and chunk list mapping data types. Also add a special purpose 'hdr_count' field in in the context to hold the header page count instead of overloading the SGE length field and corrupting the DMA map length. Signed-off-by: Tom Tucker <tom@opengridcomputing.com> --- include/linux/sunrpc/svc_rdma.h | 1 + net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 83 ++++++++++++++----------------- 2 files changed, 39 insertions(+), 45 deletions(-) diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h index bd8749c..fd5e8a1 100644 --- a/include/linux/sunrpc/svc_rdma.h +++ b/include/linux/sunrpc/svc_rdma.h @@ -72,6 +72,7 @@ extern atomic_t rdma_stat_sq_prod; */ struct svc_rdma_op_ctxt { struct svc_rdma_op_ctxt *read_hdr; + int hdr_count; struct list_head free_list; struct xdr_buf arg; struct list_head dto_q; diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c index 06ab484..d25971b 100644 --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c @@ -112,11 +112,6 @@ static void rdma_build_arg_xdr(struct svc_rqst *rqstp, rqstp->rq_arg.tail[0].iov_len = 0; } -struct chunk_sge { - int start; /* sge no for this chunk */ - int count; /* sge count for this chunk */ -}; - /* Encode a read-chunk-list as an array of IB SGE * * Assumptions: @@ -134,8 +129,8 @@ static int rdma_rcl_to_sge(struct svcxprt_rdma *xprt, struct svc_rqst *rqstp, struct svc_rdma_op_ctxt *head, struct rpcrdma_msg *rmsgp, - struct ib_sge *sge, - struct chunk_sge *ch_sge_ary, + struct svc_rdma_req_map *rpl_map, + struct svc_rdma_req_map *chl_map, int ch_count, int byte_count) { @@ -156,22 +151,18 @@ static int rdma_rcl_to_sge(struct svcxprt_rdma *xprt, head->arg.head[0] = rqstp->rq_arg.head[0]; head->arg.tail[0] = rqstp->rq_arg.tail[0]; head->arg.pages = &head->pages[head->count]; - head->sge[0].length = head->count; /* save count of hdr pages */ + head->hdr_count = head->count; /* save count of hdr pages */ head->arg.page_base = 0; head->arg.page_len = ch_bytes; head->arg.len = rqstp->rq_arg.len + ch_bytes; head->arg.buflen = rqstp->rq_arg.buflen + ch_bytes; head->count++; - ch_sge_ary[0].start = 0; + chl_map->ch[0].start = 0; while (byte_count) { + rpl_map->sge[sge_no].iov_base = + page_address(rqstp->rq_arg.pages[page_no]) + page_off; sge_bytes = min_t(int, PAGE_SIZE-page_off, ch_bytes); - sge[sge_no].addr = - ib_dma_map_page(xprt->sc_cm_id->device, - rqstp->rq_arg.pages[page_no], - page_off, sge_bytes, - DMA_FROM_DEVICE); - sge[sge_no].length = sge_bytes; - sge[sge_no].lkey = xprt->sc_phys_mr->lkey; + rpl_map->sge[sge_no].iov_len = sge_bytes; /* * Don't bump head->count here because the same page * may be used by multiple SGE. @@ -187,11 +178,11 @@ static int rdma_rcl_to_sge(struct svcxprt_rdma *xprt, * SGE, move to the next SGE */ if (ch_bytes == 0) { - ch_sge_ary[ch_no].count = - sge_no - ch_sge_ary[ch_no].start; + chl_map->ch[ch_no].count = + sge_no - chl_map->ch[ch_no].start; ch_no++; ch++; - ch_sge_ary[ch_no].start = sge_no; + chl_map->ch[ch_no].start = sge_no; ch_bytes = ch->rc_target.rs_length; /* If bytes remaining account for next chunk */ if (byte_count) { @@ -220,18 +211,24 @@ static int rdma_rcl_to_sge(struct svcxprt_rdma *xprt, return sge_no; } -static void rdma_set_ctxt_sge(struct svc_rdma_op_ctxt *ctxt, - struct ib_sge *sge, +static void rdma_set_ctxt_sge(struct svcxprt_rdma *xprt, + struct svc_rdma_op_ctxt *ctxt, + struct kvec *vec, u64 *sgl_offset, int count) { int i; ctxt->count = count; + ctxt->direction = DMA_FROM_DEVICE; for (i = 0; i < count; i++) { - ctxt->sge[i].addr = sge[i].addr; - ctxt->sge[i].length = sge[i].length; - *sgl_offset = *sgl_offset + sge[i].length; + ctxt->sge[i].addr = + ib_dma_map_single(xprt->sc_cm_id->device, + vec[i].iov_base, vec[i].iov_len, + DMA_FROM_DEVICE); + ctxt->sge[i].length = vec[i].iov_len; + ctxt->sge[i].lkey = xprt->sc_phys_mr->lkey; + *sgl_offset = *sgl_offset + vec[i].iov_len; } } @@ -282,34 +279,29 @@ static int rdma_read_xdr(struct svcxprt_rdma *xprt, struct ib_send_wr read_wr; int err = 0; int ch_no; - struct ib_sge *sge; int ch_count; int byte_count; int sge_count; u64 sgl_offset; struct rpcrdma_read_chunk *ch; struct svc_rdma_op_ctxt *ctxt = NULL; - struct svc_rdma_op_ctxt *tmp_sge_ctxt; - struct svc_rdma_op_ctxt *tmp_ch_ctxt; - struct chunk_sge *ch_sge_ary; + struct svc_rdma_req_map *rpl_map; + struct svc_rdma_req_map *chl_map; /* If no read list is present, return 0 */ ch = svc_rdma_get_read_chunk(rmsgp); if (!ch) return 0; - /* Allocate temporary contexts to keep SGE */ - BUG_ON(sizeof(struct ib_sge) < sizeof(struct chunk_sge)); - tmp_sge_ctxt = svc_rdma_get_context(xprt); - sge = tmp_sge_ctxt->sge; - tmp_ch_ctxt = svc_rdma_get_context(xprt); - ch_sge_ary = (struct chunk_sge *)tmp_ch_ctxt->sge; + /* Allocate temporary reply and chunk maps */ + rpl_map = svc_rdma_get_req_map(); + chl_map = svc_rdma_get_req_map(); svc_rdma_rcl_chunk_counts(ch, &ch_count, &byte_count); if (ch_count > RPCSVC_MAXPAGES) return -EINVAL; sge_count = rdma_rcl_to_sge(xprt, rqstp, hdr_ctxt, rmsgp, - sge, ch_sge_ary, + rpl_map, chl_map, ch_count, byte_count); sgl_offset = 0; ch_no = 0; @@ -331,14 +323,15 @@ next_sge: read_wr.wr.rdma.remote_addr = get_unaligned(&(ch->rc_target.rs_offset)) + sgl_offset; - read_wr.sg_list = &sge[ch_sge_ary[ch_no].start]; + read_wr.sg_list = ctxt->sge; read_wr.num_sge = - rdma_read_max_sge(xprt, ch_sge_ary[ch_no].count); - rdma_set_ctxt_sge(ctxt, &sge[ch_sge_ary[ch_no].start], + rdma_read_max_sge(xprt, chl_map->ch[ch_no].count); + rdma_set_ctxt_sge(xprt, ctxt, + &rpl_map->sge[chl_map->ch[ch_no].start], &sgl_offset, read_wr.num_sge); if (((ch+1)->rc_discrim == 0) && - (read_wr.num_sge == ch_sge_ary[ch_no].count)) { + (read_wr.num_sge == chl_map->ch[ch_no].count)) { /* * Mark the last RDMA_READ with a bit to * indicate all RPC data has been fetched from @@ -358,9 +351,9 @@ next_sge: } atomic_inc(&rdma_stat_read); - if (read_wr.num_sge < ch_sge_ary[ch_no].count) { - ch_sge_ary[ch_no].count -= read_wr.num_sge; - ch_sge_ary[ch_no].start += read_wr.num_sge; + if (read_wr.num_sge < chl_map->ch[ch_no].count) { + chl_map->ch[ch_no].count -= read_wr.num_sge; + chl_map->ch[ch_no].start += read_wr.num_sge; goto next_sge; } sgl_offset = 0; @@ -368,8 +361,8 @@ next_sge: } out: - svc_rdma_put_context(tmp_sge_ctxt, 0); - svc_rdma_put_context(tmp_ch_ctxt, 0); + svc_rdma_put_req_map(rpl_map); + svc_rdma_put_req_map(chl_map); /* Detach arg pages. svc_recv will replenish them */ for (ch_no = 0; &rqstp->rq_pages[ch_no] < rqstp->rq_respages; ch_no++) @@ -399,7 +392,7 @@ static int rdma_read_complete(struct svc_rqst *rqstp, rqstp->rq_pages[page_no] = head->pages[page_no]; } /* Point rq_arg.pages past header */ - rqstp->rq_arg.pages = &rqstp->rq_pages[head->sge[0].length]; + rqstp->rq_arg.pages = &rqstp->rq_pages[head->hdr_count]; rqstp->rq_arg.page_len = head->arg.page_len; rqstp->rq_arg.page_base = head->arg.page_base; ^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH 04/11] svcrdma: Move the DMA unmap logic to the CQ handler 2008-07-03 2:27 ` [PATCH 03/11] svcrdma: Use reply and chunk map for RDMA_READ processing Tom Tucker @ 2008-07-03 2:27 ` Tom Tucker 2008-07-03 2:27 ` [PATCH 05/11] svcrdma: Add dma map count and WARN_ON Tom Tucker 0 siblings, 1 reply; 20+ messages in thread From: Tom Tucker @ 2008-07-03 2:27 UTC (permalink / raw) To: bfields; +Cc: linux-nfs, Tom Tucker Separate DMA unmap from context destruction and perform DMA unmapping in the SQ/RQ CQ reap functions. This is necessary to support software based RDMA implementations that actually copy the data in their ib_dma_unmap callback functions and architectures that don't have cache coherent I/O busses. Signed-off-by: Tom Tucker <tom@opengridcomputing.com> --- net/sunrpc/xprtrdma/svc_rdma_transport.c | 20 ++++++++++++++------ 1 files changed, 14 insertions(+), 6 deletions(-) diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c index fc86338..7e8ee66 100644 --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c @@ -150,6 +150,18 @@ struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *xprt) return ctxt; } +static void svc_rdma_unmap_dma(struct svc_rdma_op_ctxt *ctxt) +{ + struct svcxprt_rdma *xprt = ctxt->xprt; + int i; + for (i = 0; i < ctxt->count && ctxt->sge[i].length; i++) { + ib_dma_unmap_single(xprt->sc_cm_id->device, + ctxt->sge[i].addr, + ctxt->sge[i].length, + ctxt->direction); + } +} + void svc_rdma_put_context(struct svc_rdma_op_ctxt *ctxt, int free_pages) { struct svcxprt_rdma *xprt; @@ -161,12 +173,6 @@ void svc_rdma_put_context(struct svc_rdma_op_ctxt *ctxt, int free_pages) for (i = 0; i < ctxt->count; i++) put_page(ctxt->pages[i]); - for (i = 0; i < ctxt->count; i++) - ib_dma_unmap_single(xprt->sc_cm_id->device, - ctxt->sge[i].addr, - ctxt->sge[i].length, - ctxt->direction); - spin_lock_bh(&xprt->sc_ctxt_lock); list_add(&ctxt->free_list, &xprt->sc_ctxt_free); spin_unlock_bh(&xprt->sc_ctxt_lock); @@ -328,6 +334,7 @@ static void rq_cq_reap(struct svcxprt_rdma *xprt) ctxt = (struct svc_rdma_op_ctxt *)(unsigned long)wc.wr_id; ctxt->wc_status = wc.status; ctxt->byte_len = wc.byte_len; + svc_rdma_unmap_dma(ctxt); if (wc.status != IB_WC_SUCCESS) { /* Close the transport */ dprintk("svcrdma: transport closing putting ctxt %p\n", ctxt); @@ -377,6 +384,7 @@ static void sq_cq_reap(struct svcxprt_rdma *xprt) ctxt = (struct svc_rdma_op_ctxt *)(unsigned long)wc.wr_id; xprt = ctxt->xprt; + svc_rdma_unmap_dma(ctxt); if (wc.status != IB_WC_SUCCESS) /* Close the transport */ set_bit(XPT_CLOSE, &xprt->sc_xprt.xpt_flags); ^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH 05/11] svcrdma: Add dma map count and WARN_ON 2008-07-03 2:27 ` [PATCH 04/11] svcrdma: Move the DMA unmap logic to the CQ handler Tom Tucker @ 2008-07-03 2:27 ` Tom Tucker 2008-07-03 2:27 ` [PATCH 06/11] svcrdma: Remove unneeded spin locks from __svc_rdma_free Tom Tucker 0 siblings, 1 reply; 20+ messages in thread From: Tom Tucker @ 2008-07-03 2:27 UTC (permalink / raw) To: bfields; +Cc: linux-nfs, Tom Tucker Add a dma map count in order to verify that all DMA mapping resources have been freed when the transport is closed. Signed-off-by: Tom Tucker <tom@opengridcomputing.com> --- include/linux/sunrpc/svc_rdma.h | 1 + net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 1 + net/sunrpc/xprtrdma/svc_rdma_sendto.c | 3 +++ net/sunrpc/xprtrdma/svc_rdma_transport.c | 5 +++++ 4 files changed, 10 insertions(+), 0 deletions(-) diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h index fd5e8a1..ab93afc 100644 --- a/include/linux/sunrpc/svc_rdma.h +++ b/include/linux/sunrpc/svc_rdma.h @@ -130,6 +130,7 @@ struct svcxprt_rdma { struct ib_pd *sc_pd; + atomic_t sc_dma_used; atomic_t sc_ctxt_used; struct list_head sc_ctxt_free; int sc_ctxt_cnt; diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c index d25971b..b4b17f4 100644 --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c @@ -222,6 +222,7 @@ static void rdma_set_ctxt_sge(struct svcxprt_rdma *xprt, ctxt->count = count; ctxt->direction = DMA_FROM_DEVICE; for (i = 0; i < count; i++) { + atomic_inc(&xprt->sc_dma_used); ctxt->sge[i].addr = ib_dma_map_single(xprt->sc_cm_id->device, vec[i].iov_base, vec[i].iov_len, diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c index bdc11a3..a19b22b 100644 --- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c +++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c @@ -163,6 +163,7 @@ static int send_write(struct svcxprt_rdma *xprt, struct svc_rqst *rqstp, sge_bytes = min((size_t)bc, (size_t)(vec->sge[xdr_sge_no].iov_len-sge_off)); sge[sge_no].length = sge_bytes; + atomic_inc(&xprt->sc_dma_used); sge[sge_no].addr = ib_dma_map_single(xprt->sc_cm_id->device, (void *) @@ -385,6 +386,7 @@ static int send_reply(struct svcxprt_rdma *rdma, ctxt->count = 1; /* Prepare the SGE for the RPCRDMA Header */ + atomic_inc(&rdma->sc_dma_used); ctxt->sge[0].addr = ib_dma_map_page(rdma->sc_cm_id->device, page, 0, PAGE_SIZE, DMA_TO_DEVICE); @@ -396,6 +398,7 @@ static int send_reply(struct svcxprt_rdma *rdma, for (sge_no = 1; byte_count && sge_no < vec->count; sge_no++) { sge_bytes = min_t(size_t, vec->sge[sge_no].iov_len, byte_count); byte_count -= sge_bytes; + atomic_inc(&rdma->sc_dma_used); ctxt->sge[sge_no].addr = ib_dma_map_single(rdma->sc_cm_id->device, vec->sge[sge_no].iov_base, diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c index 7e8ee66..6fddd58 100644 --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c @@ -155,6 +155,7 @@ static void svc_rdma_unmap_dma(struct svc_rdma_op_ctxt *ctxt) struct svcxprt_rdma *xprt = ctxt->xprt; int i; for (i = 0; i < ctxt->count && ctxt->sge[i].length; i++) { + atomic_dec(&xprt->sc_dma_used); ib_dma_unmap_single(xprt->sc_cm_id->device, ctxt->sge[i].addr, ctxt->sge[i].length, @@ -519,6 +520,7 @@ static struct svcxprt_rdma *rdma_create_xprt(struct svc_serv *serv, cma_xprt->sc_max_requests = svcrdma_max_requests; cma_xprt->sc_sq_depth = svcrdma_max_requests * RPCRDMA_SQ_DEPTH_MULT; atomic_set(&cma_xprt->sc_sq_count, 0); + atomic_set(&cma_xprt->sc_ctxt_used, 0); if (!listener) { int reqs = cma_xprt->sc_max_requests; @@ -569,6 +571,7 @@ int svc_rdma_post_recv(struct svcxprt_rdma *xprt) BUG_ON(sge_no >= xprt->sc_max_sge); page = svc_rdma_get_page(); ctxt->pages[sge_no] = page; + atomic_inc(&xprt->sc_dma_used); pa = ib_dma_map_page(xprt->sc_cm_id->device, page, 0, PAGE_SIZE, DMA_FROM_DEVICE); @@ -1049,6 +1052,7 @@ static void __svc_rdma_free(struct work_struct *work) /* Warn if we leaked a resource or under-referenced */ WARN_ON(atomic_read(&rdma->sc_ctxt_used) != 0); + WARN_ON(atomic_read(&rdma->sc_dma_used) != 0); /* Destroy the QP if present (not a listener) */ if (rdma->sc_qp && !IS_ERR(rdma->sc_qp)) @@ -1169,6 +1173,7 @@ void svc_rdma_send_error(struct svcxprt_rdma *xprt, struct rpcrdma_msg *rmsgp, length = svc_rdma_xdr_encode_error(xprt, rmsgp, err, va); /* Prepare SGE for local address */ + atomic_inc(&xprt->sc_dma_used); sge.addr = ib_dma_map_page(xprt->sc_cm_id->device, p, 0, PAGE_SIZE, DMA_FROM_DEVICE); sge.lkey = xprt->sc_phys_mr->lkey; ^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH 06/11] svcrdma: Remove unneeded spin locks from __svc_rdma_free 2008-07-03 2:27 ` [PATCH 05/11] svcrdma: Add dma map count and WARN_ON Tom Tucker @ 2008-07-03 2:27 ` Tom Tucker 2008-07-03 2:27 ` [PATCH 07/11] svcrdma: Remove unused wait q from svcrdma_xprt structure Tom Tucker 0 siblings, 1 reply; 20+ messages in thread From: Tom Tucker @ 2008-07-03 2:27 UTC (permalink / raw) To: bfields; +Cc: linux-nfs, Tom Tucker At the time __svc_rdma_free is called, we are guaranteed that all references to this transport are gone. There is, therefore, no need to protect the resource lists with a spin lock. Signed-off-by: Tom Tucker <tom@opengridcomputing.com> --- net/sunrpc/xprtrdma/svc_rdma_transport.c | 4 ---- 1 files changed, 0 insertions(+), 4 deletions(-) diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c index 6fddd58..7647789 100644 --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c @@ -1027,7 +1027,6 @@ static void __svc_rdma_free(struct work_struct *work) * cm_id because the device ptr is needed to unmap the dma in * svc_rdma_put_context. */ - spin_lock_bh(&rdma->sc_read_complete_lock); while (!list_empty(&rdma->sc_read_complete_q)) { struct svc_rdma_op_ctxt *ctxt; ctxt = list_entry(rdma->sc_read_complete_q.next, @@ -1036,10 +1035,8 @@ static void __svc_rdma_free(struct work_struct *work) list_del_init(&ctxt->dto_q); svc_rdma_put_context(ctxt, 1); } - spin_unlock_bh(&rdma->sc_read_complete_lock); /* Destroy queued, but not processed recv completions */ - spin_lock_bh(&rdma->sc_rq_dto_lock); while (!list_empty(&rdma->sc_rq_dto_q)) { struct svc_rdma_op_ctxt *ctxt; ctxt = list_entry(rdma->sc_rq_dto_q.next, @@ -1048,7 +1045,6 @@ static void __svc_rdma_free(struct work_struct *work) list_del_init(&ctxt->dto_q); svc_rdma_put_context(ctxt, 1); } - spin_unlock_bh(&rdma->sc_rq_dto_lock); /* Warn if we leaked a resource or under-referenced */ WARN_ON(atomic_read(&rdma->sc_ctxt_used) != 0); ^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH 07/11] svcrdma: Remove unused wait q from svcrdma_xprt structure 2008-07-03 2:27 ` [PATCH 06/11] svcrdma: Remove unneeded spin locks from __svc_rdma_free Tom Tucker @ 2008-07-03 2:27 ` Tom Tucker 2008-07-03 2:27 ` [PATCH 08/11] svcrdma: Limit ORD based on client's advertised IRD Tom Tucker 0 siblings, 1 reply; 20+ messages in thread From: Tom Tucker @ 2008-07-03 2:27 UTC (permalink / raw) To: bfields; +Cc: linux-nfs, Tom Tucker The sc_read_wait queue head is no longer used. Remove it. Signed-off-by: Tom Tucker <tom@opengridcomputing.com> --- include/linux/sunrpc/svc_rdma.h | 1 - 1 files changed, 0 insertions(+), 1 deletions(-) diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h index ab93afc..d8d74c4 100644 --- a/include/linux/sunrpc/svc_rdma.h +++ b/include/linux/sunrpc/svc_rdma.h @@ -119,7 +119,6 @@ struct svcxprt_rdma { struct rdma_cm_id *sc_cm_id; /* RDMA connection id */ struct list_head sc_accept_q; /* Conn. waiting accept */ int sc_ord; /* RDMA read limit */ - wait_queue_head_t sc_read_wait; int sc_max_sge; int sc_sq_depth; /* Depth of SQ */ ^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH 08/11] svcrdma: Limit ORD based on client's advertised IRD 2008-07-03 2:27 ` [PATCH 07/11] svcrdma: Remove unused wait q from svcrdma_xprt structure Tom Tucker @ 2008-07-03 2:27 ` Tom Tucker 2008-07-03 2:27 ` [PATCH 09/11] svcrdma: Add flush_scheduled_work to module exit function Tom Tucker 0 siblings, 1 reply; 20+ messages in thread From: Tom Tucker @ 2008-07-03 2:27 UTC (permalink / raw) To: bfields; +Cc: linux-nfs, Tom Tucker When adapters have differing IRD limits, the RDMA transport will fail to connect properly. The RDMA transport should use the client's advertised inbound read limit when computing its outbound read limit. For iWARP transports, there is currently no standard for exchanging IRD/ORD during connection establishment so the 'responder_resources' field in the connect event is the local device's limit. The RDMA transport can be configured to use a smaller ORD by writing the desired number to the /proc/sys/sunrpc/svc_rdma/max_outbound_read_requests file. Signed-off-by: Tom Tucker <tom@opengridcomputing.com> --- net/sunrpc/xprtrdma/svc_rdma_transport.c | 16 ++++++++++++---- 1 files changed, 12 insertions(+), 4 deletions(-) diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c index 7647789..80104f4 100644 --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c @@ -606,7 +606,7 @@ int svc_rdma_post_recv(struct svcxprt_rdma *xprt) * will call the recvfrom method on the listen xprt which will accept the new * connection. */ -static void handle_connect_req(struct rdma_cm_id *new_cma_id) +static void handle_connect_req(struct rdma_cm_id *new_cma_id, size_t client_ird) { struct svcxprt_rdma *listen_xprt = new_cma_id->context; struct svcxprt_rdma *newxprt; @@ -623,6 +623,9 @@ static void handle_connect_req(struct rdma_cm_id *new_cma_id) dprintk("svcrdma: Creating newxprt=%p, cm_id=%p, listenxprt=%p\n", newxprt, newxprt->sc_cm_id, listen_xprt); + /* Save client advertised inbound read limit for use later in accept. */ + newxprt->sc_ord = client_ird; + /* Set the local and remote addresses in the transport */ sa = (struct sockaddr *)&newxprt->sc_cm_id->route.addr.dst_addr; svc_xprt_set_remote(&newxprt->sc_xprt, sa, svc_addr_len(sa)); @@ -659,7 +662,8 @@ static int rdma_listen_handler(struct rdma_cm_id *cma_id, case RDMA_CM_EVENT_CONNECT_REQUEST: dprintk("svcrdma: Connect request on cma_id=%p, xprt = %p, " "event=%d\n", cma_id, cma_id->context, event->event); - handle_connect_req(cma_id); + handle_connect_req(cma_id, + event->param.conn.responder_resources); break; case RDMA_CM_EVENT_ESTABLISHED: @@ -833,8 +837,12 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt) (size_t)svcrdma_max_requests); newxprt->sc_sq_depth = RPCRDMA_SQ_DEPTH_MULT * newxprt->sc_max_requests; - newxprt->sc_ord = min((size_t)devattr.max_qp_rd_atom, - (size_t)svcrdma_ord); + /* + * Limit ORD based on client limit, local device limit, and + * configured svcrdma limit. + */ + newxprt->sc_ord = min_t(size_t, devattr.max_qp_rd_atom, newxprt->sc_ord); + newxprt->sc_ord = min_t(size_t, svcrdma_ord, newxprt->sc_ord); newxprt->sc_pd = ib_alloc_pd(newxprt->sc_cm_id->device); if (IS_ERR(newxprt->sc_pd)) { ^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH 09/11] svcrdma: Add flush_scheduled_work to module exit function 2008-07-03 2:27 ` [PATCH 08/11] svcrdma: Limit ORD based on client's advertised IRD Tom Tucker @ 2008-07-03 2:27 ` Tom Tucker 2008-07-03 2:27 ` [PATCH 10/11] svcrdma: Create a kmem cache for the WR contexts Tom Tucker 0 siblings, 1 reply; 20+ messages in thread From: Tom Tucker @ 2008-07-03 2:27 UTC (permalink / raw) To: bfields; +Cc: linux-nfs, Tom Tucker Make certain all transports pending free are flushed from the wq before unloading the module. Signed-off-by: Tom Tucker <tom@opengridcomputing.com> --- net/sunrpc/xprtrdma/svc_rdma.c | 1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/net/sunrpc/xprtrdma/svc_rdma.c b/net/sunrpc/xprtrdma/svc_rdma.c index 171f205..527acfd 100644 --- a/net/sunrpc/xprtrdma/svc_rdma.c +++ b/net/sunrpc/xprtrdma/svc_rdma.c @@ -239,6 +239,7 @@ static ctl_table svcrdma_root_table[] = { void svc_rdma_cleanup(void) { dprintk("SVCRDMA Module Removed, deregister RPC RDMA transport\n"); + flush_scheduled_work(); if (svcrdma_table_header) { unregister_sysctl_table(svcrdma_table_header); svcrdma_table_header = NULL; ^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH 10/11] svcrdma: Create a kmem cache for the WR contexts 2008-07-03 2:27 ` [PATCH 09/11] svcrdma: Add flush_scheduled_work to module exit function Tom Tucker @ 2008-07-03 2:27 ` Tom Tucker 0 siblings, 0 replies; 20+ messages in thread From: Tom Tucker @ 2008-07-03 2:27 UTC (permalink / raw) To: bfields; +Cc: linux-nfs, Tom Tucker Create a kmem cache to hold WR contexts. Next we will convert the WR context get and put services to use this kmem cache. Signed-off-by: Tom Tucker <tom@opengridcomputing.com> --- net/sunrpc/xprtrdma/svc_rdma.c | 23 +++++++++++++++++++---- 1 files changed, 19 insertions(+), 4 deletions(-) diff --git a/net/sunrpc/xprtrdma/svc_rdma.c b/net/sunrpc/xprtrdma/svc_rdma.c index 527acfd..8710117 100644 --- a/net/sunrpc/xprtrdma/svc_rdma.c +++ b/net/sunrpc/xprtrdma/svc_rdma.c @@ -69,8 +69,9 @@ atomic_t rdma_stat_rq_prod; atomic_t rdma_stat_sq_poll; atomic_t rdma_stat_sq_prod; -/* Temporary NFS request map cache */ +/* Temporary NFS request map and context caches */ struct kmem_cache *svc_rdma_map_cachep; +struct kmem_cache *svc_rdma_ctxt_cachep; /* * This function implements reading and resetting an atomic_t stat @@ -246,6 +247,7 @@ void svc_rdma_cleanup(void) } svc_unreg_xprt_class(&svc_rdma_class); kmem_cache_destroy(svc_rdma_map_cachep); + kmem_cache_destroy(svc_rdma_ctxt_cachep); } int svc_rdma_init(void) @@ -268,14 +270,27 @@ int svc_rdma_init(void) NULL); if (!svc_rdma_map_cachep) { printk(KERN_INFO "Could not allocate map cache.\n"); - goto err; + goto err0; + } + + /* Create the temporary context cache */ + svc_rdma_ctxt_cachep = + kmem_cache_create("svc_rdma_ctxt_cache", + sizeof(struct svc_rdma_op_ctxt), + 0, + SLAB_HWCACHE_ALIGN, + NULL); + if (!svc_rdma_ctxt_cachep) { + printk(KERN_INFO "Could not allocate WR ctxt cache.\n"); + goto err1; } /* Register RDMA with the SVC transport switch */ svc_reg_xprt_class(&svc_rdma_class); return 0; - - err: + err1: + kmem_cache_destroy(svc_rdma_map_cachep); + err0: unregister_sysctl_table(svcrdma_table_header); return -ENOMEM; } ^ permalink raw reply related [flat|nested] 20+ messages in thread
end of thread, other threads:[~2008-07-03 2:27 UTC | newest]
Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <12120836962076-git-send-email-tom@opengridcomputing.com>
2008-05-29 22:10 ` [PATCH 0/11] svcrdma: WR context management bug fixes and cleanup J. Bruce Fields
2008-05-29 22:25 ` Tom Tucker
[not found] ` <1212099937.22478.3.camel-SMNkleLxa3ZimH42XvhXlA@public.gmane.org>
2008-05-29 22:26 ` Roland Dreier
[not found] ` <12120836962324-git-send-email-tom@opengridcomputing.com>
2008-06-16 19:48 ` [PATCH 01/11] svcrdma: Add a type for keeping NFS RPC mapping J. Bruce Fields
2008-06-21 16:31 ` Tom Tucker
2008-06-23 18:27 ` J. Bruce Fields
2008-06-24 2:58 ` Tom Tucker
2008-06-24 19:58 ` J. Bruce Fields
2008-06-24 20:31 ` Benny Halevy
2008-06-24 20:38 ` Trond Myklebust
[not found] ` <12120836963727-git-send-email-tom@opengridcomputing.com>
2008-06-16 21:04 ` [PATCH 02/11] svcrdma: Use RPC reply map for RDMA_WRITE processing J. Bruce Fields
2008-06-21 16:26 ` Tom Tucker
2008-06-23 18:21 ` J. Bruce Fields
2008-06-24 2:29 ` Tom Tucker
2008-06-21 16:51 ` Tom Tucker
2008-06-23 18:51 ` J. Bruce Fields
2008-06-24 3:02 ` Tom Tucker
[not found] ` <1212083697950-git-send-email-tom@opengridcomputing.com>
[not found] ` <1212083697236-git-send-email-tom@opengridcomputing.com>
[not found] ` <12120836973390-git-send-email-tom@opengridcomputing.com>
[not found] ` <12120836973638-git-send-email-tom@opengridcomputing.com>
[not found] ` <12120836973072-git-send-email-tom@opengridcomputing.com>
[not found] ` <12120836972503-git-send-email-tom@opengridcomputing.com>
[not found] ` <12120836973166-git-send-email-tom@opengridcomputing.com>
[not found] ` <12120836972648-git-send-email-tom@opengridcomputing.com>
2008-06-16 21:24 ` [PATCH 10/11] svcrdma: Create a kmem cache for the WR contexts J. Bruce Fields
2008-06-21 17:08 ` Tom Tucker
2008-07-03 2:27 [PATCH 00/11] svcrdma: WR context management bug fixes and cleanup Tom Tucker
2008-07-03 2:27 ` [PATCH 01/11] svcrdma: Add a type for keeping NFS RPC mapping Tom Tucker
2008-07-03 2:27 ` [PATCH 02/11] svcrdma: Use RPC reply map for RDMA_WRITE processing Tom Tucker
2008-07-03 2:27 ` [PATCH 03/11] svcrdma: Use reply and chunk map for RDMA_READ processing Tom Tucker
2008-07-03 2:27 ` [PATCH 04/11] svcrdma: Move the DMA unmap logic to the CQ handler Tom Tucker
2008-07-03 2:27 ` [PATCH 05/11] svcrdma: Add dma map count and WARN_ON Tom Tucker
2008-07-03 2:27 ` [PATCH 06/11] svcrdma: Remove unneeded spin locks from __svc_rdma_free Tom Tucker
2008-07-03 2:27 ` [PATCH 07/11] svcrdma: Remove unused wait q from svcrdma_xprt structure Tom Tucker
2008-07-03 2:27 ` [PATCH 08/11] svcrdma: Limit ORD based on client's advertised IRD Tom Tucker
2008-07-03 2:27 ` [PATCH 09/11] svcrdma: Add flush_scheduled_work to module exit function Tom Tucker
2008-07-03 2:27 ` [PATCH 10/11] svcrdma: Create a kmem cache for the WR contexts Tom Tucker
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox