Linux NFS development
 help / color / mirror / Atom feed
* Re: [PATCH 0/11] svcrdma: WR context management bug fixes and cleanup
       [not found] <12120836962076-git-send-email-tom@opengridcomputing.com>
@ 2008-05-29 22:10 ` J. Bruce Fields
  2008-05-29 22:25   ` Tom Tucker
       [not found] ` <12120836962324-git-send-email-tom@opengridcomputing.com>
  1 sibling, 1 reply; 19+ messages in thread
From: J. Bruce Fields @ 2008-05-29 22:10 UTC (permalink / raw)
  To: Tom Tucker; +Cc: linux-nfs

On Thu, May 29, 2008 at 12:54:45PM -0500, Tom Tucker wrote:
> Bruce:
> 
> This set of patches fixes bugs related to DMA mapping and unmapping and
> in the process redesigns and cleans up the way that WR contexts are managed.
> 
> In addition to the bug fixes, there are three major changes:
> 
> - The data structure that maps client side memory to server side
>   memory has been separated from the WR context. This makes for cleaner
>   type checking and a smaller memory footprint for the RDMA transport.
> 
> - The mapping and unmapping of DMA is performed independently from the
>   allocation and deallocation of the WR context. This is to allow the
>   unmapping to occur early on recv where the unmapping of the DMA needs
>   to be done prior to looking at the data and therefore before 
>   deallocating the WR context.
> 
> - The WR context cache now uses a kmem_cache and is shared across
>   all transport instances. This allows contexts to be shared between
>   mounts and idle mounts don't consume context memory.
> 
> This patchset is intended for 2.6.27 and is based on 2.6.26-rc4. These patches
> are also available at the following git url:

OK, thanks.  Bug me if I haven't commented or applied these by next
week.

Is there anything else remaining for 2.6.26 (as opposed to .27)?

--b.

> 
> git://git.linux-nfs.org/projects/tomtucker/xprt-switch-2.6.git
> 
> [PATCH 01/11] svcrdma: Add a type for keeping NFS request mapping
> 
>  include/linux/sunrpc/svc_rdma.h          |   27 +++++++++++++++++++++++++++
>  net/sunrpc/xprtrdma/svc_rdma.c           |   23 +++++++++++++++++++++++
>  net/sunrpc/xprtrdma/svc_rdma_transport.c |   26 ++++++++++++++++++++++++++
>  3 files changed, 76 insertions(+), 0 deletions(-)
> 
> [PATCH 02/11] svcrdma: Use rpc reply map for RDMA_WRITE processing
> 
>  net/sunrpc/xprtrdma/svc_rdma_sendto.c    |  122 ++++++++++++++----------------
>  net/sunrpc/xprtrdma/svc_rdma_transport.c |    5 +-
>  2 files changed, 62 insertions(+), 65 deletions(-)
> 
> [PATCH 03/11] svcrdma: Use reply and chunk map for RDMA_READ processing
> 
>  include/linux/sunrpc/svc_rdma.h         |    1 +
>  net/sunrpc/xprtrdma/svc_rdma_recvfrom.c |   83 ++++++++++++++-----------------
>  2 files changed, 39 insertions(+), 45 deletions(-)
> 
> [PATCH 04/11] svcrdma: Move the DMA unmap logic to the CQ handler
> 
>  net/sunrpc/xprtrdma/svc_rdma_transport.c |   20 ++++++++++++++------
>  1 files changed, 14 insertions(+), 6 deletions(-)
> 
> [PATCH 05/11] svcrdma: Add dma map count and WARN_ON
> 
>  include/linux/sunrpc/svc_rdma.h          |    1 +
>  net/sunrpc/xprtrdma/svc_rdma_recvfrom.c  |    1 +
>  net/sunrpc/xprtrdma/svc_rdma_sendto.c    |    3 +++
>  net/sunrpc/xprtrdma/svc_rdma_transport.c |    5 +++++
>  4 files changed, 10 insertions(+), 0 deletions(-)
> 
> [PATCH 06/11] svcrdma: Remove unneeded spin locks from __svc_rdma_free
> 
>  net/sunrpc/xprtrdma/svc_rdma_transport.c |    4 ----
>  1 files changed, 0 insertions(+), 4 deletions(-)
> 
> [PATCH 07/11] svcrdma: Remove unused wait q from svcrdma_xprt structure
> 
>  include/linux/sunrpc/svc_rdma.h |    1 -
>  1 files changed, 0 insertions(+), 1 deletions(-)
> 
> [PATCH 08/11] svcrdma: Limit ORD based on client's advertised IRD
> 
>  net/sunrpc/xprtrdma/svc_rdma_transport.c |   16 ++++++++++++----
>  1 files changed, 12 insertions(+), 4 deletions(-)
> 
> [PATCH 09/11] svcrdma: Add flush_scheduled_work to module exit function
> 
>  net/sunrpc/xprtrdma/svc_rdma.c |    1 +
>  1 files changed, 1 insertions(+), 0 deletions(-)
> 
> [PATCH 10/11] svcrdma: Create a kmem cache for the WR contexts
> 
>  net/sunrpc/xprtrdma/svc_rdma.c |   24 ++++++++++++++++++++----
>  1 files changed, 20 insertions(+), 4 deletions(-)
> 
> [PATCH 11/11] svcrdma: Change WR context get/put to use the kmem cache
> 
>  include/linux/sunrpc/svc_rdma.h          |    6 --
>  net/sunrpc/xprtrdma/svc_rdma_transport.c |  121 +++---------------------------
>  2 files changed, 12 insertions(+), 115 deletions(-)
> 
> Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/11] svcrdma: WR context management bug fixes and cleanup
  2008-05-29 22:10 ` [PATCH 0/11] svcrdma: WR context management bug fixes and cleanup J. Bruce Fields
@ 2008-05-29 22:25   ` Tom Tucker
       [not found]     ` <1212099937.22478.3.camel-SMNkleLxa3ZimH42XvhXlA@public.gmane.org>
  0 siblings, 1 reply; 19+ messages in thread
From: Tom Tucker @ 2008-05-29 22:25 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: linux-nfs, Roland Dreier, swise


On Thu, 2008-05-29 at 18:10 -0400, J. Bruce Fields wrote:
> On Thu, May 29, 2008 at 12:54:45PM -0500, Tom Tucker wrote:
> > Bruce:
> > 
> > This set of patches fixes bugs related to DMA mapping and unmapping and
> > in the process redesigns and cleans up the way that WR contexts are managed.
> > 
> > In addition to the bug fixes, there are three major changes:
> > 
> > - The data structure that maps client side memory to server side
> >   memory has been separated from the WR context. This makes for cleaner
> >   type checking and a smaller memory footprint for the RDMA transport.
> > 
> > - The mapping and unmapping of DMA is performed independently from the
> >   allocation and deallocation of the WR context. This is to allow the
> >   unmapping to occur early on recv where the unmapping of the DMA needs
> >   to be done prior to looking at the data and therefore before 
> >   deallocating the WR context.
> > 
> > - The WR context cache now uses a kmem_cache and is shared across
> >   all transport instances. This allows contexts to be shared between
> >   mounts and idle mounts don't consume context memory.
> > 
> > This patchset is intended for 2.6.27 and is based on 2.6.26-rc4. These patches
> > are also available at the following git url:
> 
> OK, thanks.  Bug me if I haven't commented or applied these by next
> week.
> 
> Is there anything else remaining for 2.6.26 (as opposed to .27)?

I'm inclined to say "no". But I'd like to hear if anyone out there is
dead in the water due to a bug in the 2.6.26 version.

Tom

> 
> --b.
> 
> > 
> > git://git.linux-nfs.org/projects/tomtucker/xprt-switch-2.6.git
> > 
> > [PATCH 01/11] svcrdma: Add a type for keeping NFS request mapping
> > 
> >  include/linux/sunrpc/svc_rdma.h          |   27 +++++++++++++++++++++++++++
> >  net/sunrpc/xprtrdma/svc_rdma.c           |   23 +++++++++++++++++++++++
> >  net/sunrpc/xprtrdma/svc_rdma_transport.c |   26 ++++++++++++++++++++++++++
> >  3 files changed, 76 insertions(+), 0 deletions(-)
> > 
> > [PATCH 02/11] svcrdma: Use rpc reply map for RDMA_WRITE processing
> > 
> >  net/sunrpc/xprtrdma/svc_rdma_sendto.c    |  122 ++++++++++++++----------------
> >  net/sunrpc/xprtrdma/svc_rdma_transport.c |    5 +-
> >  2 files changed, 62 insertions(+), 65 deletions(-)
> > 
> > [PATCH 03/11] svcrdma: Use reply and chunk map for RDMA_READ processing
> > 
> >  include/linux/sunrpc/svc_rdma.h         |    1 +
> >  net/sunrpc/xprtrdma/svc_rdma_recvfrom.c |   83 ++++++++++++++-----------------
> >  2 files changed, 39 insertions(+), 45 deletions(-)
> > 
> > [PATCH 04/11] svcrdma: Move the DMA unmap logic to the CQ handler
> > 
> >  net/sunrpc/xprtrdma/svc_rdma_transport.c |   20 ++++++++++++++------
> >  1 files changed, 14 insertions(+), 6 deletions(-)
> > 
> > [PATCH 05/11] svcrdma: Add dma map count and WARN_ON
> > 
> >  include/linux/sunrpc/svc_rdma.h          |    1 +
> >  net/sunrpc/xprtrdma/svc_rdma_recvfrom.c  |    1 +
> >  net/sunrpc/xprtrdma/svc_rdma_sendto.c    |    3 +++
> >  net/sunrpc/xprtrdma/svc_rdma_transport.c |    5 +++++
> >  4 files changed, 10 insertions(+), 0 deletions(-)
> > 
> > [PATCH 06/11] svcrdma: Remove unneeded spin locks from __svc_rdma_free
> > 
> >  net/sunrpc/xprtrdma/svc_rdma_transport.c |    4 ----
> >  1 files changed, 0 insertions(+), 4 deletions(-)
> > 
> > [PATCH 07/11] svcrdma: Remove unused wait q from svcrdma_xprt structure
> > 
> >  include/linux/sunrpc/svc_rdma.h |    1 -
> >  1 files changed, 0 insertions(+), 1 deletions(-)
> > 
> > [PATCH 08/11] svcrdma: Limit ORD based on client's advertised IRD
> > 
> >  net/sunrpc/xprtrdma/svc_rdma_transport.c |   16 ++++++++++++----
> >  1 files changed, 12 insertions(+), 4 deletions(-)
> > 
> > [PATCH 09/11] svcrdma: Add flush_scheduled_work to module exit function
> > 
> >  net/sunrpc/xprtrdma/svc_rdma.c |    1 +
> >  1 files changed, 1 insertions(+), 0 deletions(-)
> > 
> > [PATCH 10/11] svcrdma: Create a kmem cache for the WR contexts
> > 
> >  net/sunrpc/xprtrdma/svc_rdma.c |   24 ++++++++++++++++++++----
> >  1 files changed, 20 insertions(+), 4 deletions(-)
> > 
> > [PATCH 11/11] svcrdma: Change WR context get/put to use the kmem cache
> > 
> >  include/linux/sunrpc/svc_rdma.h          |    6 --
> >  net/sunrpc/xprtrdma/svc_rdma_transport.c |  121 +++---------------------------
> >  2 files changed, 12 insertions(+), 115 deletions(-)
> > 
> > Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
> > 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/11] svcrdma: WR context management bug fixes and cleanup
       [not found]     ` <1212099937.22478.3.camel-SMNkleLxa3ZimH42XvhXlA@public.gmane.org>
@ 2008-05-29 22:26       ` Roland Dreier
  0 siblings, 0 replies; 19+ messages in thread
From: Roland Dreier @ 2008-05-29 22:26 UTC (permalink / raw)
  To: Tom Tucker; +Cc: J. Bruce Fields, linux-nfs, swise

 > I'm inclined to say "no". But I'd like to hear if anyone out there is
 > dead in the water due to a bug in the 2.6.26 version.

To be honest I haven't had a chance to do much NFS/RDMA testing with the
current 2.6.26 tree.  But last I looked there were no major regressions
at least (still open issues I think, but definitely better than 2.6.25)

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 01/11] svcrdma: Add a type for keeping NFS RPC mapping
       [not found] ` <12120836962324-git-send-email-tom@opengridcomputing.com>
@ 2008-06-16 19:48   ` J. Bruce Fields
  2008-06-21 16:31     ` Tom Tucker
       [not found]   ` <12120836963727-git-send-email-tom@opengridcomputing.com>
  1 sibling, 1 reply; 19+ messages in thread
From: J. Bruce Fields @ 2008-06-16 19:48 UTC (permalink / raw)
  To: Tom Tucker; +Cc: linux-nfs

On Thu, May 29, 2008 at 12:54:46PM -0500, Tom Tucker wrote:
> Create a new data structure to hold the remote client address space
> to local server address space mapping.
> 
> Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
> 
> ---
>  include/linux/sunrpc/svc_rdma.h          |   27 +++++++++++++++++++++++++++
>  net/sunrpc/xprtrdma/svc_rdma.c           |   23 +++++++++++++++++++++++
>  net/sunrpc/xprtrdma/svc_rdma_transport.c |   26 ++++++++++++++++++++++++++
>  3 files changed, 76 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
> index 05eb466..bd8749c 100644
> --- a/include/linux/sunrpc/svc_rdma.h
> +++ b/include/linux/sunrpc/svc_rdma.h
> @@ -86,6 +86,31 @@ struct svc_rdma_op_ctxt {
>  	struct page *pages[RPCSVC_MAXPAGES];
>  };
>  
> +/*
> + * NFS_ requests are mapped on the client side by the chunk lists in
> + * the RPCRDMA header. During the fetching of the RPC from the client
> + * and the writing of the reply to the client, the memory in the
> + * client and the memory in the server must be mapped as contiguous
> + * vaddr/len for access by the hardware. These data strucures keep
> + * these mappings.
> + *
> + * For an RDMA_WRITE, the 'sge' maps the RPC REPLY. For RDMA_READ, the
> + * 'sge' in the svc_rdma_req_map maps the server side RPC reply and the
> + * 'ch' field maps the read-list of the RPCRDMA header to the 'sge'
> + * mapping of the reply.
> + */
> +struct svc_rdma_chunk_sge {
> +	int start;		/* sge no for this chunk */
> +	int count;		/* sge count for this chunk */
> +};
> +struct svc_rdma_req_map {
> +	unsigned long count;
> +	union {
> +		struct kvec sge[RPCSVC_MAXPAGES];
> +		struct svc_rdma_chunk_sge ch[RPCSVC_MAXPAGES];
> +	};
> +};
> +
>  #define RDMACTXT_F_LAST_CTXT	2
>  
>  struct svcxprt_rdma {
> @@ -173,6 +198,8 @@ extern int svc_rdma_post_recv(struct svcxprt_rdma *);
>  extern int svc_rdma_create_listen(struct svc_serv *, int, struct sockaddr *);
>  extern struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *);
>  extern void svc_rdma_put_context(struct svc_rdma_op_ctxt *, int);
> +extern struct svc_rdma_req_map *svc_rdma_get_req_map(void);
> +extern void svc_rdma_put_req_map(struct svc_rdma_req_map *);
>  extern void svc_sq_reap(struct svcxprt_rdma *);
>  extern void svc_rq_reap(struct svcxprt_rdma *);
>  extern struct svc_xprt_class svc_rdma_class;
> diff --git a/net/sunrpc/xprtrdma/svc_rdma.c b/net/sunrpc/xprtrdma/svc_rdma.c
> index 88c0ca2..545ea72 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma.c
> @@ -69,6 +69,9 @@ atomic_t rdma_stat_rq_prod;
>  atomic_t rdma_stat_sq_poll;
>  atomic_t rdma_stat_sq_prod;
>  
> +/* Temporary NFS request map cache */
> +struct kmem_cache *svc_rdma_map_cachep = NULL;

No need to initialize globals to NULL.

> +
>  /*
>   * This function implements reading and resetting an atomic_t stat
>   * variable through read/write to a proc file. Any write to the file
> @@ -241,6 +244,8 @@ void svc_rdma_cleanup(void)
>  		svcrdma_table_header = NULL;
>  	}
>  	svc_unreg_xprt_class(&svc_rdma_class);
> +	if (svc_rdma_map_cachep)
> +		kmem_cache_destroy(svc_rdma_map_cachep);

By design this can't be NULL (and the code presumably would have oopsed
much earlier if it were mistakenly allowed to be) so I'm inclined not to
bother checking....

>  }
>  
>  int svc_rdma_init(void)
> @@ -255,9 +260,27 @@ int svc_rdma_init(void)
>  		svcrdma_table_header =
>  			register_sysctl_table(svcrdma_root_table);
>  
> +	/* Create the temporary map cache */
> +	svc_rdma_map_cachep = kmem_cache_create("svc_rdma_map_cache",
> +						sizeof(struct svc_rdma_req_map),
> +						0,
> +						SLAB_HWCACHE_ALIGN,
> +						NULL);
> +	if (!svc_rdma_map_cachep) {
> +		printk(KERN_INFO "Could not allocate map cache.\n");
> +		goto err;
> +	}
> +
>  	/* Register RDMA with the SVC transport switch */
>  	svc_reg_xprt_class(&svc_rdma_class);
>  	return 0;
> +
> + err:
> +	if (svcrdma_table_header) {
> +		unregister_sysctl_table(svcrdma_table_header);

unregister_sysctl_table() already handles the NULL case, so you could
skip the if (svcrdma_table_header).

--b.

> +		svcrdma_table_header = NULL;
> +	}
> +	return -ENOMEM;
>  }
>  MODULE_AUTHOR("Tom Tucker <tom@opengridcomputing.com>");
>  MODULE_DESCRIPTION("SVC RDMA Transport");
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> index e132509..ae90758 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> @@ -173,6 +173,32 @@ void svc_rdma_put_context(struct svc_rdma_op_ctxt *ctxt, int free_pages)
>  	atomic_dec(&xprt->sc_ctxt_used);
>  }
>  
> +/* Temporary NFS request map cache. Created in svc_rdma.c  */
> +extern struct kmem_cache *svc_rdma_map_cachep;
> +
> +/*
> + * Temporary NFS req mappings are shared across all transport
> + * instances. These are short lived and should be bounded by the number
> + * of concurrent server threads * depth of the SQ.
> + */
> +struct svc_rdma_req_map *svc_rdma_get_req_map(void)
> +{
> +	struct svc_rdma_req_map *map;
> +	while (1) {
> +		map = kmem_cache_alloc(svc_rdma_map_cachep, GFP_KERNEL);
> +		if (map)
> +			break;
> +		schedule_timeout_uninterruptible(msecs_to_jiffies(500));
> +	}
> +	map->count = 0;
> +	return map;
> +}
> +
> +void svc_rdma_put_req_map(struct svc_rdma_req_map *map)
> +{
> +	kmem_cache_free(svc_rdma_map_cachep, map);
> +}
> +
>  /* ib_cq event handler */
>  static void cq_event_handler(struct ib_event *event, void *context)
>  {

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 02/11] svcrdma: Use RPC reply map for RDMA_WRITE processing
       [not found]   ` <12120836963727-git-send-email-tom@opengridcomputing.com>
@ 2008-06-16 21:04     ` J. Bruce Fields
  2008-06-21 16:26       ` Tom Tucker
  2008-06-21 16:51       ` Tom Tucker
       [not found]     ` <1212083697950-git-send-email-tom@opengridcomputing.com>
  1 sibling, 2 replies; 19+ messages in thread
From: J. Bruce Fields @ 2008-06-16 21:04 UTC (permalink / raw)
  To: Tom Tucker; +Cc: linux-nfs

On Thu, May 29, 2008 at 12:54:47PM -0500, Tom Tucker wrote:
> Use the new svc_rdma_req_map data type for mapping the client side memory
> to the server side memory. Move the DMA mapping to the context pointed to
> by each WR individually so that it is unmapped after the WR completes.
> 
> Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
> 
> ---
>  net/sunrpc/xprtrdma/svc_rdma_sendto.c    |  122 ++++++++++++++----------------
>  net/sunrpc/xprtrdma/svc_rdma_transport.c |    5 +-
>  2 files changed, 62 insertions(+), 65 deletions(-)
> 
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> index fb82b1b..7cd65b7 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> @@ -64,10 +64,9 @@
>   * SGE[sge_count-1]    data from xdr->tail.
>   *
>   */
> -static struct ib_sge *xdr_to_sge(struct svcxprt_rdma *xprt,
> -				 struct xdr_buf *xdr,
> -				 struct ib_sge *sge,
> -				 int *sge_count)
> +static void xdr_to_sge(struct svcxprt_rdma *xprt,
> +		       struct xdr_buf *xdr,
> +		       struct svc_rdma_req_map *vec)
>  {
>  	/* Max we need is the length of the XDR / pagesize + one for
>  	 * head + one for tail + one for RPCRDMA header
> @@ -84,14 +83,10 @@ static struct ib_sge *xdr_to_sge(struct svcxprt_rdma *xprt,
>  	sge_no = 1;

Should the arrays in the svc_rdma_req_map actually be size
RPC_SVC_MAXPAGES+1 to allow for the RPCRDMA header?

Also, the only caller of xdr_to_sge() appears to be svc_rdma_sendto(),
which is called from svc_sendto() immediately after setting:

	xb = &rqstp->rq_res;
	xb->len = xb->head[0].iov_len + 
		xb->page_len +
		xb->tail[0].iov_len;

So I think xdr_to_sge() is doing a bunch of unnecessary work to handle
the case where xdr->len could be less than that sum?

>  
>  	/* Head SGE */
> -	sge[sge_no].addr = ib_dma_map_single(xprt->sc_cm_id->device,
> -					     xdr->head[0].iov_base,
> -					     xdr->head[0].iov_len,
> -					     DMA_TO_DEVICE);
> +	vec->sge[sge_no].iov_base = xdr->head[0].iov_base;
>  	sge_bytes = min_t(u32, byte_count, xdr->head[0].iov_len);
>  	byte_count -= sge_bytes;
> -	sge[sge_no].length = sge_bytes;
> -	sge[sge_no].lkey = xprt->sc_phys_mr->lkey;
> +	vec->sge[sge_no].iov_len = sge_bytes;
>  	sge_no++;
>  
>  	/* pages SGE */
> @@ -99,16 +94,13 @@ static struct ib_sge *xdr_to_sge(struct svcxprt_rdma *xprt,
>  	page_bytes = xdr->page_len;
>  	page_off = xdr->page_base;
>  	while (byte_count && page_bytes) {
> +		vec->sge[sge_no].iov_base =
> +			page_address(xdr->pages[page_no]) + page_off;
>  		sge_bytes = min_t(u32, byte_count, (PAGE_SIZE-page_off));
> -		sge[sge_no].addr =
> -			ib_dma_map_page(xprt->sc_cm_id->device,
> -					xdr->pages[page_no], page_off,
> -					sge_bytes, DMA_TO_DEVICE);
>  		sge_bytes = min(sge_bytes, page_bytes);
>  		byte_count -= sge_bytes;
>  		page_bytes -= sge_bytes;
> -		sge[sge_no].length = sge_bytes;
> -		sge[sge_no].lkey = xprt->sc_phys_mr->lkey;
> +		vec->sge[sge_no].iov_len = sge_bytes;
>  
>  		sge_no++;
>  		page_no++;
> @@ -117,23 +109,17 @@ static struct ib_sge *xdr_to_sge(struct svcxprt_rdma *xprt,
>  
>  	/* Tail SGE */
>  	if (byte_count && xdr->tail[0].iov_len) {
> -		sge[sge_no].addr =
> -			ib_dma_map_single(xprt->sc_cm_id->device,
> -					  xdr->tail[0].iov_base,
> -					  xdr->tail[0].iov_len,
> -					  DMA_TO_DEVICE);
> +		vec->sge[sge_no].iov_base = xdr->tail[0].iov_base;
>  		sge_bytes = min_t(u32, byte_count, xdr->tail[0].iov_len);
>  		byte_count -= sge_bytes;
> -		sge[sge_no].length = sge_bytes;
> -		sge[sge_no].lkey = xprt->sc_phys_mr->lkey;
> +		vec->sge[sge_no].iov_len = sge_bytes;
>  		sge_no++;
>  	}
>  
>  	BUG_ON(sge_no > sge_max);
>  	BUG_ON(byte_count != 0);
>  
> -	*sge_count = sge_no;
> -	return sge;
> +	vec->count = sge_no;
>  }
>  
>  
> @@ -143,9 +129,8 @@ static struct ib_sge *xdr_to_sge(struct svcxprt_rdma *xprt,
>  static int send_write(struct svcxprt_rdma *xprt, struct svc_rqst *rqstp,
>  		      u32 rmr, u64 to,
>  		      u32 xdr_off, int write_len,
> -		      struct ib_sge *xdr_sge, int sge_count)
> +		      struct svc_rdma_req_map *vec)
>  {
> -	struct svc_rdma_op_ctxt *tmp_sge_ctxt;
>  	struct ib_send_wr write_wr;
>  	struct ib_sge *sge;
>  	int xdr_sge_no;
> @@ -156,23 +141,22 @@ static int send_write(struct svcxprt_rdma *xprt, struct svc_rqst *rqstp,
>  	struct svc_rdma_op_ctxt *ctxt;
>  	int ret = 0;
>  
> -	BUG_ON(sge_count > RPCSVC_MAXPAGES);
> +	BUG_ON(vec->count > RPCSVC_MAXPAGES);
>  	dprintk("svcrdma: RDMA_WRITE rmr=%x, to=%llx, xdr_off=%d, "
> -		"write_len=%d, xdr_sge=%p, sge_count=%d\n",
> +		"write_len=%d, vec->sge=%p, vec->count=%lu\n",
>  		rmr, (unsigned long long)to, xdr_off,
> -		write_len, xdr_sge, sge_count);
> +		write_len, vec->sge, vec->count);
>  
>  	ctxt = svc_rdma_get_context(xprt);
> -	ctxt->count = 0;
> -	tmp_sge_ctxt = svc_rdma_get_context(xprt);
> -	sge = tmp_sge_ctxt->sge;
> +	ctxt->direction = DMA_TO_DEVICE;
> +	sge = ctxt->sge;
>  
>  	/* Find the SGE associated with xdr_off */
> -	for (bc = xdr_off, xdr_sge_no = 1; bc && xdr_sge_no < sge_count;
> +	for (bc = xdr_off, xdr_sge_no = 1; bc && xdr_sge_no < vec->count;
>  	     xdr_sge_no++) {
> -		if (xdr_sge[xdr_sge_no].length > bc)
> +		if (vec->sge[xdr_sge_no].iov_len > bc)
>  			break;
> -		bc -= xdr_sge[xdr_sge_no].length;
> +		bc -= vec->sge[xdr_sge_no].iov_len;
>  	}
>  
>  	sge_off = bc;
> @@ -180,21 +164,27 @@ static int send_write(struct svcxprt_rdma *xprt, struct svc_rqst *rqstp,
>  	sge_no = 0;
>  
>  	/* Copy the remaining SGE */
> -	while (bc != 0 && xdr_sge_no < sge_count) {
> -		sge[sge_no].addr = xdr_sge[xdr_sge_no].addr + sge_off;
> -		sge[sge_no].lkey = xdr_sge[xdr_sge_no].lkey;
> +	while (bc != 0 && xdr_sge_no < vec->count) {
> +		sge[sge_no].lkey = xprt->sc_phys_mr->lkey;
>  		sge_bytes = min((size_t)bc,
> -				(size_t)(xdr_sge[xdr_sge_no].length-sge_off));
> +				(size_t)(vec->sge[xdr_sge_no].iov_len-sge_off));
>  		sge[sge_no].length = sge_bytes;
> -
> +		sge[sge_no].addr =
> +			ib_dma_map_single(xprt->sc_cm_id->device,
> +					  (void *)
> +					  vec->sge[xdr_sge_no].iov_base + sge_off,
> +					  sge_bytes, DMA_TO_DEVICE);
> +		if (dma_mapping_error(sge[sge_no].addr))
> +			return -EINVAL;

We leak a svc_rdma_op_ctxt on error here.

--b.

>  		sge_off = 0;
>  		sge_no++;
> +		ctxt->count++;
>  		xdr_sge_no++;
>  		bc -= sge_bytes;
>  	}
>  
>  	BUG_ON(bc != 0);
> -	BUG_ON(xdr_sge_no > sge_count);
> +	BUG_ON(xdr_sge_no > vec->count);
>  
>  	/* Prepare WRITE WR */
>  	memset(&write_wr, 0, sizeof write_wr);
> @@ -210,11 +200,10 @@ static int send_write(struct svcxprt_rdma *xprt, struct svc_rqst *rqstp,
>  	/* Post It */
>  	atomic_inc(&rdma_stat_write);
>  	if (svc_rdma_send(xprt, &write_wr)) {
> -		svc_rdma_put_context(ctxt, 1);
> +		svc_rdma_put_context(ctxt, 0);
>  		/* Fatal error, close transport */
>  		ret = -EIO;
>  	}
> -	svc_rdma_put_context(tmp_sge_ctxt, 0);
>  	return ret;
>  }
>  
> @@ -222,8 +211,7 @@ static int send_write_chunks(struct svcxprt_rdma *xprt,
>  			     struct rpcrdma_msg *rdma_argp,
>  			     struct rpcrdma_msg *rdma_resp,
>  			     struct svc_rqst *rqstp,
> -			     struct ib_sge *sge,
> -			     int sge_count)
> +			     struct svc_rdma_req_map *vec)
>  {
>  	u32 xfer_len = rqstp->rq_res.page_len + rqstp->rq_res.tail[0].iov_len;
>  	int write_len;
> @@ -269,8 +257,7 @@ static int send_write_chunks(struct svcxprt_rdma *xprt,
>  					 rs_offset + chunk_off,
>  					 xdr_off,
>  					 this_write,
> -					 sge,
> -					 sge_count);
> +					 vec);
>  			if (ret) {
>  				dprintk("svcrdma: RDMA_WRITE failed, ret=%d\n",
>  					ret);
> @@ -292,8 +279,7 @@ static int send_reply_chunks(struct svcxprt_rdma *xprt,
>  			     struct rpcrdma_msg *rdma_argp,
>  			     struct rpcrdma_msg *rdma_resp,
>  			     struct svc_rqst *rqstp,
> -			     struct ib_sge *sge,
> -			     int sge_count)
> +			     struct svc_rdma_req_map *vec)
>  {
>  	u32 xfer_len = rqstp->rq_res.len;
>  	int write_len;
> @@ -341,8 +327,7 @@ static int send_reply_chunks(struct svcxprt_rdma *xprt,
>  					 rs_offset + chunk_off,
>  					 xdr_off,
>  					 this_write,
> -					 sge,
> -					 sge_count);
> +					 vec);
>  			if (ret) {
>  				dprintk("svcrdma: RDMA_WRITE failed, ret=%d\n",
>  					ret);
> @@ -380,7 +365,7 @@ static int send_reply(struct svcxprt_rdma *rdma,
>  		      struct page *page,
>  		      struct rpcrdma_msg *rdma_resp,
>  		      struct svc_rdma_op_ctxt *ctxt,
> -		      int sge_count,
> +		      struct svc_rdma_req_map *vec,
>  		      int byte_count)
>  {
>  	struct ib_send_wr send_wr;
> @@ -413,10 +398,15 @@ static int send_reply(struct svcxprt_rdma *rdma,
>  	ctxt->sge[0].lkey = rdma->sc_phys_mr->lkey;
>  
>  	/* Determine how many of our SGE are to be transmitted */
> -	for (sge_no = 1; byte_count && sge_no < sge_count; sge_no++) {
> -		sge_bytes = min((size_t)ctxt->sge[sge_no].length,
> -				(size_t)byte_count);
> +	for (sge_no = 1; byte_count && sge_no < vec->count; sge_no++) {
> +		sge_bytes = min_t(size_t, vec->sge[sge_no].iov_len, byte_count);
>  		byte_count -= sge_bytes;
> +		ctxt->sge[sge_no].addr =
> +			ib_dma_map_single(rdma->sc_cm_id->device,
> +					  vec->sge[sge_no].iov_base,
> +					  sge_bytes, DMA_TO_DEVICE);
> +		ctxt->sge[sge_no].length = sge_bytes;
> +		ctxt->sge[sge_no].lkey = rdma->sc_phys_mr->lkey;
>  	}
>  	BUG_ON(byte_count != 0);
>  
> @@ -428,8 +418,10 @@ static int send_reply(struct svcxprt_rdma *rdma,
>  		ctxt->pages[page_no+1] = rqstp->rq_respages[page_no];
>  		ctxt->count++;
>  		rqstp->rq_respages[page_no] = NULL;
> +		/* If there are more pages than SGE, terminate SGE list */
> +		if (page_no+1 >= sge_no)
> +			ctxt->sge[page_no+1].length = 0;
>  	}
> -
>  	BUG_ON(sge_no > rdma->sc_max_sge);
>  	memset(&send_wr, 0, sizeof send_wr);
>  	ctxt->wr_op = IB_WR_SEND;
> @@ -473,20 +465,20 @@ int svc_rdma_sendto(struct svc_rqst *rqstp)
>  	enum rpcrdma_proc reply_type;
>  	int ret;
>  	int inline_bytes;
> -	struct ib_sge *sge;
> -	int sge_count = 0;
>  	struct page *res_page;
>  	struct svc_rdma_op_ctxt *ctxt;
> +	struct svc_rdma_req_map *vec;
>  
>  	dprintk("svcrdma: sending response for rqstp=%p\n", rqstp);
>  
>  	/* Get the RDMA request header. */
>  	rdma_argp = xdr_start(&rqstp->rq_arg);
>  
> -	/* Build an SGE for the XDR */
> +	/* Build an req vec for the XDR */
>  	ctxt = svc_rdma_get_context(rdma);
>  	ctxt->direction = DMA_TO_DEVICE;
> -	sge = xdr_to_sge(rdma, &rqstp->rq_res, ctxt->sge, &sge_count);
> +	vec = svc_rdma_get_req_map();
> +	xdr_to_sge(rdma, &rqstp->rq_res, vec);
>  
>  	inline_bytes = rqstp->rq_res.len;
>  
> @@ -503,7 +495,7 @@ int svc_rdma_sendto(struct svc_rqst *rqstp)
>  
>  	/* Send any write-chunk data and build resp write-list */
>  	ret = send_write_chunks(rdma, rdma_argp, rdma_resp,
> -				rqstp, sge, sge_count);
> +				rqstp, vec);
>  	if (ret < 0) {
>  		printk(KERN_ERR "svcrdma: failed to send write chunks, rc=%d\n",
>  		       ret);
> @@ -513,7 +505,7 @@ int svc_rdma_sendto(struct svc_rqst *rqstp)
>  
>  	/* Send any reply-list data and update resp reply-list */
>  	ret = send_reply_chunks(rdma, rdma_argp, rdma_resp,
> -				rqstp, sge, sge_count);
> +				rqstp, vec);
>  	if (ret < 0) {
>  		printk(KERN_ERR "svcrdma: failed to send reply chunks, rc=%d\n",
>  		       ret);
> @@ -521,11 +513,13 @@ int svc_rdma_sendto(struct svc_rqst *rqstp)
>  	}
>  	inline_bytes -= ret;
>  
> -	ret = send_reply(rdma, rqstp, res_page, rdma_resp, ctxt, sge_count,
> +	ret = send_reply(rdma, rqstp, res_page, rdma_resp, ctxt, vec,
>  			 inline_bytes);
> +	svc_rdma_put_req_map(vec);
>  	dprintk("svcrdma: send_reply returns %d\n", ret);
>  	return ret;
>   error:
> +	svc_rdma_put_req_map(vec);
>  	svc_rdma_put_context(ctxt, 0);
>  	put_page(res_page);
>  	return ret;
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> index ae90758..fc86338 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> @@ -387,10 +387,13 @@ static void sq_cq_reap(struct svcxprt_rdma *xprt)
>  
>  		switch (ctxt->wr_op) {
>  		case IB_WR_SEND:
> -		case IB_WR_RDMA_WRITE:
>  			svc_rdma_put_context(ctxt, 1);
>  			break;
>  
> +		case IB_WR_RDMA_WRITE:
> +			svc_rdma_put_context(ctxt, 0);
> +			break;
> +
>  		case IB_WR_RDMA_READ:
>  			if (test_bit(RDMACTXT_F_LAST_CTXT, &ctxt->flags)) {
>  				struct svc_rdma_op_ctxt *read_hdr = ctxt->read_hdr;

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 10/11] svcrdma: Create a kmem cache for the WR contexts
       [not found]                   ` <12120836972648-git-send-email-tom@opengridcomputing.com>
@ 2008-06-16 21:24                     ` J. Bruce Fields
  2008-06-21 17:08                       ` Tom Tucker
  0 siblings, 1 reply; 19+ messages in thread
From: J. Bruce Fields @ 2008-06-16 21:24 UTC (permalink / raw)
  To: Tom Tucker; +Cc: linux-nfs

On Thu, May 29, 2008 at 12:54:55PM -0500, Tom Tucker wrote:
> Create a kmem cache to hold WR contexts. Next we will convert
> the WR context get and put services to use this kmem cache.
> 
> Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
> 
> ---
>  net/sunrpc/xprtrdma/svc_rdma.c |   24 ++++++++++++++++++++----
>  1 files changed, 20 insertions(+), 4 deletions(-)
> 
> diff --git a/net/sunrpc/xprtrdma/svc_rdma.c b/net/sunrpc/xprtrdma/svc_rdma.c
> index 1959a9d..8a7d34b 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma.c
> @@ -69,8 +69,9 @@ atomic_t rdma_stat_rq_prod;
>  atomic_t rdma_stat_sq_poll;
>  atomic_t rdma_stat_sq_prod;
>  
> -/* Temporary NFS request map cache */
> +/* Temporary NFS request map and context caches */
>  struct kmem_cache *svc_rdma_map_cachep = NULL;
> +struct kmem_cache *svc_rdma_ctxt_cachep = NULL;

No need to initialize this one either.

>  
>  /*
>   * This function implements reading and resetting an atomic_t stat
> @@ -247,6 +248,8 @@ void svc_rdma_cleanup(void)
>  	svc_unreg_xprt_class(&svc_rdma_class);
>  	if (svc_rdma_map_cachep)
>  		kmem_cache_destroy(svc_rdma_map_cachep);
> +	if (svc_rdma_ctxt_cachep)
> +		kmem_cache_destroy(svc_rdma_ctxt_cachep);

And the conditional is probably unnecessary again.

--b.

>  }
>  
>  int svc_rdma_init(void)
> @@ -269,14 +272,27 @@ int svc_rdma_init(void)
>  						NULL);
>  	if (!svc_rdma_map_cachep) {
>  		printk(KERN_INFO "Could not allocate map cache.\n");
> -		goto err;
> +		goto err0;
> +	}
> +
> +	/* Create the temporary context cache */
> +	svc_rdma_ctxt_cachep =
> +		kmem_cache_create("svc_rdma_ctxt_cache",
> +				  sizeof(struct svc_rdma_op_ctxt),
> +				  0,
> +				  SLAB_HWCACHE_ALIGN,
> +				  NULL);
> +	if (!svc_rdma_ctxt_cachep) {
> +		printk(KERN_INFO "Could not allocate WR ctxt cache.\n");
> +		goto err1;
>  	}
>  
>  	/* Register RDMA with the SVC transport switch */
>  	svc_reg_xprt_class(&svc_rdma_class);
>  	return 0;
> -
> - err:
> + err1:
> +	kmem_cache_destroy(svc_rdma_map_cachep);
> + err0:
>  	if (svcrdma_table_header) {
>  		unregister_sysctl_table(svcrdma_table_header);
>  		svcrdma_table_header = NULL;

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 02/11] svcrdma: Use RPC reply map for RDMA_WRITE processing
  2008-06-16 21:04     ` [PATCH 02/11] svcrdma: Use RPC reply map for RDMA_WRITE processing J. Bruce Fields
@ 2008-06-21 16:26       ` Tom Tucker
  2008-06-23 18:21         ` J. Bruce Fields
  2008-06-21 16:51       ` Tom Tucker
  1 sibling, 1 reply; 19+ messages in thread
From: Tom Tucker @ 2008-06-21 16:26 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: linux-nfs

J. Bruce Fields wrote:
> On Thu, May 29, 2008 at 12:54:47PM -0500, Tom Tucker wrote:
>   
>> Use the new svc_rdma_req_map data type for mapping the client side memory
>> to the server side memory. Move the DMA mapping to the context pointed to
>> by each WR individually so that it is unmapped after the WR completes.
>>
>> Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
>>
>> ---
>>  net/sunrpc/xprtrdma/svc_rdma_sendto.c    |  122 ++++++++++++++----------------
>>  net/sunrpc/xprtrdma/svc_rdma_transport.c |    5 +-
>>  2 files changed, 62 insertions(+), 65 deletions(-)
>>
>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
>> index fb82b1b..7cd65b7 100644
>> --- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
>> +++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
>> @@ -64,10 +64,9 @@
>>   * SGE[sge_count-1]    data from xdr->tail.
>>   *
>>   */
>> -static struct ib_sge *xdr_to_sge(struct svcxprt_rdma *xprt,
>> -				 struct xdr_buf *xdr,
>> -				 struct ib_sge *sge,
>> -				 int *sge_count)
>> +static void xdr_to_sge(struct svcxprt_rdma *xprt,
>> +		       struct xdr_buf *xdr,
>> +		       struct svc_rdma_req_map *vec)
>>  {
>>  	/* Max we need is the length of the XDR / pagesize + one for
>>  	 * head + one for tail + one for RPCRDMA header
>> @@ -84,14 +83,10 @@ static struct ib_sge *xdr_to_sge(struct svcxprt_rdma *xprt,
>>  	sge_no = 1;
>>     
>
> Should the arrays in the svc_rdma_req_map actually be size
> RPC_SVC_MAXPAGES+1 to allow for the RPCRDMA header?
>   
Yes, I think this is a bug. I'm going to clamp max data in a different 
way in an upcomeing patch. So this will change in a subsequent patch.

> Also, the only caller of xdr_to_sge() appears to be svc_rdma_sendto(),
> which is called from svc_sendto() immediately after setting:
>
> 	xb = &rqstp->rq_res;
> 	xb->len = xb->head[0].iov_len + 
> 		xb->page_len +
> 		xb->tail[0].iov_len;
>
> So I think xdr_to_sge() is doing a bunch of unnecessary work to handle
> the case where xdr->len could be less than that sum?
>
>   
Ok... check me below please.
>>  
>>  	/* Head SGE */
>> -	sge[sge_no].addr = ib_dma_map_single(xprt->sc_cm_id->device,
>> -					     xdr->head[0].iov_base,
>> -					     xdr->head[0].iov_len,
>> -					     DMA_TO_DEVICE);
>> +	vec->sge[sge_no].iov_base = xdr->head[0].iov_base;
>>  	sge_bytes = min_t(u32, byte_count, xdr->head[0].iov_len);
>>     

This doesn't need to be min_t. It could be xdr->head[0].iov_len.

>>  	byte_count -= sge_bytes;
>> -	sge[sge_no].length = sge_bytes;
>> -	sge[sge_no].lkey = xprt->sc_phys_mr->lkey;
>> +	vec->sge[sge_no].iov_len = sge_bytes;
>>  	sge_no++;
>>  
>>  	/* pages SGE */
>> @@ -99,16 +94,13 @@ static struct ib_sge *xdr_to_sge(struct svcxprt_rdma *xprt,
>>  	page_bytes = xdr->page_len;
>>  	page_off = xdr->page_base;
>>  	while (byte_count && page_bytes) {
>> +		vec->sge[sge_no].iov_base =
>> +			page_address(xdr->pages[page_no]) + page_off;
>>  		sge_bytes = min_t(u32, byte_count, (PAGE_SIZE-page_off));
>>     
This is still needed if the xdr terminates with a portion of a page and 
there is not tail.
>> -		sge[sge_no].addr =
>> -			ib_dma_map_page(xprt->sc_cm_id->device,
>> -					xdr->pages[page_no], page_off,
>> -					sge_bytes, DMA_TO_DEVICE);
>>  		sge_bytes = min(sge_bytes, page_bytes);
>>  		byte_count -= sge_bytes;
>>  		page_bytes -= sge_bytes;
>> -		sge[sge_no].length = sge_bytes;
>> -		sge[sge_no].lkey = xprt->sc_phys_mr->lkey;
>> +		vec->sge[sge_no].iov_len = sge_bytes;
>>  
>>  		sge_no++;
>>  		page_no++;
>> @@ -117,23 +109,17 @@ static struct ib_sge *xdr_to_sge(struct svcxprt_rdma *xprt,
>>  
>>  	/* Tail SGE */
>>  	if (byte_count && xdr->tail[0].iov_len) {
>>     
This is defensive. It could just be byte_count.
If byte_count != xdr->tail[0].iov_len, the BUG_ON below will let us know.
>> -		sge[sge_no].addr =
>> -			ib_dma_map_single(xprt->sc_cm_id->device,
>> -					  xdr->tail[0].iov_base,
>> -					  xdr->tail[0].iov_len,
>> -					  DMA_TO_DEVICE);
>> +		vec->sge[sge_no].iov_base = xdr->tail[0].iov_base;
>>  		sge_bytes = min_t(u32, byte_count, xdr->tail[0].iov_len);
>>     
The min_t isn't needed. It could just be byte_count.
>>  		byte_count -= sge_bytes;
>> -		sge[sge_no].length = sge_bytes;
>> -		sge[sge_no].lkey = xprt->sc_phys_mr->lkey;
>> +		vec->sge[sge_no].iov_len = sge_bytes;
>>  		sge_no++;
>>  	}
>>  
>>  	BUG_ON(sge_no > sge_max);
>>  	BUG_ON(byte_count != 0);
>>  
>> -	*sge_count = sge_no;
>> -	return sge;
>> +	vec->count = sge_no;
>>  }
>>  
>>  
>> @@ -143,9 +129,8 @@ static struct ib_sge *xdr_to_sge(struct svcxprt_rdma *xprt,
>>  static int send_write(struct svcxprt_rdma *xprt, struct svc_rqst *rqstp,
>>  		      u32 rmr, u64 to,
>>  		      u32 xdr_off, int write_len,
>> -		      struct ib_sge *xdr_sge, int sge_count)
>> +		      struct svc_rdma_req_map *vec)
>>  {
>> -	struct svc_rdma_op_ctxt *tmp_sge_ctxt;
>>  	struct ib_send_wr write_wr;
>>  	struct ib_sge *sge;
>>  	int xdr_sge_no;
>> @@ -156,23 +141,22 @@ static int send_write(struct svcxprt_rdma *xprt, struct svc_rqst *rqstp,
>>  	struct svc_rdma_op_ctxt *ctxt;
>>  	int ret = 0;
>>  
>> -	BUG_ON(sge_count > RPCSVC_MAXPAGES);
>> +	BUG_ON(vec->count > RPCSVC_MAXPAGES);
>>  	dprintk("svcrdma: RDMA_WRITE rmr=%x, to=%llx, xdr_off=%d, "
>> -		"write_len=%d, xdr_sge=%p, sge_count=%d\n",
>> +		"write_len=%d, vec->sge=%p, vec->count=%lu\n",
>>  		rmr, (unsigned long long)to, xdr_off,
>> -		write_len, xdr_sge, sge_count);
>> +		write_len, vec->sge, vec->count);
>>  
>>  	ctxt = svc_rdma_get_context(xprt);
>> -	ctxt->count = 0;
>> -	tmp_sge_ctxt = svc_rdma_get_context(xprt);
>> -	sge = tmp_sge_ctxt->sge;
>> +	ctxt->direction = DMA_TO_DEVICE;
>> +	sge = ctxt->sge;
>>  
>>  	/* Find the SGE associated with xdr_off */
>> -	for (bc = xdr_off, xdr_sge_no = 1; bc && xdr_sge_no < sge_count;
>> +	for (bc = xdr_off, xdr_sge_no = 1; bc && xdr_sge_no < vec->count;
>>  	     xdr_sge_no++) {
>> -		if (xdr_sge[xdr_sge_no].length > bc)
>> +		if (vec->sge[xdr_sge_no].iov_len > bc)
>>  			break;
>> -		bc -= xdr_sge[xdr_sge_no].length;
>> +		bc -= vec->sge[xdr_sge_no].iov_len;
>>  	}
>>  
>>  	sge_off = bc;
>> @@ -180,21 +164,27 @@ static int send_write(struct svcxprt_rdma *xprt, struct svc_rqst *rqstp,
>>  	sge_no = 0;
>>  
>>  	/* Copy the remaining SGE */
>> -	while (bc != 0 && xdr_sge_no < sge_count) {
>> -		sge[sge_no].addr = xdr_sge[xdr_sge_no].addr + sge_off;
>> -		sge[sge_no].lkey = xdr_sge[xdr_sge_no].lkey;
>> +	while (bc != 0 && xdr_sge_no < vec->count) {
>> +		sge[sge_no].lkey = xprt->sc_phys_mr->lkey;
>>  		sge_bytes = min((size_t)bc,
>> -				(size_t)(xdr_sge[xdr_sge_no].length-sge_off));
>> +				(size_t)(vec->sge[xdr_sge_no].iov_len-sge_off));
>>  		sge[sge_no].length = sge_bytes;
>> -
>> +		sge[sge_no].addr =
>> +			ib_dma_map_single(xprt->sc_cm_id->device,
>> +					  (void *)
>> +					  vec->sge[xdr_sge_no].iov_base + sge_off,
>> +					  sge_bytes, DMA_TO_DEVICE);
>> +		if (dma_mapping_error(sge[sge_no].addr))
>> +			return -EINVAL;
>>     
>
> We leak a svc_rdma_op_ctxt on error here.
>   
Ok.
> --b.
>
>   
>>  		sge_off = 0;
>>  		sge_no++;
>> +		ctxt->count++;
>>  		xdr_sge_no++;
>>  		bc -= sge_bytes;
>>  	}
>>  
>>  	BUG_ON(bc != 0);
>> -	BUG_ON(xdr_sge_no > sge_count);
>> +	BUG_ON(xdr_sge_no > vec->count);
>>  
>>  	/* Prepare WRITE WR */
>>  	memset(&write_wr, 0, sizeof write_wr);
>> @@ -210,11 +200,10 @@ static int send_write(struct svcxprt_rdma *xprt, struct svc_rqst *rqstp,
>>  	/* Post It */
>>  	atomic_inc(&rdma_stat_write);
>>  	if (svc_rdma_send(xprt, &write_wr)) {
>> -		svc_rdma_put_context(ctxt, 1);
>> +		svc_rdma_put_context(ctxt, 0);
>>  		/* Fatal error, close transport */
>>  		ret = -EIO;
>>  	}
>> -	svc_rdma_put_context(tmp_sge_ctxt, 0);
>>  	return ret;
>>  }
>>  
>> @@ -222,8 +211,7 @@ static int send_write_chunks(struct svcxprt_rdma *xprt,
>>  			     struct rpcrdma_msg *rdma_argp,
>>  			     struct rpcrdma_msg *rdma_resp,
>>  			     struct svc_rqst *rqstp,
>> -			     struct ib_sge *sge,
>> -			     int sge_count)
>> +			     struct svc_rdma_req_map *vec)
>>  {
>>  	u32 xfer_len = rqstp->rq_res.page_len + rqstp->rq_res.tail[0].iov_len;
>>  	int write_len;
>> @@ -269,8 +257,7 @@ static int send_write_chunks(struct svcxprt_rdma *xprt,
>>  					 rs_offset + chunk_off,
>>  					 xdr_off,
>>  					 this_write,
>> -					 sge,
>> -					 sge_count);
>> +					 vec);
>>  			if (ret) {
>>  				dprintk("svcrdma: RDMA_WRITE failed, ret=%d\n",
>>  					ret);
>> @@ -292,8 +279,7 @@ static int send_reply_chunks(struct svcxprt_rdma *xprt,
>>  			     struct rpcrdma_msg *rdma_argp,
>>  			     struct rpcrdma_msg *rdma_resp,
>>  			     struct svc_rqst *rqstp,
>> -			     struct ib_sge *sge,
>> -			     int sge_count)
>> +			     struct svc_rdma_req_map *vec)
>>  {
>>  	u32 xfer_len = rqstp->rq_res.len;
>>  	int write_len;
>> @@ -341,8 +327,7 @@ static int send_reply_chunks(struct svcxprt_rdma *xprt,
>>  					 rs_offset + chunk_off,
>>  					 xdr_off,
>>  					 this_write,
>> -					 sge,
>> -					 sge_count);
>> +					 vec);
>>  			if (ret) {
>>  				dprintk("svcrdma: RDMA_WRITE failed, ret=%d\n",
>>  					ret);
>> @@ -380,7 +365,7 @@ static int send_reply(struct svcxprt_rdma *rdma,
>>  		      struct page *page,
>>  		      struct rpcrdma_msg *rdma_resp,
>>  		      struct svc_rdma_op_ctxt *ctxt,
>> -		      int sge_count,
>> +		      struct svc_rdma_req_map *vec,
>>  		      int byte_count)
>>  {
>>  	struct ib_send_wr send_wr;
>> @@ -413,10 +398,15 @@ static int send_reply(struct svcxprt_rdma *rdma,
>>  	ctxt->sge[0].lkey = rdma->sc_phys_mr->lkey;
>>  
>>  	/* Determine how many of our SGE are to be transmitted */
>> -	for (sge_no = 1; byte_count && sge_no < sge_count; sge_no++) {
>> -		sge_bytes = min((size_t)ctxt->sge[sge_no].length,
>> -				(size_t)byte_count);
>> +	for (sge_no = 1; byte_count && sge_no < vec->count; sge_no++) {
>> +		sge_bytes = min_t(size_t, vec->sge[sge_no].iov_len, byte_count);
>>  		byte_count -= sge_bytes;
>> +		ctxt->sge[sge_no].addr =
>> +			ib_dma_map_single(rdma->sc_cm_id->device,
>> +					  vec->sge[sge_no].iov_base,
>> +					  sge_bytes, DMA_TO_DEVICE);
>> +		ctxt->sge[sge_no].length = sge_bytes;
>> +		ctxt->sge[sge_no].lkey = rdma->sc_phys_mr->lkey;
>>  	}
>>  	BUG_ON(byte_count != 0);
>>  
>> @@ -428,8 +418,10 @@ static int send_reply(struct svcxprt_rdma *rdma,
>>  		ctxt->pages[page_no+1] = rqstp->rq_respages[page_no];
>>  		ctxt->count++;
>>  		rqstp->rq_respages[page_no] = NULL;
>> +		/* If there are more pages than SGE, terminate SGE list */
>> +		if (page_no+1 >= sge_no)
>> +			ctxt->sge[page_no+1].length = 0;
>>  	}
>> -
>>  	BUG_ON(sge_no > rdma->sc_max_sge);
>>  	memset(&send_wr, 0, sizeof send_wr);
>>  	ctxt->wr_op = IB_WR_SEND;
>> @@ -473,20 +465,20 @@ int svc_rdma_sendto(struct svc_rqst *rqstp)
>>  	enum rpcrdma_proc reply_type;
>>  	int ret;
>>  	int inline_bytes;
>> -	struct ib_sge *sge;
>> -	int sge_count = 0;
>>  	struct page *res_page;
>>  	struct svc_rdma_op_ctxt *ctxt;
>> +	struct svc_rdma_req_map *vec;
>>  
>>  	dprintk("svcrdma: sending response for rqstp=%p\n", rqstp);
>>  
>>  	/* Get the RDMA request header. */
>>  	rdma_argp = xdr_start(&rqstp->rq_arg);
>>  
>> -	/* Build an SGE for the XDR */
>> +	/* Build an req vec for the XDR */
>>  	ctxt = svc_rdma_get_context(rdma);
>>  	ctxt->direction = DMA_TO_DEVICE;
>> -	sge = xdr_to_sge(rdma, &rqstp->rq_res, ctxt->sge, &sge_count);
>> +	vec = svc_rdma_get_req_map();
>> +	xdr_to_sge(rdma, &rqstp->rq_res, vec);
>>  
>>  	inline_bytes = rqstp->rq_res.len;
>>  
>> @@ -503,7 +495,7 @@ int svc_rdma_sendto(struct svc_rqst *rqstp)
>>  
>>  	/* Send any write-chunk data and build resp write-list */
>>  	ret = send_write_chunks(rdma, rdma_argp, rdma_resp,
>> -				rqstp, sge, sge_count);
>> +				rqstp, vec);
>>  	if (ret < 0) {
>>  		printk(KERN_ERR "svcrdma: failed to send write chunks, rc=%d\n",
>>  		       ret);
>> @@ -513,7 +505,7 @@ int svc_rdma_sendto(struct svc_rqst *rqstp)
>>  
>>  	/* Send any reply-list data and update resp reply-list */
>>  	ret = send_reply_chunks(rdma, rdma_argp, rdma_resp,
>> -				rqstp, sge, sge_count);
>> +				rqstp, vec);
>>  	if (ret < 0) {
>>  		printk(KERN_ERR "svcrdma: failed to send reply chunks, rc=%d\n",
>>  		       ret);
>> @@ -521,11 +513,13 @@ int svc_rdma_sendto(struct svc_rqst *rqstp)
>>  	}
>>  	inline_bytes -= ret;
>>  
>> -	ret = send_reply(rdma, rqstp, res_page, rdma_resp, ctxt, sge_count,
>> +	ret = send_reply(rdma, rqstp, res_page, rdma_resp, ctxt, vec,
>>  			 inline_bytes);
>> +	svc_rdma_put_req_map(vec);
>>  	dprintk("svcrdma: send_reply returns %d\n", ret);
>>  	return ret;
>>   error:
>> +	svc_rdma_put_req_map(vec);
>>  	svc_rdma_put_context(ctxt, 0);
>>  	put_page(res_page);
>>  	return ret;
>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
>> index ae90758..fc86338 100644
>> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
>> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
>> @@ -387,10 +387,13 @@ static void sq_cq_reap(struct svcxprt_rdma *xprt)
>>  
>>  		switch (ctxt->wr_op) {
>>  		case IB_WR_SEND:
>> -		case IB_WR_RDMA_WRITE:
>>  			svc_rdma_put_context(ctxt, 1);
>>  			break;
>>  
>> +		case IB_WR_RDMA_WRITE:
>> +			svc_rdma_put_context(ctxt, 0);
>> +			break;
>> +
>>  		case IB_WR_RDMA_READ:
>>  			if (test_bit(RDMACTXT_F_LAST_CTXT, &ctxt->flags)) {
>>  				struct svc_rdma_op_ctxt *read_hdr = ctxt->read_hdr;
>>     
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>   


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 01/11] svcrdma: Add a type for keeping NFS RPC mapping
  2008-06-16 19:48   ` [PATCH 01/11] svcrdma: Add a type for keeping NFS RPC mapping J. Bruce Fields
@ 2008-06-21 16:31     ` Tom Tucker
  2008-06-23 18:27       ` J. Bruce Fields
  0 siblings, 1 reply; 19+ messages in thread
From: Tom Tucker @ 2008-06-21 16:31 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: linux-nfs

J. Bruce Fields wrote:
> On Thu, May 29, 2008 at 12:54:46PM -0500, Tom Tucker wrote:
>   
>> Create a new data structure to hold the remote client address space
>> to local server address space mapping.
>>
>> Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
>>
>> ---
>>  include/linux/sunrpc/svc_rdma.h          |   27 +++++++++++++++++++++++++++
>>  net/sunrpc/xprtrdma/svc_rdma.c           |   23 +++++++++++++++++++++++
>>  net/sunrpc/xprtrdma/svc_rdma_transport.c |   26 ++++++++++++++++++++++++++
>>  3 files changed, 76 insertions(+), 0 deletions(-)
>>
>> diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
>> index 05eb466..bd8749c 100644
>> --- a/include/linux/sunrpc/svc_rdma.h
>> +++ b/include/linux/sunrpc/svc_rdma.h
>> @@ -86,6 +86,31 @@ struct svc_rdma_op_ctxt {
>>  	struct page *pages[RPCSVC_MAXPAGES];
>>  };
>>  
>> +/*
>> + * NFS_ requests are mapped on the client side by the chunk lists in
>> + * the RPCRDMA header. During the fetching of the RPC from the client
>> + * and the writing of the reply to the client, the memory in the
>> + * client and the memory in the server must be mapped as contiguous
>> + * vaddr/len for access by the hardware. These data strucures keep
>> + * these mappings.
>> + *
>> + * For an RDMA_WRITE, the 'sge' maps the RPC REPLY. For RDMA_READ, the
>> + * 'sge' in the svc_rdma_req_map maps the server side RPC reply and the
>> + * 'ch' field maps the read-list of the RPCRDMA header to the 'sge'
>> + * mapping of the reply.
>> + */
>> +struct svc_rdma_chunk_sge {
>> +	int start;		/* sge no for this chunk */
>> +	int count;		/* sge count for this chunk */
>> +};
>> +struct svc_rdma_req_map {
>> +	unsigned long count;
>> +	union {
>> +		struct kvec sge[RPCSVC_MAXPAGES];
>> +		struct svc_rdma_chunk_sge ch[RPCSVC_MAXPAGES];
>> +	};
>> +};
>> +
>>  #define RDMACTXT_F_LAST_CTXT	2
>>  
>>  struct svcxprt_rdma {
>> @@ -173,6 +198,8 @@ extern int svc_rdma_post_recv(struct svcxprt_rdma *);
>>  extern int svc_rdma_create_listen(struct svc_serv *, int, struct sockaddr *);
>>  extern struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *);
>>  extern void svc_rdma_put_context(struct svc_rdma_op_ctxt *, int);
>> +extern struct svc_rdma_req_map *svc_rdma_get_req_map(void);
>> +extern void svc_rdma_put_req_map(struct svc_rdma_req_map *);
>>  extern void svc_sq_reap(struct svcxprt_rdma *);
>>  extern void svc_rq_reap(struct svcxprt_rdma *);
>>  extern struct svc_xprt_class svc_rdma_class;
>> diff --git a/net/sunrpc/xprtrdma/svc_rdma.c b/net/sunrpc/xprtrdma/svc_rdma.c
>> index 88c0ca2..545ea72 100644
>> --- a/net/sunrpc/xprtrdma/svc_rdma.c
>> +++ b/net/sunrpc/xprtrdma/svc_rdma.c
>> @@ -69,6 +69,9 @@ atomic_t rdma_stat_rq_prod;
>>  atomic_t rdma_stat_sq_poll;
>>  atomic_t rdma_stat_sq_prod;
>>  
>> +/* Temporary NFS request map cache */
>> +struct kmem_cache *svc_rdma_map_cachep = NULL;
>>     
>
> No need to initialize globals to NULL.
>
>   
I thought only static objects were initialized per the C standard. Or 
are you saying that this particular
global doesn't need to be initialized because of the way it is used?
>> +
>>  /*
>>   * This function implements reading and resetting an atomic_t stat
>>   * variable through read/write to a proc file. Any write to the file
>> @@ -241,6 +244,8 @@ void svc_rdma_cleanup(void)
>>  		svcrdma_table_header = NULL;
>>  	}
>>  	svc_unreg_xprt_class(&svc_rdma_class);
>> +	if (svc_rdma_map_cachep)
>> +		kmem_cache_destroy(svc_rdma_map_cachep);
>>     
>
> By design this can't be NULL (and the code presumably would have oopsed
> much earlier if it were mistakenly allowed to be) so I'm inclined not to
> bother checking....
>
>   
Ok.
>>  }
>>  
>>  int svc_rdma_init(void)
>> @@ -255,9 +260,27 @@ int svc_rdma_init(void)
>>  		svcrdma_table_header =
>>  			register_sysctl_table(svcrdma_root_table);
>>  
>> +	/* Create the temporary map cache */
>> +	svc_rdma_map_cachep = kmem_cache_create("svc_rdma_map_cache",
>> +						sizeof(struct svc_rdma_req_map),
>> +						0,
>> +						SLAB_HWCACHE_ALIGN,
>> +						NULL);
>> +	if (!svc_rdma_map_cachep) {
>> +		printk(KERN_INFO "Could not allocate map cache.\n");
>> +		goto err;
>> +	}
>> +
>>  	/* Register RDMA with the SVC transport switch */
>>  	svc_reg_xprt_class(&svc_rdma_class);
>>  	return 0;
>> +
>> + err:
>> +	if (svcrdma_table_header) {
>> +		unregister_sysctl_table(svcrdma_table_header);
>>     
>
> unregister_sysctl_table() already handles the NULL case, so you could
> skip the if (svcrdma_table_header).
>   
Ok.
> --b.
>
>   
>> +		svcrdma_table_header = NULL;
>> +	}
>> +	return -ENOMEM;
>>  }
>>  MODULE_AUTHOR("Tom Tucker <tom@opengridcomputing.com>");
>>  MODULE_DESCRIPTION("SVC RDMA Transport");
>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
>> index e132509..ae90758 100644
>> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
>> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
>> @@ -173,6 +173,32 @@ void svc_rdma_put_context(struct svc_rdma_op_ctxt *ctxt, int free_pages)
>>  	atomic_dec(&xprt->sc_ctxt_used);
>>  }
>>  
>> +/* Temporary NFS request map cache. Created in svc_rdma.c  */
>> +extern struct kmem_cache *svc_rdma_map_cachep;
>> +
>> +/*
>> + * Temporary NFS req mappings are shared across all transport
>> + * instances. These are short lived and should be bounded by the number
>> + * of concurrent server threads * depth of the SQ.
>> + */
>> +struct svc_rdma_req_map *svc_rdma_get_req_map(void)
>> +{
>> +	struct svc_rdma_req_map *map;
>> +	while (1) {
>> +		map = kmem_cache_alloc(svc_rdma_map_cachep, GFP_KERNEL);
>> +		if (map)
>> +			break;
>> +		schedule_timeout_uninterruptible(msecs_to_jiffies(500));
>> +	}
>> +	map->count = 0;
>> +	return map;
>> +}
>> +
>> +void svc_rdma_put_req_map(struct svc_rdma_req_map *map)
>> +{
>> +	kmem_cache_free(svc_rdma_map_cachep, map);
>> +}
>> +
>>  /* ib_cq event handler */
>>  static void cq_event_handler(struct ib_event *event, void *context)
>>  {
>>     


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 02/11] svcrdma: Use RPC reply map for RDMA_WRITE processing
  2008-06-16 21:04     ` [PATCH 02/11] svcrdma: Use RPC reply map for RDMA_WRITE processing J. Bruce Fields
  2008-06-21 16:26       ` Tom Tucker
@ 2008-06-21 16:51       ` Tom Tucker
  2008-06-23 18:51         ` J. Bruce Fields
  1 sibling, 1 reply; 19+ messages in thread
From: Tom Tucker @ 2008-06-21 16:51 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: linux-nfs

J. Bruce Fields wrote:
> On Thu, May 29, 2008 at 12:54:47PM -0500, Tom Tucker wrote:
>   
>> Use the new svc_rdma_req_map data type for mapping the client side memory
>> to the server side memory. Move the DMA mapping to the context pointed to
>> by each WR individually so that it is unmapped after the WR completes.
>>
>> Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
>>
>> ---
>>  net/sunrpc/xprtrdma/svc_rdma_sendto.c    |  122 ++++++++++++++----------------
>>  net/sunrpc/xprtrdma/svc_rdma_transport.c |    5 +-
>>  2 files changed, 62 insertions(+), 65 deletions(-)
>>
>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
>> index fb82b1b..7cd65b7 100644
>> --- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
>> +++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
>> @@ -64,10 +64,9 @@
>>   * SGE[sge_count-1]    data from xdr->tail.
>>   *
>>   */
>> -static struct ib_sge *xdr_to_sge(struct svcxprt_rdma *xprt,
>> -				 struct xdr_buf *xdr,
>> -				 struct ib_sge *sge,
>> -				 int *sge_count)
>> +static void xdr_to_sge(struct svcxprt_rdma *xprt,
>> +		       struct xdr_buf *xdr,
>> +		       struct svc_rdma_req_map *vec)
>>  {
>>  	/* Max we need is the length of the XDR / pagesize + one for
>>  	 * head + one for tail + one for RPCRDMA header
>> @@ -84,14 +83,10 @@ static struct ib_sge *xdr_to_sge(struct svcxprt_rdma *xprt,
>>  	sge_no = 1;
>>     
>
> Should the arrays in the svc_rdma_req_map actually be size
> RPC_SVC_MAXPAGES+1 to allow for the RPCRDMA header?
>
>   
Actually, upon further inspection, this is not a bug. The 
RPCSVC_MAXPAGES is used to size the rqstp
page array. This array includes a page for the request hdr, reply hdr 
and one extra page for alignment. For
our use case, the reply hdr page covers the RDMA header.

However, I think this payload length should be clamped in a different 
way. It seems silly to allocate these
things to handle big-data mounts that are rarely used in practice. I 
think it's better to have a transport max
that is more reasonable by default and a module parameter that is 
configurable if you're looking to do big-data
mounts.

Thoughts?
> Also, the only caller of xdr_to_sge() appears to be svc_rdma_sendto(),
> which is called from svc_sendto() immediately after setting:
>
> 	xb = &rqstp->rq_res;
> 	xb->len = xb->head[0].iov_len + 
> 		xb->page_len +
> 		xb->tail[0].iov_len;
>
> So I think xdr_to_sge() is doing a bunch of unnecessary work to handle
> the case where xdr->len could be less than that sum?
>
>   
>>  
>>  	/* Head SGE */
>> -	sge[sge_no].addr = ib_dma_map_single(xprt->sc_cm_id->device,
>> -					     xdr->head[0].iov_base,
>> -					     xdr->head[0].iov_len,
>> -					     DMA_TO_DEVICE);
>> +	vec->sge[sge_no].iov_base = xdr->head[0].iov_base;
>>  	sge_bytes = min_t(u32, byte_count, xdr->head[0].iov_len);
>>  	byte_count -= sge_bytes;
>> -	sge[sge_no].length = sge_bytes;
>> -	sge[sge_no].lkey = xprt->sc_phys_mr->lkey;
>> +	vec->sge[sge_no].iov_len = sge_bytes;
>>  	sge_no++;
>>  
>>  	/* pages SGE */
>> @@ -99,16 +94,13 @@ static struct ib_sge *xdr_to_sge(struct svcxprt_rdma *xprt,
>>  	page_bytes = xdr->page_len;
>>  	page_off = xdr->page_base;
>>  	while (byte_count && page_bytes) {
>> +		vec->sge[sge_no].iov_base =
>> +			page_address(xdr->pages[page_no]) + page_off;
>>  		sge_bytes = min_t(u32, byte_count, (PAGE_SIZE-page_off));
>> -		sge[sge_no].addr =
>> -			ib_dma_map_page(xprt->sc_cm_id->device,
>> -					xdr->pages[page_no], page_off,
>> -					sge_bytes, DMA_TO_DEVICE);
>>  		sge_bytes = min(sge_bytes, page_bytes);
>>  		byte_count -= sge_bytes;
>>  		page_bytes -= sge_bytes;
>> -		sge[sge_no].length = sge_bytes;
>> -		sge[sge_no].lkey = xprt->sc_phys_mr->lkey;
>> +		vec->sge[sge_no].iov_len = sge_bytes;
>>  
>>  		sge_no++;
>>  		page_no++;
>> @@ -117,23 +109,17 @@ static struct ib_sge *xdr_to_sge(struct svcxprt_rdma *xprt,
>>  
>>  	/* Tail SGE */
>>  	if (byte_count && xdr->tail[0].iov_len) {
>> -		sge[sge_no].addr =
>> -			ib_dma_map_single(xprt->sc_cm_id->device,
>> -					  xdr->tail[0].iov_base,
>> -					  xdr->tail[0].iov_len,
>> -					  DMA_TO_DEVICE);
>> +		vec->sge[sge_no].iov_base = xdr->tail[0].iov_base;
>>  		sge_bytes = min_t(u32, byte_count, xdr->tail[0].iov_len);
>>  		byte_count -= sge_bytes;
>> -		sge[sge_no].length = sge_bytes;
>> -		sge[sge_no].lkey = xprt->sc_phys_mr->lkey;
>> +		vec->sge[sge_no].iov_len = sge_bytes;
>>  		sge_no++;
>>  	}
>>  
>>  	BUG_ON(sge_no > sge_max);
>>  	BUG_ON(byte_count != 0);
>>  
>> -	*sge_count = sge_no;
>> -	return sge;
>> +	vec->count = sge_no;
>>  }
>>  
>>  
>> @@ -143,9 +129,8 @@ static struct ib_sge *xdr_to_sge(struct svcxprt_rdma *xprt,
>>  static int send_write(struct svcxprt_rdma *xprt, struct svc_rqst *rqstp,
>>  		      u32 rmr, u64 to,
>>  		      u32 xdr_off, int write_len,
>> -		      struct ib_sge *xdr_sge, int sge_count)
>> +		      struct svc_rdma_req_map *vec)
>>  {
>> -	struct svc_rdma_op_ctxt *tmp_sge_ctxt;
>>  	struct ib_send_wr write_wr;
>>  	struct ib_sge *sge;
>>  	int xdr_sge_no;
>> @@ -156,23 +141,22 @@ static int send_write(struct svcxprt_rdma *xprt, struct svc_rqst *rqstp,
>>  	struct svc_rdma_op_ctxt *ctxt;
>>  	int ret = 0;
>>  
>> -	BUG_ON(sge_count > RPCSVC_MAXPAGES);
>> +	BUG_ON(vec->count > RPCSVC_MAXPAGES);
>>  	dprintk("svcrdma: RDMA_WRITE rmr=%x, to=%llx, xdr_off=%d, "
>> -		"write_len=%d, xdr_sge=%p, sge_count=%d\n",
>> +		"write_len=%d, vec->sge=%p, vec->count=%lu\n",
>>  		rmr, (unsigned long long)to, xdr_off,
>> -		write_len, xdr_sge, sge_count);
>> +		write_len, vec->sge, vec->count);
>>  
>>  	ctxt = svc_rdma_get_context(xprt);
>> -	ctxt->count = 0;
>> -	tmp_sge_ctxt = svc_rdma_get_context(xprt);
>> -	sge = tmp_sge_ctxt->sge;
>> +	ctxt->direction = DMA_TO_DEVICE;
>> +	sge = ctxt->sge;
>>  
>>  	/* Find the SGE associated with xdr_off */
>> -	for (bc = xdr_off, xdr_sge_no = 1; bc && xdr_sge_no < sge_count;
>> +	for (bc = xdr_off, xdr_sge_no = 1; bc && xdr_sge_no < vec->count;
>>  	     xdr_sge_no++) {
>> -		if (xdr_sge[xdr_sge_no].length > bc)
>> +		if (vec->sge[xdr_sge_no].iov_len > bc)
>>  			break;
>> -		bc -= xdr_sge[xdr_sge_no].length;
>> +		bc -= vec->sge[xdr_sge_no].iov_len;
>>  	}
>>  
>>  	sge_off = bc;
>> @@ -180,21 +164,27 @@ static int send_write(struct svcxprt_rdma *xprt, struct svc_rqst *rqstp,
>>  	sge_no = 0;
>>  
>>  	/* Copy the remaining SGE */
>> -	while (bc != 0 && xdr_sge_no < sge_count) {
>> -		sge[sge_no].addr = xdr_sge[xdr_sge_no].addr + sge_off;
>> -		sge[sge_no].lkey = xdr_sge[xdr_sge_no].lkey;
>> +	while (bc != 0 && xdr_sge_no < vec->count) {
>> +		sge[sge_no].lkey = xprt->sc_phys_mr->lkey;
>>  		sge_bytes = min((size_t)bc,
>> -				(size_t)(xdr_sge[xdr_sge_no].length-sge_off));
>> +				(size_t)(vec->sge[xdr_sge_no].iov_len-sge_off));
>>  		sge[sge_no].length = sge_bytes;
>> -
>> +		sge[sge_no].addr =
>> +			ib_dma_map_single(xprt->sc_cm_id->device,
>> +					  (void *)
>> +					  vec->sge[xdr_sge_no].iov_base + sge_off,
>> +					  sge_bytes, DMA_TO_DEVICE);
>> +		if (dma_mapping_error(sge[sge_no].addr))
>> +			return -EINVAL;
>>     
>
> We leak a svc_rdma_op_ctxt on error here.
>
> --b.
>
>   
>>  		sge_off = 0;
>>  		sge_no++;
>> +		ctxt->count++;
>>  		xdr_sge_no++;
>>  		bc -= sge_bytes;
>>  	}
>>  
>>  	BUG_ON(bc != 0);
>> -	BUG_ON(xdr_sge_no > sge_count);
>> +	BUG_ON(xdr_sge_no > vec->count);
>>  
>>  	/* Prepare WRITE WR */
>>  	memset(&write_wr, 0, sizeof write_wr);
>> @@ -210,11 +200,10 @@ static int send_write(struct svcxprt_rdma *xprt, struct svc_rqst *rqstp,
>>  	/* Post It */
>>  	atomic_inc(&rdma_stat_write);
>>  	if (svc_rdma_send(xprt, &write_wr)) {
>> -		svc_rdma_put_context(ctxt, 1);
>> +		svc_rdma_put_context(ctxt, 0);
>>  		/* Fatal error, close transport */
>>  		ret = -EIO;
>>  	}
>> -	svc_rdma_put_context(tmp_sge_ctxt, 0);
>>  	return ret;
>>  }
>>  
>> @@ -222,8 +211,7 @@ static int send_write_chunks(struct svcxprt_rdma *xprt,
>>  			     struct rpcrdma_msg *rdma_argp,
>>  			     struct rpcrdma_msg *rdma_resp,
>>  			     struct svc_rqst *rqstp,
>> -			     struct ib_sge *sge,
>> -			     int sge_count)
>> +			     struct svc_rdma_req_map *vec)
>>  {
>>  	u32 xfer_len = rqstp->rq_res.page_len + rqstp->rq_res.tail[0].iov_len;
>>  	int write_len;
>> @@ -269,8 +257,7 @@ static int send_write_chunks(struct svcxprt_rdma *xprt,
>>  					 rs_offset + chunk_off,
>>  					 xdr_off,
>>  					 this_write,
>> -					 sge,
>> -					 sge_count);
>> +					 vec);
>>  			if (ret) {
>>  				dprintk("svcrdma: RDMA_WRITE failed, ret=%d\n",
>>  					ret);
>> @@ -292,8 +279,7 @@ static int send_reply_chunks(struct svcxprt_rdma *xprt,
>>  			     struct rpcrdma_msg *rdma_argp,
>>  			     struct rpcrdma_msg *rdma_resp,
>>  			     struct svc_rqst *rqstp,
>> -			     struct ib_sge *sge,
>> -			     int sge_count)
>> +			     struct svc_rdma_req_map *vec)
>>  {
>>  	u32 xfer_len = rqstp->rq_res.len;
>>  	int write_len;
>> @@ -341,8 +327,7 @@ static int send_reply_chunks(struct svcxprt_rdma *xprt,
>>  					 rs_offset + chunk_off,
>>  					 xdr_off,
>>  					 this_write,
>> -					 sge,
>> -					 sge_count);
>> +					 vec);
>>  			if (ret) {
>>  				dprintk("svcrdma: RDMA_WRITE failed, ret=%d\n",
>>  					ret);
>> @@ -380,7 +365,7 @@ static int send_reply(struct svcxprt_rdma *rdma,
>>  		      struct page *page,
>>  		      struct rpcrdma_msg *rdma_resp,
>>  		      struct svc_rdma_op_ctxt *ctxt,
>> -		      int sge_count,
>> +		      struct svc_rdma_req_map *vec,
>>  		      int byte_count)
>>  {
>>  	struct ib_send_wr send_wr;
>> @@ -413,10 +398,15 @@ static int send_reply(struct svcxprt_rdma *rdma,
>>  	ctxt->sge[0].lkey = rdma->sc_phys_mr->lkey;
>>  
>>  	/* Determine how many of our SGE are to be transmitted */
>> -	for (sge_no = 1; byte_count && sge_no < sge_count; sge_no++) {
>> -		sge_bytes = min((size_t)ctxt->sge[sge_no].length,
>> -				(size_t)byte_count);
>> +	for (sge_no = 1; byte_count && sge_no < vec->count; sge_no++) {
>> +		sge_bytes = min_t(size_t, vec->sge[sge_no].iov_len, byte_count);
>>  		byte_count -= sge_bytes;
>> +		ctxt->sge[sge_no].addr =
>> +			ib_dma_map_single(rdma->sc_cm_id->device,
>> +					  vec->sge[sge_no].iov_base,
>> +					  sge_bytes, DMA_TO_DEVICE);
>> +		ctxt->sge[sge_no].length = sge_bytes;
>> +		ctxt->sge[sge_no].lkey = rdma->sc_phys_mr->lkey;
>>  	}
>>  	BUG_ON(byte_count != 0);
>>  
>> @@ -428,8 +418,10 @@ static int send_reply(struct svcxprt_rdma *rdma,
>>  		ctxt->pages[page_no+1] = rqstp->rq_respages[page_no];
>>  		ctxt->count++;
>>  		rqstp->rq_respages[page_no] = NULL;
>> +		/* If there are more pages than SGE, terminate SGE list */
>> +		if (page_no+1 >= sge_no)
>> +			ctxt->sge[page_no+1].length = 0;
>>  	}
>> -
>>  	BUG_ON(sge_no > rdma->sc_max_sge);
>>  	memset(&send_wr, 0, sizeof send_wr);
>>  	ctxt->wr_op = IB_WR_SEND;
>> @@ -473,20 +465,20 @@ int svc_rdma_sendto(struct svc_rqst *rqstp)
>>  	enum rpcrdma_proc reply_type;
>>  	int ret;
>>  	int inline_bytes;
>> -	struct ib_sge *sge;
>> -	int sge_count = 0;
>>  	struct page *res_page;
>>  	struct svc_rdma_op_ctxt *ctxt;
>> +	struct svc_rdma_req_map *vec;
>>  
>>  	dprintk("svcrdma: sending response for rqstp=%p\n", rqstp);
>>  
>>  	/* Get the RDMA request header. */
>>  	rdma_argp = xdr_start(&rqstp->rq_arg);
>>  
>> -	/* Build an SGE for the XDR */
>> +	/* Build an req vec for the XDR */
>>  	ctxt = svc_rdma_get_context(rdma);
>>  	ctxt->direction = DMA_TO_DEVICE;
>> -	sge = xdr_to_sge(rdma, &rqstp->rq_res, ctxt->sge, &sge_count);
>> +	vec = svc_rdma_get_req_map();
>> +	xdr_to_sge(rdma, &rqstp->rq_res, vec);
>>  
>>  	inline_bytes = rqstp->rq_res.len;
>>  
>> @@ -503,7 +495,7 @@ int svc_rdma_sendto(struct svc_rqst *rqstp)
>>  
>>  	/* Send any write-chunk data and build resp write-list */
>>  	ret = send_write_chunks(rdma, rdma_argp, rdma_resp,
>> -				rqstp, sge, sge_count);
>> +				rqstp, vec);
>>  	if (ret < 0) {
>>  		printk(KERN_ERR "svcrdma: failed to send write chunks, rc=%d\n",
>>  		       ret);
>> @@ -513,7 +505,7 @@ int svc_rdma_sendto(struct svc_rqst *rqstp)
>>  
>>  	/* Send any reply-list data and update resp reply-list */
>>  	ret = send_reply_chunks(rdma, rdma_argp, rdma_resp,
>> -				rqstp, sge, sge_count);
>> +				rqstp, vec);
>>  	if (ret < 0) {
>>  		printk(KERN_ERR "svcrdma: failed to send reply chunks, rc=%d\n",
>>  		       ret);
>> @@ -521,11 +513,13 @@ int svc_rdma_sendto(struct svc_rqst *rqstp)
>>  	}
>>  	inline_bytes -= ret;
>>  
>> -	ret = send_reply(rdma, rqstp, res_page, rdma_resp, ctxt, sge_count,
>> +	ret = send_reply(rdma, rqstp, res_page, rdma_resp, ctxt, vec,
>>  			 inline_bytes);
>> +	svc_rdma_put_req_map(vec);
>>  	dprintk("svcrdma: send_reply returns %d\n", ret);
>>  	return ret;
>>   error:
>> +	svc_rdma_put_req_map(vec);
>>  	svc_rdma_put_context(ctxt, 0);
>>  	put_page(res_page);
>>  	return ret;
>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
>> index ae90758..fc86338 100644
>> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
>> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
>> @@ -387,10 +387,13 @@ static void sq_cq_reap(struct svcxprt_rdma *xprt)
>>  
>>  		switch (ctxt->wr_op) {
>>  		case IB_WR_SEND:
>> -		case IB_WR_RDMA_WRITE:
>>  			svc_rdma_put_context(ctxt, 1);
>>  			break;
>>  
>> +		case IB_WR_RDMA_WRITE:
>> +			svc_rdma_put_context(ctxt, 0);
>> +			break;
>> +
>>  		case IB_WR_RDMA_READ:
>>  			if (test_bit(RDMACTXT_F_LAST_CTXT, &ctxt->flags)) {
>>  				struct svc_rdma_op_ctxt *read_hdr = ctxt->read_hdr;
>>     
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>   


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 10/11] svcrdma: Create a kmem cache for the WR contexts
  2008-06-16 21:24                     ` [PATCH 10/11] svcrdma: Create a kmem cache for the WR contexts J. Bruce Fields
@ 2008-06-21 17:08                       ` Tom Tucker
  0 siblings, 0 replies; 19+ messages in thread
From: Tom Tucker @ 2008-06-21 17:08 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: linux-nfs

J. Bruce Fields wrote:
> On Thu, May 29, 2008 at 12:54:55PM -0500, Tom Tucker wrote:
>   
>> Create a kmem cache to hold WR contexts. Next we will convert
>> the WR context get and put services to use this kmem cache.
>>
>> Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
>>
>> ---
>>  net/sunrpc/xprtrdma/svc_rdma.c |   24 ++++++++++++++++++++----
>>  1 files changed, 20 insertions(+), 4 deletions(-)
>>
>> diff --git a/net/sunrpc/xprtrdma/svc_rdma.c b/net/sunrpc/xprtrdma/svc_rdma.c
>> index 1959a9d..8a7d34b 100644
>> --- a/net/sunrpc/xprtrdma/svc_rdma.c
>> +++ b/net/sunrpc/xprtrdma/svc_rdma.c
>> @@ -69,8 +69,9 @@ atomic_t rdma_stat_rq_prod;
>>  atomic_t rdma_stat_sq_poll;
>>  atomic_t rdma_stat_sq_prod;
>>  
>> -/* Temporary NFS request map cache */
>> +/* Temporary NFS request map and context caches */
>>  struct kmem_cache *svc_rdma_map_cachep = NULL;
>> +struct kmem_cache *svc_rdma_ctxt_cachep = NULL;
>>     
>
> No need to initialize this one either.
>
>   
Given the way it's used, agreed.
>>  
>>  /*
>>   * This function implements reading and resetting an atomic_t stat
>> @@ -247,6 +248,8 @@ void svc_rdma_cleanup(void)
>>  	svc_unreg_xprt_class(&svc_rdma_class);
>>  	if (svc_rdma_map_cachep)
>>  		kmem_cache_destroy(svc_rdma_map_cachep);
>> +	if (svc_rdma_ctxt_cachep)
>> +		kmem_cache_destroy(svc_rdma_ctxt_cachep);
>>     
>
> And the conditional is probably unnecessary again.
>
>   
ok.
> --b.
>
>   
>>  }
>>  
>>  int svc_rdma_init(void)
>> @@ -269,14 +272,27 @@ int svc_rdma_init(void)
>>  						NULL);
>>  	if (!svc_rdma_map_cachep) {
>>  		printk(KERN_INFO "Could not allocate map cache.\n");
>> -		goto err;
>> +		goto err0;
>> +	}
>> +
>> +	/* Create the temporary context cache */
>> +	svc_rdma_ctxt_cachep =
>> +		kmem_cache_create("svc_rdma_ctxt_cache",
>> +				  sizeof(struct svc_rdma_op_ctxt),
>> +				  0,
>> +				  SLAB_HWCACHE_ALIGN,
>> +				  NULL);
>> +	if (!svc_rdma_ctxt_cachep) {
>> +		printk(KERN_INFO "Could not allocate WR ctxt cache.\n");
>> +		goto err1;
>>  	}
>>  
>>  	/* Register RDMA with the SVC transport switch */
>>  	svc_reg_xprt_class(&svc_rdma_class);
>>  	return 0;
>> -
>> - err:
>> + err1:
>> +	kmem_cache_destroy(svc_rdma_map_cachep);
>> + err0:
>>  	if (svcrdma_table_header) {
>>  		unregister_sysctl_table(svcrdma_table_header);
>>  		svcrdma_table_header = NULL;
>>     


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 02/11] svcrdma: Use RPC reply map for RDMA_WRITE processing
  2008-06-21 16:26       ` Tom Tucker
@ 2008-06-23 18:21         ` J. Bruce Fields
  2008-06-24  2:29           ` Tom Tucker
  0 siblings, 1 reply; 19+ messages in thread
From: J. Bruce Fields @ 2008-06-23 18:21 UTC (permalink / raw)
  To: Tom Tucker; +Cc: linux-nfs

On Sat, Jun 21, 2008 at 11:26:52AM -0500, Tom Tucker wrote:
> J. Bruce Fields wrote:
>> Also, the only caller of xdr_to_sge() appears to be svc_rdma_sendto(),
>> which is called from svc_sendto() immediately after setting:
>>
>> 	xb = &rqstp->rq_res;
>> 	xb->len = xb->head[0].iov_len + 		xb->page_len +
>> 		xb->tail[0].iov_len;
>>
>> So I think xdr_to_sge() is doing a bunch of unnecessary work to handle
>> the case where xdr->len could be less than that sum?
>>
>>   
> Ok... check me below please.
>>>   	/* Head SGE */
>>> -	sge[sge_no].addr = ib_dma_map_single(xprt->sc_cm_id->device,
>>> -					     xdr->head[0].iov_base,
>>> -					     xdr->head[0].iov_len,
>>> -					     DMA_TO_DEVICE);
>>> +	vec->sge[sge_no].iov_base = xdr->head[0].iov_base;
>>>  	sge_bytes = min_t(u32, byte_count, xdr->head[0].iov_len);
>>>     
>
> This doesn't need to be min_t. It could be xdr->head[0].iov_len.

Yes.  The variable byte_count shouldn't be needed at all, so I'd start
by eliminating all references to byte_count.

>
>>>  	byte_count -= sge_bytes;
>>> -	sge[sge_no].length = sge_bytes;
>>> -	sge[sge_no].lkey = xprt->sc_phys_mr->lkey;
>>> +	vec->sge[sge_no].iov_len = sge_bytes;
>>>  	sge_no++;
>>>   	/* pages SGE */
>>> @@ -99,16 +94,13 @@ static struct ib_sge *xdr_to_sge(struct svcxprt_rdma *xprt,
>>>  	page_bytes = xdr->page_len;
>>>  	page_off = xdr->page_base;
>>>  	while (byte_count && page_bytes) {
>>> +		vec->sge[sge_no].iov_base =
>>> +			page_address(xdr->pages[page_no]) + page_off;
>>>  		sge_bytes = min_t(u32, byte_count, (PAGE_SIZE-page_off));
>>>     
> This is still needed if the xdr terminates with a portion of a page and  
> there is not tail.

Yes, but the condition could just be (page_bytes).

>>> -		sge[sge_no].addr =
>>> -			ib_dma_map_page(xprt->sc_cm_id->device,
>>> -					xdr->pages[page_no], page_off,
>>> -					sge_bytes, DMA_TO_DEVICE);
>>>  		sge_bytes = min(sge_bytes, page_bytes);
>>>  		byte_count -= sge_bytes;
>>>  		page_bytes -= sge_bytes;
>>> -		sge[sge_no].length = sge_bytes;
>>> -		sge[sge_no].lkey = xprt->sc_phys_mr->lkey;
>>> +		vec->sge[sge_no].iov_len = sge_bytes;
>>>   		sge_no++;
>>>  		page_no++;
>>> @@ -117,23 +109,17 @@ static struct ib_sge *xdr_to_sge(struct svcxprt_rdma *xprt,
>>>   	/* Tail SGE */
>>>  	if (byte_count && xdr->tail[0].iov_len) {
>>>     
> This is defensive. It could just be byte_count.

I agree, though I'd just make it (xdr->tail[0].iov_len) instead of just
(byte_count).

> If byte_count != xdr->tail[0].iov_len, the BUG_ON below will let us know.
>>> -		sge[sge_no].addr =
>>> -			ib_dma_map_single(xprt->sc_cm_id->device,
>>> -					  xdr->tail[0].iov_base,
>>> -					  xdr->tail[0].iov_len,
>>> -					  DMA_TO_DEVICE);
>>> +		vec->sge[sge_no].iov_base = xdr->tail[0].iov_base;
>>>  		sge_bytes = min_t(u32, byte_count, xdr->tail[0].iov_len);
>>>     
> The min_t isn't needed. It could just be byte_count.

Right, but could also just be tail length, as above.

--b.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 01/11] svcrdma: Add a type for keeping NFS RPC mapping
  2008-06-21 16:31     ` Tom Tucker
@ 2008-06-23 18:27       ` J. Bruce Fields
  2008-06-24  2:58         ` Tom Tucker
  0 siblings, 1 reply; 19+ messages in thread
From: J. Bruce Fields @ 2008-06-23 18:27 UTC (permalink / raw)
  To: Tom Tucker; +Cc: linux-nfs

On Sat, Jun 21, 2008 at 11:31:43AM -0500, Tom Tucker wrote:
> J. Bruce Fields wrote:
>> On Thu, May 29, 2008 at 12:54:46PM -0500, Tom Tucker wrote:
>>   
>>> Create a new data structure to hold the remote client address space
>>> to local server address space mapping.
>>>
>>> Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
>>>
>>> ---
>>>  include/linux/sunrpc/svc_rdma.h          |   27 +++++++++++++++++++++++++++
>>>  net/sunrpc/xprtrdma/svc_rdma.c           |   23 +++++++++++++++++++++++
>>>  net/sunrpc/xprtrdma/svc_rdma_transport.c |   26 ++++++++++++++++++++++++++
>>>  3 files changed, 76 insertions(+), 0 deletions(-)
>>>
>>> diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
>>> index 05eb466..bd8749c 100644
>>> --- a/include/linux/sunrpc/svc_rdma.h
>>> +++ b/include/linux/sunrpc/svc_rdma.h
>>> @@ -86,6 +86,31 @@ struct svc_rdma_op_ctxt {
>>>  	struct page *pages[RPCSVC_MAXPAGES];
>>>  };
>>>  +/*
>>> + * NFS_ requests are mapped on the client side by the chunk lists in
>>> + * the RPCRDMA header. During the fetching of the RPC from the client
>>> + * and the writing of the reply to the client, the memory in the
>>> + * client and the memory in the server must be mapped as contiguous
>>> + * vaddr/len for access by the hardware. These data strucures keep
>>> + * these mappings.
>>> + *
>>> + * For an RDMA_WRITE, the 'sge' maps the RPC REPLY. For RDMA_READ, the
>>> + * 'sge' in the svc_rdma_req_map maps the server side RPC reply and the
>>> + * 'ch' field maps the read-list of the RPCRDMA header to the 'sge'
>>> + * mapping of the reply.
>>> + */
>>> +struct svc_rdma_chunk_sge {
>>> +	int start;		/* sge no for this chunk */
>>> +	int count;		/* sge count for this chunk */
>>> +};
>>> +struct svc_rdma_req_map {
>>> +	unsigned long count;
>>> +	union {
>>> +		struct kvec sge[RPCSVC_MAXPAGES];
>>> +		struct svc_rdma_chunk_sge ch[RPCSVC_MAXPAGES];
>>> +	};
>>> +};
>>> +
>>>  #define RDMACTXT_F_LAST_CTXT	2
>>>   struct svcxprt_rdma {
>>> @@ -173,6 +198,8 @@ extern int svc_rdma_post_recv(struct svcxprt_rdma *);
>>>  extern int svc_rdma_create_listen(struct svc_serv *, int, struct sockaddr *);
>>>  extern struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *);
>>>  extern void svc_rdma_put_context(struct svc_rdma_op_ctxt *, int);
>>> +extern struct svc_rdma_req_map *svc_rdma_get_req_map(void);
>>> +extern void svc_rdma_put_req_map(struct svc_rdma_req_map *);
>>>  extern void svc_sq_reap(struct svcxprt_rdma *);
>>>  extern void svc_rq_reap(struct svcxprt_rdma *);
>>>  extern struct svc_xprt_class svc_rdma_class;
>>> diff --git a/net/sunrpc/xprtrdma/svc_rdma.c b/net/sunrpc/xprtrdma/svc_rdma.c
>>> index 88c0ca2..545ea72 100644
>>> --- a/net/sunrpc/xprtrdma/svc_rdma.c
>>> +++ b/net/sunrpc/xprtrdma/svc_rdma.c
>>> @@ -69,6 +69,9 @@ atomic_t rdma_stat_rq_prod;
>>>  atomic_t rdma_stat_sq_poll;
>>>  atomic_t rdma_stat_sq_prod;
>>>  +/* Temporary NFS request map cache */
>>> +struct kmem_cache *svc_rdma_map_cachep = NULL;
>>>     
>>
>> No need to initialize globals to NULL.
>>
>>   
> I thought only static objects were initialized per the C standard. Or  
> are you saying that this particular
> global doesn't need to be initialized because of the way it is used?

I don't know if the initialization is mandated by the standards or
whether it's just gcc behavior, but I know I get a complaint every time
somebody finds one of those....

That may partly just be a preference for conciseness, but I think it may
also allow gcc to put the thing in a different section and save some
space in the on-disk kernel--I don't know.

--b.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 02/11] svcrdma: Use RPC reply map for RDMA_WRITE processing
  2008-06-21 16:51       ` Tom Tucker
@ 2008-06-23 18:51         ` J. Bruce Fields
  2008-06-24  3:02           ` Tom Tucker
  0 siblings, 1 reply; 19+ messages in thread
From: J. Bruce Fields @ 2008-06-23 18:51 UTC (permalink / raw)
  To: Tom Tucker; +Cc: linux-nfs

On Sat, Jun 21, 2008 at 11:51:18AM -0500, Tom Tucker wrote:
> J. Bruce Fields wrote:
>> On Thu, May 29, 2008 at 12:54:47PM -0500, Tom Tucker wrote:
>>   
>>> Use the new svc_rdma_req_map data type for mapping the client side memory
>>> to the server side memory. Move the DMA mapping to the context pointed to
>>> by each WR individually so that it is unmapped after the WR completes.
>>>
>>> Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
>>>
>>> ---
>>>  net/sunrpc/xprtrdma/svc_rdma_sendto.c    |  122 ++++++++++++++----------------
>>>  net/sunrpc/xprtrdma/svc_rdma_transport.c |    5 +-
>>>  2 files changed, 62 insertions(+), 65 deletions(-)
>>>
>>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
>>> index fb82b1b..7cd65b7 100644
>>> --- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
>>> +++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
>>> @@ -64,10 +64,9 @@
>>>   * SGE[sge_count-1]    data from xdr->tail.
>>>   *
>>>   */
>>> -static struct ib_sge *xdr_to_sge(struct svcxprt_rdma *xprt,
>>> -				 struct xdr_buf *xdr,
>>> -				 struct ib_sge *sge,
>>> -				 int *sge_count)
>>> +static void xdr_to_sge(struct svcxprt_rdma *xprt,
>>> +		       struct xdr_buf *xdr,
>>> +		       struct svc_rdma_req_map *vec)
>>>  {
>>>  	/* Max we need is the length of the XDR / pagesize + one for
>>>  	 * head + one for tail + one for RPCRDMA header
>>> @@ -84,14 +83,10 @@ static struct ib_sge *xdr_to_sge(struct svcxprt_rdma *xprt,
>>>  	sge_no = 1;
>>>     
>>
>> Should the arrays in the svc_rdma_req_map actually be size
>> RPC_SVC_MAXPAGES+1 to allow for the RPCRDMA header?
>>
>>   
> Actually, upon further inspection, this is not a bug. The  
> RPCSVC_MAXPAGES is used to size the rqstp
> page array. This array includes a page for the request hdr, reply hdr  
> and one extra page for alignment. For
> our use case, the reply hdr page covers the RDMA header.

OK, that makes sense.  (Let's just fix the comment above so I don't
forget and ask the same question again next time.)

> However, I think this payload length should be clamped in a different  
> way. It seems silly to allocate these
> things to handle big-data mounts that are rarely used in practice. I  
> think it's better to have a transport max
> that is more reasonable by default and a module parameter that is  
> configurable if you're looking to do big-data
> mounts.
>
> Thoughts?

This is decided in fs/nfsd/nfssvc.c:nfsd_create_serv(), based on the
machine's ram size, and can be overridden with the nfsd filesystem's
maxblksize parameter.

I'm open to suggestions, but my feeling is that it's hard to educate
people about how to use additional sysctl's, and that it's reasonable to
waste some memory to get good out-of-the-box performance on streaming
IO.

See also recent patches from Olga and Dean for similar discussion
(about WAN tcp performance in their case, since the size of these
buffers affects the size of the tcp window we advertise).

--b.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 02/11] svcrdma: Use RPC reply map for RDMA_WRITE processing
  2008-06-23 18:21         ` J. Bruce Fields
@ 2008-06-24  2:29           ` Tom Tucker
  0 siblings, 0 replies; 19+ messages in thread
From: Tom Tucker @ 2008-06-24  2:29 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: linux-nfs

J. Bruce Fields wrote:
> On Sat, Jun 21, 2008 at 11:26:52AM -0500, Tom Tucker wrote:
>   
>> J. Bruce Fields wrote:
>>     
>>> Also, the only caller of xdr_to_sge() appears to be svc_rdma_sendto(),
>>> which is called from svc_sendto() immediately after setting:
>>>
>>> 	xb = &rqstp->rq_res;
>>> 	xb->len = xb->head[0].iov_len + 		xb->page_len +
>>> 		xb->tail[0].iov_len;
>>>
>>> So I think xdr_to_sge() is doing a bunch of unnecessary work to handle
>>> the case where xdr->len could be less than that sum?
>>>
>>>   
>>>       
>> Ok... check me below please.
>>     
>>>>   	/* Head SGE */
>>>> -	sge[sge_no].addr = ib_dma_map_single(xprt->sc_cm_id->device,
>>>> -					     xdr->head[0].iov_base,
>>>> -					     xdr->head[0].iov_len,
>>>> -					     DMA_TO_DEVICE);
>>>> +	vec->sge[sge_no].iov_base = xdr->head[0].iov_base;
>>>>  	sge_bytes = min_t(u32, byte_count, xdr->head[0].iov_len);
>>>>     
>>>>         
>> This doesn't need to be min_t. It could be xdr->head[0].iov_len.
>>     
>
> Yes.  The variable byte_count shouldn't be needed at all, so I'd start
> by eliminating all references to byte_count.
>
>   
>>>>  	byte_count -= sge_bytes;
>>>> -	sge[sge_no].length = sge_bytes;
>>>> -	sge[sge_no].lkey = xprt->sc_phys_mr->lkey;
>>>> +	vec->sge[sge_no].iov_len = sge_bytes;
>>>>  	sge_no++;
>>>>   	/* pages SGE */
>>>> @@ -99,16 +94,13 @@ static struct ib_sge *xdr_to_sge(struct svcxprt_rdma *xprt,
>>>>  	page_bytes = xdr->page_len;
>>>>  	page_off = xdr->page_base;
>>>>  	while (byte_count && page_bytes) {
>>>> +		vec->sge[sge_no].iov_base =
>>>> +			page_address(xdr->pages[page_no]) + page_off;
>>>>  		sge_bytes = min_t(u32, byte_count, (PAGE_SIZE-page_off));
>>>>     
>>>>         
>> This is still needed if the xdr terminates with a portion of a page and  
>> there is not tail.
>>     
>
> Yes, but the condition could just be (page_bytes).
>
>   
Right.
>>>> -		sge[sge_no].addr =
>>>> -			ib_dma_map_page(xprt->sc_cm_id->device,
>>>> -					xdr->pages[page_no], page_off,
>>>> -					sge_bytes, DMA_TO_DEVICE);
>>>>  		sge_bytes = min(sge_bytes, page_bytes);
>>>>  		byte_count -= sge_bytes;
>>>>  		page_bytes -= sge_bytes;
>>>> -		sge[sge_no].length = sge_bytes;
>>>> -		sge[sge_no].lkey = xprt->sc_phys_mr->lkey;
>>>> +		vec->sge[sge_no].iov_len = sge_bytes;
>>>>   		sge_no++;
>>>>  		page_no++;
>>>> @@ -117,23 +109,17 @@ static struct ib_sge *xdr_to_sge(struct svcxprt_rdma *xprt,
>>>>   	/* Tail SGE */
>>>>  	if (byte_count && xdr->tail[0].iov_len) {
>>>>     
>>>>         
>> This is defensive. It could just be byte_count.
>>     
>
> I agree, though I'd just make it (xdr->tail[0].iov_len) instead of just
> (byte_count).
>
>   
Agreed.
>> If byte_count != xdr->tail[0].iov_len, the BUG_ON below will let us know.
>>     
>>>> -		sge[sge_no].addr =
>>>> -			ib_dma_map_single(xprt->sc_cm_id->device,
>>>> -					  xdr->tail[0].iov_base,
>>>> -					  xdr->tail[0].iov_len,
>>>> -					  DMA_TO_DEVICE);
>>>> +		vec->sge[sge_no].iov_base = xdr->tail[0].iov_base;
>>>>  		sge_bytes = min_t(u32, byte_count, xdr->tail[0].iov_len);
>>>>     
>>>>         
>> The min_t isn't needed. It could just be byte_count.
>>     
>
> Right, but could also just be tail length, as above.
>
>   

Ok, page # confirmed...

> --b.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>   


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 01/11] svcrdma: Add a type for keeping NFS RPC mapping
  2008-06-23 18:27       ` J. Bruce Fields
@ 2008-06-24  2:58         ` Tom Tucker
  2008-06-24 19:58           ` J. Bruce Fields
  0 siblings, 1 reply; 19+ messages in thread
From: Tom Tucker @ 2008-06-24  2:58 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: linux-nfs

J. Bruce Fields wrote:
> On Sat, Jun 21, 2008 at 11:31:43AM -0500, Tom Tucker wrote:
>   
>> J. Bruce Fields wrote:
>>     
>>> On Thu, May 29, 2008 at 12:54:46PM -0500, Tom Tucker wrote:
>>>   
>>>       
>>>> Create a new data structure to hold the remote client address space
>>>> to local server address space mapping.
>>>>
>>>> Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
>>>>
>>>> ---
>>>>  include/linux/sunrpc/svc_rdma.h          |   27 +++++++++++++++++++++++++++
>>>>  net/sunrpc/xprtrdma/svc_rdma.c           |   23 +++++++++++++++++++++++
>>>>  net/sunrpc/xprtrdma/svc_rdma_transport.c |   26 ++++++++++++++++++++++++++
>>>>  3 files changed, 76 insertions(+), 0 deletions(-)
>>>>
>>>> diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
>>>> index 05eb466..bd8749c 100644
>>>> --- a/include/linux/sunrpc/svc_rdma.h
>>>> +++ b/include/linux/sunrpc/svc_rdma.h
>>>> @@ -86,6 +86,31 @@ struct svc_rdma_op_ctxt {
>>>>  	struct page *pages[RPCSVC_MAXPAGES];
>>>>  };
>>>>  +/*
>>>> + * NFS_ requests are mapped on the client side by the chunk lists in
>>>> + * the RPCRDMA header. During the fetching of the RPC from the client
>>>> + * and the writing of the reply to the client, the memory in the
>>>> + * client and the memory in the server must be mapped as contiguous
>>>> + * vaddr/len for access by the hardware. These data strucures keep
>>>> + * these mappings.
>>>> + *
>>>> + * For an RDMA_WRITE, the 'sge' maps the RPC REPLY. For RDMA_READ, the
>>>> + * 'sge' in the svc_rdma_req_map maps the server side RPC reply and the
>>>> + * 'ch' field maps the read-list of the RPCRDMA header to the 'sge'
>>>> + * mapping of the reply.
>>>> + */
>>>> +struct svc_rdma_chunk_sge {
>>>> +	int start;		/* sge no for this chunk */
>>>> +	int count;		/* sge count for this chunk */
>>>> +};
>>>> +struct svc_rdma_req_map {
>>>> +	unsigned long count;
>>>> +	union {
>>>> +		struct kvec sge[RPCSVC_MAXPAGES];
>>>> +		struct svc_rdma_chunk_sge ch[RPCSVC_MAXPAGES];
>>>> +	};
>>>> +};
>>>> +
>>>>  #define RDMACTXT_F_LAST_CTXT	2
>>>>   struct svcxprt_rdma {
>>>> @@ -173,6 +198,8 @@ extern int svc_rdma_post_recv(struct svcxprt_rdma *);
>>>>  extern int svc_rdma_create_listen(struct svc_serv *, int, struct sockaddr *);
>>>>  extern struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *);
>>>>  extern void svc_rdma_put_context(struct svc_rdma_op_ctxt *, int);
>>>> +extern struct svc_rdma_req_map *svc_rdma_get_req_map(void);
>>>> +extern void svc_rdma_put_req_map(struct svc_rdma_req_map *);
>>>>  extern void svc_sq_reap(struct svcxprt_rdma *);
>>>>  extern void svc_rq_reap(struct svcxprt_rdma *);
>>>>  extern struct svc_xprt_class svc_rdma_class;
>>>> diff --git a/net/sunrpc/xprtrdma/svc_rdma.c b/net/sunrpc/xprtrdma/svc_rdma.c
>>>> index 88c0ca2..545ea72 100644
>>>> --- a/net/sunrpc/xprtrdma/svc_rdma.c
>>>> +++ b/net/sunrpc/xprtrdma/svc_rdma.c
>>>> @@ -69,6 +69,9 @@ atomic_t rdma_stat_rq_prod;
>>>>  atomic_t rdma_stat_sq_poll;
>>>>  atomic_t rdma_stat_sq_prod;
>>>>  +/* Temporary NFS request map cache */
>>>> +struct kmem_cache *svc_rdma_map_cachep = NULL;
>>>>     
>>>>         
>>> No need to initialize globals to NULL.
>>>
>>>   
>>>       
>> I thought only static objects were initialized per the C standard. Or  
>> are you saying that this particular
>> global doesn't need to be initialized because of the way it is used?
>>     
>
> I don't know if the initialization is mandated by the standards or
> whether it's just gcc behavior, but I know I get a complaint every time
> somebody finds one of those....
>
>   

In this case, the initialization is unnecessary, so it can be safely dumped.

> That may partly just be a preference for conciseness, but I think it may
> also allow gcc to put the thing in a different section and save some
> space in the on-disk kernel--I don't know.
>
>   
Right -- un-initialized data goes in a different section, the .bss 
section in particular. Since the .bss section is not "initialized", 
there are no values (zeroes in this case) sitting in the object file 
ready to be mapped into whatever location becomes .bss. By contrast, the 
.data section contains initialized data and the initial values are 
sitting in the object file.

So my question here is a little subtler:

A. Do we discard _all_ zero initializations of non-static globals 
because we can safely assume that .bss is initialzed (not a fan of 
this),  or

B. Don't be an idiot and initialize objects unnecessarily because it 
bloats the kernel object file?

An idiot voting for  B,
Tom

> --b.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>   


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 02/11] svcrdma: Use RPC reply map for RDMA_WRITE processing
  2008-06-23 18:51         ` J. Bruce Fields
@ 2008-06-24  3:02           ` Tom Tucker
  0 siblings, 0 replies; 19+ messages in thread
From: Tom Tucker @ 2008-06-24  3:02 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: linux-nfs

J. Bruce Fields wrote:
> On Sat, Jun 21, 2008 at 11:51:18AM -0500, Tom Tucker wrote:
>   
>> J. Bruce Fields wrote:
>>     
>>> On Thu, May 29, 2008 at 12:54:47PM -0500, Tom Tucker wrote:
>>>   
>>>       
>>>> Use the new svc_rdma_req_map data type for mapping the client side memory
>>>> to the server side memory. Move the DMA mapping to the context pointed to
>>>> by each WR individually so that it is unmapped after the WR completes.
>>>>
>>>> Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
>>>>
>>>> ---
>>>>  net/sunrpc/xprtrdma/svc_rdma_sendto.c    |  122 ++++++++++++++----------------
>>>>  net/sunrpc/xprtrdma/svc_rdma_transport.c |    5 +-
>>>>  2 files changed, 62 insertions(+), 65 deletions(-)
>>>>
>>>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
>>>> index fb82b1b..7cd65b7 100644
>>>> --- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
>>>> +++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
>>>> @@ -64,10 +64,9 @@
>>>>   * SGE[sge_count-1]    data from xdr->tail.
>>>>   *
>>>>   */
>>>> -static struct ib_sge *xdr_to_sge(struct svcxprt_rdma *xprt,
>>>> -				 struct xdr_buf *xdr,
>>>> -				 struct ib_sge *sge,
>>>> -				 int *sge_count)
>>>> +static void xdr_to_sge(struct svcxprt_rdma *xprt,
>>>> +		       struct xdr_buf *xdr,
>>>> +		       struct svc_rdma_req_map *vec)
>>>>  {
>>>>  	/* Max we need is the length of the XDR / pagesize + one for
>>>>  	 * head + one for tail + one for RPCRDMA header
>>>> @@ -84,14 +83,10 @@ static struct ib_sge *xdr_to_sge(struct svcxprt_rdma *xprt,
>>>>  	sge_no = 1;
>>>>     
>>>>         
>>> Should the arrays in the svc_rdma_req_map actually be size
>>> RPC_SVC_MAXPAGES+1 to allow for the RPCRDMA header?
>>>
>>>   
>>>       
>> Actually, upon further inspection, this is not a bug. The  
>> RPCSVC_MAXPAGES is used to size the rqstp
>> page array. This array includes a page for the request hdr, reply hdr  
>> and one extra page for alignment. For
>> our use case, the reply hdr page covers the RDMA header.
>>     
>
> OK, that makes sense.  (Let's just fix the comment above so I don't
> forget and ask the same question again next time.)
>
>   
Ok, done.
>> However, I think this payload length should be clamped in a different  
>> way. It seems silly to allocate these
>> things to handle big-data mounts that are rarely used in practice. I  
>> think it's better to have a transport max
>> that is more reasonable by default and a module parameter that is  
>> configurable if you're looking to do big-data
>> mounts.
>>
>> Thoughts?
>>     
>
> This is decided in fs/nfsd/nfssvc.c:nfsd_create_serv(), based on the
> machine's ram size, and can be overridden with the nfsd filesystem's
> maxblksize parameter.
>
> I'm open to suggestions, but my feeling is that it's hard to educate
> people about how to use additional sysctl's, and that it's reasonable to
> waste some memory to get good out-of-the-box performance on streaming
> IO.
>
> See also recent patches from Olga and Dean for similar discussion
> (about WAN tcp performance in their case, since the size of these
> buffers affects the size of the tcp window we advertise).
>
> --b.
>   


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 01/11] svcrdma: Add a type for keeping NFS RPC mapping
  2008-06-24  2:58         ` Tom Tucker
@ 2008-06-24 19:58           ` J. Bruce Fields
  2008-06-24 20:31             ` Benny Halevy
  2008-06-24 20:38             ` Trond Myklebust
  0 siblings, 2 replies; 19+ messages in thread
From: J. Bruce Fields @ 2008-06-24 19:58 UTC (permalink / raw)
  To: Tom Tucker; +Cc: linux-nfs

On Mon, Jun 23, 2008 at 09:58:21PM -0500, Tom Tucker wrote:
> J. Bruce Fields wrote:
>> On Sat, Jun 21, 2008 at 11:31:43AM -0500, Tom Tucker wrote:
>>   
>>> J. Bruce Fields wrote:
>>>     
>>>> On Thu, May 29, 2008 at 12:54:46PM -0500, Tom Tucker wrote:
>>>>> diff --git a/net/sunrpc/xprtrdma/svc_rdma.c b/net/sunrpc/xprtrdma/svc_rdma.c
>>>>> index 88c0ca2..545ea72 100644
>>>>> --- a/net/sunrpc/xprtrdma/svc_rdma.c
>>>>> +++ b/net/sunrpc/xprtrdma/svc_rdma.c
>>>>> @@ -69,6 +69,9 @@ atomic_t rdma_stat_rq_prod;
>>>>>  atomic_t rdma_stat_sq_poll;
>>>>>  atomic_t rdma_stat_sq_prod;
>>>>>  +/* Temporary NFS request map cache */
>>>>> +struct kmem_cache *svc_rdma_map_cachep = NULL;
>>>>>             
>>>> No need to initialize globals to NULL.
>>>>
>>>>         
>>> I thought only static objects were initialized per the C standard. Or 
>>>  are you saying that this particular
>>> global doesn't need to be initialized because of the way it is used?
>>>     
>>
>> I don't know if the initialization is mandated by the standards or
>> whether it's just gcc behavior, but I know I get a complaint every time
>> somebody finds one of those....
>>
>>   
>
> In this case, the initialization is unnecessary, so it can be safely dumped.
>
>> That may partly just be a preference for conciseness, but I think it may
>> also allow gcc to put the thing in a different section and save some
>> space in the on-disk kernel--I don't know.
>>
>>   
> Right -- un-initialized data goes in a different section, the .bss  
> section in particular. Since the .bss section is not "initialized",  
> there are no values (zeroes in this case) sitting in the object file  
> ready to be mapped into whatever location becomes .bss. By contrast, the  
> .data section contains initialized data and the initial values are  
> sitting in the object file.
>
> So my question here is a little subtler:
>
> A. Do we discard _all_ zero initializations of non-static globals  
> because we can safely assume that .bss is initialzed (not a fan of  
> this),  or
>
> B. Don't be an idiot and initialize objects unnecessarily because it  
> bloats the kernel object file?
>
> An idiot voting for  B,

My understanding is that it's A--e.g. checkpatch.pl has a dumb regex
that just checks for these assignments and whines about them however
they're used.

Google doesn't find me any more detailed discussion of the policy.

--b.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 01/11] svcrdma: Add a type for keeping NFS RPC mapping
  2008-06-24 19:58           ` J. Bruce Fields
@ 2008-06-24 20:31             ` Benny Halevy
  2008-06-24 20:38             ` Trond Myklebust
  1 sibling, 0 replies; 19+ messages in thread
From: Benny Halevy @ 2008-06-24 20:31 UTC (permalink / raw)
  To: J. Bruce Fields, Tom Tucker; +Cc: linux-nfs

J. Bruce Fields wrote:
> On Mon, Jun 23, 2008 at 09:58:21PM -0500, Tom Tucker wrote:
>> J. Bruce Fields wrote:
>>> On Sat, Jun 21, 2008 at 11:31:43AM -0500, Tom Tucker wrote:
>>>   
>>>> J. Bruce Fields wrote:
>>>>     
>>>>> On Thu, May 29, 2008 at 12:54:46PM -0500, Tom Tucker wrote:
>>>>>> diff --git a/net/sunrpc/xprtrdma/svc_rdma.c b/net/sunrpc/xprtrdma/svc_rdma.c
>>>>>> index 88c0ca2..545ea72 100644
>>>>>> --- a/net/sunrpc/xprtrdma/svc_rdma.c
>>>>>> +++ b/net/sunrpc/xprtrdma/svc_rdma.c
>>>>>> @@ -69,6 +69,9 @@ atomic_t rdma_stat_rq_prod;
>>>>>>  atomic_t rdma_stat_sq_poll;
>>>>>>  atomic_t rdma_stat_sq_prod;
>>>>>>  +/* Temporary NFS request map cache */
>>>>>> +struct kmem_cache *svc_rdma_map_cachep = NULL;
>>>>>>             
>>>>> No need to initialize globals to NULL.
>>>>>
>>>>>         
>>>> I thought only static objects were initialized per the C standard. Or 
>>>>  are you saying that this particular
>>>> global doesn't need to be initialized because of the way it is used?
>>>>     
>>> I don't know if the initialization is mandated by the standards or
>>> whether it's just gcc behavior, but I know I get a complaint every time
>>> somebody finds one of those....

The implicit initialization to zero was defined back then by K&R:
http://www.cs.utah.edu/~phister/K_n_R/chapter4.html#s4.9
"In the absence of explicit initialization, external and static variables
are guaranteed to be initialized to zero"

AFAIK this convention was kept by the C standards ever since.

>>>
>>>   
>> In this case, the initialization is unnecessary, so it can be safely dumped.
>>
>>> That may partly just be a preference for conciseness, but I think it may
>>> also allow gcc to put the thing in a different section and save some
>>> space in the on-disk kernel--I don't know.
>>>
>>>   
>> Right -- un-initialized data goes in a different section, the .bss  
>> section in particular. Since the .bss section is not "initialized",  
>> there are no values (zeroes in this case) sitting in the object file  
>> ready to be mapped into whatever location becomes .bss. By contrast, the  
>> .data section contains initialized data and the initial values are  
>> sitting in the object file.
>>
>> So my question here is a little subtler:
>>
>> A. Do we discard _all_ zero initializations of non-static globals  
>> because we can safely assume that .bss is initialzed (not a fan of  
>> this),  or

Also static locals...  What matters is that the variable has
static allocation, its scope is irrelevant to the initialization
issue.

Benny

>>
>> B. Don't be an idiot and initialize objects unnecessarily because it  
>> bloats the kernel object file?
>>
>> An idiot voting for  B,
> 
> My understanding is that it's A--e.g. checkpatch.pl has a dumb regex
> that just checks for these assignments and whines about them however
> they're used.
> 
> Google doesn't find me any more detailed discussion of the policy.
> 
> --b.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 01/11] svcrdma: Add a type for keeping NFS RPC mapping
  2008-06-24 19:58           ` J. Bruce Fields
  2008-06-24 20:31             ` Benny Halevy
@ 2008-06-24 20:38             ` Trond Myklebust
  1 sibling, 0 replies; 19+ messages in thread
From: Trond Myklebust @ 2008-06-24 20:38 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Tom Tucker, linux-nfs

On Tue, 2008-06-24 at 15:58 -0400, J. Bruce Fields wrote:
> My understanding is that it's A--e.g. checkpatch.pl has a dumb regex
> that just checks for these assignments and whines about them however
> they're used.
> 
> Google doesn't find me any more detailed discussion of the policy.

We had that discussion about 10 years ago.

What followed was a general purge of all zero initialisations of global
and static variables in the kernel, and the official policy is that
initialising those variables is wrong, and should _not_ be done.

The reason is that BSS is automatically zeroed out, and so adding an
initialiser causes unnecessary code bloat and slows down the boot
process.

Trond


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2008-06-24 20:39 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <12120836962076-git-send-email-tom@opengridcomputing.com>
2008-05-29 22:10 ` [PATCH 0/11] svcrdma: WR context management bug fixes and cleanup J. Bruce Fields
2008-05-29 22:25   ` Tom Tucker
     [not found]     ` <1212099937.22478.3.camel-SMNkleLxa3ZimH42XvhXlA@public.gmane.org>
2008-05-29 22:26       ` Roland Dreier
     [not found] ` <12120836962324-git-send-email-tom@opengridcomputing.com>
2008-06-16 19:48   ` [PATCH 01/11] svcrdma: Add a type for keeping NFS RPC mapping J. Bruce Fields
2008-06-21 16:31     ` Tom Tucker
2008-06-23 18:27       ` J. Bruce Fields
2008-06-24  2:58         ` Tom Tucker
2008-06-24 19:58           ` J. Bruce Fields
2008-06-24 20:31             ` Benny Halevy
2008-06-24 20:38             ` Trond Myklebust
     [not found]   ` <12120836963727-git-send-email-tom@opengridcomputing.com>
2008-06-16 21:04     ` [PATCH 02/11] svcrdma: Use RPC reply map for RDMA_WRITE processing J. Bruce Fields
2008-06-21 16:26       ` Tom Tucker
2008-06-23 18:21         ` J. Bruce Fields
2008-06-24  2:29           ` Tom Tucker
2008-06-21 16:51       ` Tom Tucker
2008-06-23 18:51         ` J. Bruce Fields
2008-06-24  3:02           ` Tom Tucker
     [not found]     ` <1212083697950-git-send-email-tom@opengridcomputing.com>
     [not found]       ` <1212083697236-git-send-email-tom@opengridcomputing.com>
     [not found]         ` <12120836973390-git-send-email-tom@opengridcomputing.com>
     [not found]           ` <12120836973638-git-send-email-tom@opengridcomputing.com>
     [not found]             ` <12120836973072-git-send-email-tom@opengridcomputing.com>
     [not found]               ` <12120836972503-git-send-email-tom@opengridcomputing.com>
     [not found]                 ` <12120836973166-git-send-email-tom@opengridcomputing.com>
     [not found]                   ` <12120836972648-git-send-email-tom@opengridcomputing.com>
2008-06-16 21:24                     ` [PATCH 10/11] svcrdma: Create a kmem cache for the WR contexts J. Bruce Fields
2008-06-21 17:08                       ` Tom Tucker

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox