Re: [RFC,PATCH 11/15] knfsd: RDMA transport core

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Tom Tucker <tom@opengridcomputing.com>
To: Neil Brown <neilb@suse.de>
Cc: Tom Talpey <Thomas.Talpey@netapp.com>,
	Linux NFS Mailing List <nfs@lists.sourceforge.net>,
	Peter Leckie <pleckie@melbourne.sgi.com>,
	Greg Banks <gnb@sgi.com>,
	Trond Myklebust <trond.myklebust@fys.uio.no>
Subject: Re: [RFC,PATCH 11/15] knfsd: RDMA transport core
Date: Tue, 22 May 2007 10:23:24 -0500	[thread overview]
Message-ID: <1179847404.9389.34.camel@trinity.ogc.int> (raw)
In-Reply-To: <18002.33117.816973.42500@notabene.brown>

On Tue, 2007-05-22 at 15:36 +1000, Neil Brown wrote:
> On Monday May 21, tom@opengridcomputing.com wrote:
> > On Mon, 2007-05-21 at 17:16 +1000, Neil Brown wrote:
> > > On Friday May 18, tom@opengridcomputing.com wrote:
> > > > On Fri, 2007-05-18 at 15:07 -0400, Trond Myklebust wrote:
> > > > 
> > > > > > +	while (xprt->sc_ctxt_cnt < target) {
> > > > > > +		xprt->sc_ctxt_cnt ++;
> > > > > > +		spin_unlock_irqrestore(&xprt->sc_ctxt_lock, flags);
> > > > > > +
> > > > > > +		ctxt = kmalloc(sizeof(*ctxt), GFP_KERNEL);
> > > > > > +
> > > > > > +		spin_lock_irqsave(&xprt->sc_ctxt_lock, flags);
> > > > > 
> > > > > You've now dropped the spinlock. How can you know that the condition
> > > > > xprt->sc_ctxt_cnt <= target is still valid?
> > > > > 
> > > > 
> > > > I increment the sc_ctxt_cnt with the lock held. If I'm at the limit, the
> > > > competing thread will find it equal -- correct? 
> > > > 
> > > 
> > > I'm having trouble figuring out who the competing thread is.
> > > As far as I can see (Well... "have seen"), this is always called from
> > > the context of an nfsd thread which has sole access to the xprt.  So
> > > there is no race.
> > > ??
> > 
> > The callback handler for SQ CQ (send queue completion queue) is called
> > on interrupt context. This function returns contexts to this queue when
> > cleaning up completed SQ WR (send queue work requests). See sq_cq_reap
> > in svc_rdma_transport.c, line #329.  
> 
> OK....
> So presumably svc_rdma_put_context should be incrementing sc_ctxt_cnt
> when it puts the ctxt back on the list?

The variable name is a little confusing. It's really the size of the
whole pool, not how many are in the pool at any given time. This is the
pool growing code and we're serializing with other threads trying to
grow the pool concurrently, not an interrupt handler returning a newly
freed context to the pool.

> 
> In which case you still get a race where you can put more ctxt's on
> the list than you really want.  But that is hardly a big problem.
> 
> So the current locking around the "<, ++" seems a bit misleading.
> Should sc_ctxt_cnt be made and atomic_t, and then bump_context_cache
> can be:
> 
>   while (atomic_read(sc_ctxt_cnt) < target) {
> 	ctxt = kmalloc();
> 	if (ctxt) {
> 		spin_lock_irqsave()
> 		if (atomic_read() < target) {
> 			// add to list; ctxt = NULL;
> 			atomic_inc(sc_ctxt_cnt)
> 		}
> 		spin_unlock_irqrestore()
> 		free(ctxt);
> 	}
>   }
> 
> ??
> That makes it clearer (to me) what is being locked.
> 
> I also think svc_rdma_get_context could be simplified a bit too.
> 
>   while (1) {
>        lock;
>        ctxt = head-of-list;
>        unlock;
>        if (ctxt)
> 		break;
>        if (!bump_cache)
> 		msleep(500)
>   }
>   initialise ctxt
> 
> 
> But I'm still not sold on the idea of allocating these contexts on
> demand.  Can you provide a bit more data about that?

The context is the application data associated with an RDMA WR (work
request) and contains the list of pages involved with the I/O an SGE
that refers to the pages and the type of I/O operation that was
requested. It tells the I/O completion handlers what to do, e.g. wake up
a thread waiting on a read completion, enqueue data to the recvfrom
dto_q, etc...

> e.g. maximum number of contexts that might be required for one
> request, vs typical number of contexts

Two at a minimum: One for the RQ WR (the request), one for the SQ WR
(the response). The limit is bounded by how deep your SQ is, but for
example, if the client does a NFS READ of 1MB, and the scatter gather
limit in the adapter is 4, then you'd get 1MB / (4 * 4k per page) + 1 =
65!

> .  And how big is a context
> anyway?

Roughly a page. The principle reason they're big is that it needs to
keep an array of the pages used in the request and for big-data that's
256 ptrs to pages.

> Also, what is the lifetime of these contexts?  Presumably they are
> allocated to handle a single request, and then get passing into the
> transport until the reply is fully sent?  

Yes. BTW, they're used on receive too.

> That makes them a lot like
> the rq_pages allocated as needed at the start of svc_recv.  Could the
> contexts be handled in much the same way?
> 

Kind of. The pages for an RDMA I/O are given to the adapter in advance
(like an mbuf in the stack). This is part of what the context is keeping
track of. What recvfrom actually does is swap the pages out of the rqstq
rq_pages array for the ones in the completing context avoiding the copy.
So after svc_recv has dutifully pre-allocated the pages, the RDMA
transport promptly swaps them with the ones in the completing context.

> Elsewhere nfsd goes to some effort to pre-allocate any needed memory.
> It would be nice if it were consistent.
> 

I think we would need another transport specific function due to the
fact that the buffers have to be around prior to the data ready
indication. This would make the rdma transport slightly more efficient
as it would avoid the allocation for the rq_arg pages.

> 
> NeilBrown

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

next prev parent reply	other threads:[~2007-05-22 14:38 UTC|newest]

Thread overview: 33+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-05-18 17:45 [RFC,PATCH 11/15] knfsd: RDMA transport core Tom Tucker
2007-05-18 19:07 ` Trond Myklebust
2007-05-18 20:07   ` Tom Tucker
2007-05-18 21:17     ` Trond Myklebust
2007-05-19  4:32       ` Tom Tucker
2007-05-21  7:16     ` Neil Brown
2007-05-21 16:02       ` Tom Tucker
2007-05-22  5:36         ` Neil Brown
2007-05-22 15:23           ` Tom Tucker [this message]
2007-05-18 19:24 ` J. Bruce Fields
2007-05-18 19:36   ` Tom Tucker
2007-05-18 19:42     ` J. Bruce Fields
2007-05-23 14:09     ` Greg Banks
2007-05-23 14:43       ` Tom Tucker
2007-05-23 14:55         ` Greg Banks
2007-05-23 15:03           ` Trond Myklebust
2007-05-23 15:12             ` Tom Tucker
2007-05-23 15:37               ` Trond Myklebust
2007-05-23 16:02                 ` Tom Tucker
2007-05-23 16:35                   ` Greg Banks
2007-05-23 16:29             ` Greg Banks
2007-05-23 18:07               ` Trond Myklebust
2007-05-23 18:19               ` Talpey, Thomas
2007-05-23 18:37                 ` Trond Myklebust
2007-05-23 18:59                   ` Talpey, Thomas
2007-05-23 20:01                     ` Trond Myklebust
2007-05-23 21:00                       ` Talpey, Thomas
2007-05-24  8:35                         ` Greg Banks
2007-05-24 13:45                           ` Talpey, Thomas
2007-05-23 15:03           ` Tom Tucker
2007-05-21  7:11 ` Neil Brown
2007-05-21 10:02   ` Greg Banks
2007-05-21 15:58   ` Tom Tucker

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1179847404.9389.34.camel@trinity.ogc.int \
    --to=tom@opengridcomputing.com \
    --cc=Thomas.Talpey@netapp.com \
    --cc=gnb@sgi.com \
    --cc=neilb@suse.de \
    --cc=nfs@lists.sourceforge.net \
    --cc=pleckie@melbourne.sgi.com \
    --cc=trond.myklebust@fys.uio.no \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.