Re: [RFC,PATCH 11/15] knfsd: RDMA transport core

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Tom Tucker <tom@opengridcomputing.com>
To: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: NeilBrown <neilb@suse.de>,
	"Talpey, Thomas" <Thomas.Talpey@netapp.com>,
	Linux NFS Mailing List <nfs@lists.sourceforge.net>,
	Peter Leckie <pleckie@melbourne.sgi.com>,
	Greg Banks <gnb@sgi.com>
Subject: Re: [RFC,PATCH 11/15] knfsd: RDMA transport core
Date: Fri, 18 May 2007 23:32:45 -0500	[thread overview]
Message-ID: <C273E81D.308E5%tom@opengridcomputing.com> (raw)
In-Reply-To: <1179523079.6488.156.camel@heimdal.trondhjem.org>




On 5/18/07 4:17 PM, "Trond Myklebust" <trond.myklebust@fys.uio.no> wrote:

> On Fri, 2007-05-18 at 15:07 -0500, Tom Tucker wrote:
>> On Fri, 2007-05-18 at 15:07 -0400, Trond Myklebust wrote:
>> [...snip...]
>>> +       xprt->sc_ctxt_max);
>>>> +
>>>> + spin_lock_irqsave(&xprt->sc_ctxt_lock, flags);
>>> 
>>> Why do you need an irqsafe spinlock to protect this list?
>>> 
>> 
>> The svc_rdma_put_context function is called from interrupt context. This
>> function adds free contexts back to this list.
> 
> Why not move the calls to rq_cq_reap() and sq_cq_reap() into the
> tasklet? That way you can get away with only a bh-safe lock instead of
> having to shut down interrupts again.
> 

Yes, this could be done. The current irq lock is there to accommodate the SQ
CQ side. The RQ CQ processing is already done in a tasklet because it has to
call svc_sock_enqueue which takes _bh locks.

I would need an independent dto_q protected by irqsave locks if I wanted to
avoid unnecessary calls to ib_req_notify_cq. This call is very expensive (1+
us).

In the end I think we'd swap one irqsave lock per context (WR) for one
irqsave lock per interrupt and transport. That's probably a pretty good
trade off. 

>>>> + while (xprt->sc_ctxt_cnt < target) {
>>>> +  xprt->sc_ctxt_cnt ++;
>>>> +  spin_unlock_irqrestore(&xprt->sc_ctxt_lock, flags);
>>>> +
>>>> +  ctxt = kmalloc(sizeof(*ctxt), GFP_KERNEL);
>>>> +
>>>> +  spin_lock_irqsave(&xprt->sc_ctxt_lock, flags);
>>> 
>>> You've now dropped the spinlock. How can you know that the condition
>>> xprt->sc_ctxt_cnt <= target is still valid?
>>> 
>> 
>> I increment the sc_ctxt_cnt with the lock held. If I'm at the limit, the
>> competing thread will find it equal -- correct?
> 
> Fair enough.
> 
>>>> +  if (ctxt) {
>>>> +   at_least_one = 1;
>>>> +   ctxt->next = xprt->sc_ctxt_head;
>>>> +   xprt->sc_ctxt_head = ctxt;
>>>> +  } else {
>>>> +   /* kmalloc failed...give up for now */
>>>> +   xprt->sc_ctxt_cnt --;
>>>> +   break;
>>>> +  }
>>>> + }
>>>> + spin_unlock_irqrestore(&xprt->sc_ctxt_lock, flags);
>>>> +
>>>> + return at_least_one;
>>>> +}
>>>> +
>>>> +struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *xprt)
>>>> +{
>>>> + struct svc_rdma_op_ctxt *ctxt;
>>>> + unsigned long flags;
>>>> +
>>>> + while (1) {
>>>> +  spin_lock_irqsave(&xprt->sc_ctxt_lock, flags);
>>>> +  if (unlikely(xprt->sc_ctxt_head == NULL)) {
>>>> +   /* Try to bump my cache. */
>>>> +   spin_unlock_irqrestore(&xprt->sc_ctxt_lock, flags);
>>>> +
>>>> +   if (rdma_bump_context_cache(xprt))
>>>> +    continue;
>>>> +
>>>> +   printk(KERN_INFO "svcrdma: sleeping waiting for context "
>>>> +          "memory on xprt=%p\n",
>>>> +          xprt);
>>>> +   schedule_timeout_uninterruptible(msecs_to_jiffies(500));
>>> 
>>>   (HZ >> 1)
>>> However this is rather naughty: you are tying up an nfsd thread that
>>> could be put to doing useful work elsewhere.
>>> 
>> 
>> True, it could be processing requests for another client.
>> 
>> Here was my dilemma. I can actually predict what the worst case number
>> is so that I'm guaranteed not to fail. The problem is that this is a
>> very big number and I wanted to have a footprint driven by load as
>> opposed to worst-case. This led to the code above.
>> 
>> As an intermediate approach, I believe that I need about 4 contexts to
>> be guaranteed I can process a request. I could attempt to acquire them
>> all (at the top of svc_rdma_recvfrom) and cache them in the rqstp
>> structure. If I couldn't get them, I would just return 0 (EAGAIN). This
>> would be marginally wasteful on a given request, but would avoid ever
>> sleeping for a context.
>> 
>> Comments?
> 
> How about just keeping a fixed size pool, and deferring handling more
> RDMA requests until you have enough free contexts in your pool to
> guarantee that you can complete the request? That would be analogous to
> the way the existing RPC server socket code handles the issue of socket
> send buffers.

Yes, I think the we're saying the same thing with respect to the deferral
mechanism, however, I'm proposing attempting to grow the pool when below the
limit but deferring instead of blocking if I couldn't get the memory. I'm
worried about the size of the RDMA transport's memory footprint with lots of
clients/mounts. 
 
> 
>>>> +   continue;
>>>> +  }
>>>> +  ctxt = xprt->sc_ctxt_head;
>>>> +  xprt->sc_ctxt_head = ctxt->next;
>>>> +  spin_unlock_irqrestore(&xprt->sc_ctxt_lock, flags);
>>>> +  ctxt->xprt = xprt;
>>>> +  INIT_LIST_HEAD(&ctxt->dto_q);
>>>> +  break;
>>>> + }
>>>> + ctxt->count = 0;
>>>> + return ctxt;
>>>> +}
> 
> <snip>
> 
>>>> +
>>>> +struct page *svc_rdma_get_page(void)
>>>> +{
>>>> + struct page *page;
>>>> +
>>>> + while ((page = alloc_page(GFP_KERNEL))==NULL) {
>>>> +  /* If we can't get memory, wait a bit and try again */
>>>> +  printk(KERN_INFO "svcrdma: out of memory...retrying in 1000
>>>> jiffies.\n");
>>>> +  schedule_timeout_uninterruptible(msecs_to_jiffies(1000));
>>> HZ
>>> See comment above about tying up threads. Also note that you are
>>> probably better off using __GFP_NOFAIL instead of the loop.
>>> 
>> 
>> I thought that the __GFP_NOFAIL pool was a precious resource and that a
>> memory out condition was one of the places you'd actually want to sleep.
>> I basically copied this approach from the svc_recv implementation in
>> svcsock.c
> 
> AFAIK, __GFP_NOFAIL just means keep retrying until the allocation
> succeeds. It is not tied to a specific resource.
> 

Ah, you're right. But I think think the retry is attached to a timeout
(HZ/50) or whenever any backing store device exits it's write congestion
callback. Basically, the retries will occur a lot more aggressively. Maybe
this is OK? It will certainly make the code look better.

BTW, the code at the top of svc_recv does the same thing. Should it be
changed too?

> 
> 



-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

next prev parent reply	other threads:[~2007-05-19  4:27 UTC|newest]

Thread overview: 33+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-05-18 17:45 [RFC,PATCH 11/15] knfsd: RDMA transport core Tom Tucker
2007-05-18 19:07 ` Trond Myklebust
2007-05-18 20:07   ` Tom Tucker
2007-05-18 21:17     ` Trond Myklebust
2007-05-19  4:32       ` Tom Tucker [this message]
2007-05-21  7:16     ` Neil Brown
2007-05-21 16:02       ` Tom Tucker
2007-05-22  5:36         ` Neil Brown
2007-05-22 15:23           ` Tom Tucker
2007-05-18 19:24 ` J. Bruce Fields
2007-05-18 19:36   ` Tom Tucker
2007-05-18 19:42     ` J. Bruce Fields
2007-05-23 14:09     ` Greg Banks
2007-05-23 14:43       ` Tom Tucker
2007-05-23 14:55         ` Greg Banks
2007-05-23 15:03           ` Trond Myklebust
2007-05-23 15:12             ` Tom Tucker
2007-05-23 15:37               ` Trond Myklebust
2007-05-23 16:02                 ` Tom Tucker
2007-05-23 16:35                   ` Greg Banks
2007-05-23 16:29             ` Greg Banks
2007-05-23 18:07               ` Trond Myklebust
2007-05-23 18:19               ` Talpey, Thomas
2007-05-23 18:37                 ` Trond Myklebust
2007-05-23 18:59                   ` Talpey, Thomas
2007-05-23 20:01                     ` Trond Myklebust
2007-05-23 21:00                       ` Talpey, Thomas
2007-05-24  8:35                         ` Greg Banks
2007-05-24 13:45                           ` Talpey, Thomas
2007-05-23 15:03           ` Tom Tucker
2007-05-21  7:11 ` Neil Brown
2007-05-21 10:02   ` Greg Banks
2007-05-21 15:58   ` Tom Tucker

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=C273E81D.308E5%tom@opengridcomputing.com \
    --to=tom@opengridcomputing.com \
    --cc=Thomas.Talpey@netapp.com \
    --cc=gnb@sgi.com \
    --cc=neilb@suse.de \
    --cc=nfs@lists.sourceforge.net \
    --cc=pleckie@melbourne.sgi.com \
    --cc=trond.myklebust@fys.uio.no \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.