Re: [RFC,PATCH 4/14] knfsd: has_wspace per transport

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Greg Banks <gnb@sgi.com>
To: Tom Tucker <tom@opengridcomputing.com>
Cc: Neil Brown <neilb@suse.de>,
	"J. Bruce Fields" <bfields@fieldses.org>,
	Thomas Talpey <Thomas.Talpey@netapp.com>,
	Linux NFS Mailing List <nfs@lists.sourceforge.net>,
	Peter Leckie <pleckie@melbourne.sgi.com>
Subject: Re: [RFC,PATCH 4/14] knfsd: has_wspace per transport
Date: Tue, 22 May 2007 21:16:54 +1000	[thread overview]
Message-ID: <20070522111653.GF1202@sgi.com> (raw)
In-Reply-To: <1179495234.23385.30.camel@trinity.ogc.int>

On Fri, May 18, 2007 at 08:33:54AM -0500, Tom Tucker wrote:
> On Fri, 2007-05-18 at 14:05 +1000, Greg Banks wrote:
> > On Thu, May 17, 2007 at 08:30:39PM +1000, Neil Brown wrote:
> > >  Do that mean that RDMA will never
> > > reject a write due to lack of space? 
> > 
> > No, it means that the current RDMA send code will block waiting
> > for space to become available.  That's right, nfsd threads block on
> > the network.  Steel yourself, there's worse to come.
> > 
> 
> Uh... Not really. The queue depths are designed to match credits to
> worst case reply sizes. In the normal case, it should never have to
> wait. The wait is to catch the margins in the same way that a kmalloc
> will wait for memory to become available. 

This is news to me, but then I just read that code not write it ;-)
I just poked around in your GIT tree and AFAICT the software queue
depth limits are set by this logic:

    743 static int
    744 svc_rdma_accept(struct svc_rqst *rqstp)
    745 {
    ...
    775         ret = ib_query_device(newxprt->sc_cm_id->device, &devattr);
    ...
    788         newxprt->sc_max_requests = min((size_t)devattr.max_qp_wr,
    789                                    (size_t)svcrdma_max_requests);
    790         newxprt->sc_sq_depth = RPCRDMA_SQ_DEPTH_MULT * newxprt->sc_max_requests;

     54 unsigned int svcrdma_max_requests = RPCRDMA_MAX_REQUESTS;

    143 #define RPCRDMA_SQ_DEPTH_MULT   8
    145 #define RPCRDMA_MAX_REQUESTS    16

A brief experiment shows that devattr.max_qp_wr = 16384 on recent
Mellanox cards, so the actual sc_sq_depth value will be ruled
by svcrdma_max_requests, and be 128.

The worst case is a 4K page machine writing a 1MB reply to a READ rpc,
which if I understand the code correctly uses a WR per page plus one
for the reply header, or 1024/4+1 = 257 WRs.  Now imagine all the nfsd
threads(*) trying to do that to the same QP.  SGI runs with 128 threads.

In this scenario, every call to svc_rdma_send() stands a good chance
of blocking.  The wait is not interruptible and has no timeout.  If
the client's HCA has a conniption under this load, the server will
have some large fraction of the nfsds unkillably blocked until the
server's HCA gives up and reports an error (I presume it does this?).

The wspace management code in knfsd is designed to avoid having
nfsds ever block in the network stack; I don't see how the NFS/RDMA
code achieves that.  Or is there something I've missed?

 * I would expect there to be a client-side limit on the number of
   outstanding READs so this doesn't normally happen, but part of the
   reason for knfsd's wspace management is to avoid DoS attacks from
   malicious or broken clients.

> There's actually a stat kept by the transport that counts the number of
> times it waits.

Ah, that would be rdma_stat_sq_starve?  It's been added since the
November code drop so I hadn't noticed it.  BTW is there a reason
for emitting statistics to userspace as sysctls?

> There is a place that a wait is done in the "normal" case and that's for
> the completion of an RDMA_READ in the process of gathering the data for
> and RPC on receive. That wait happens _every_ time.

Yes indeed, and this is the "worse" I was referring to.  I have
a crash dump in which 122 of the 128 nfsd threads are doing this:

 0 schedule+0x249c [0xa00000010052799c]
 1 schedule_timeout+0x1ac [0xa00000010052958c]
 2 rdma_read_xdr+0xf2c [0xa0000002216bcd8c]
 3 svc_rdma_recvfrom+0x1e5c [0xa0000002216bed7c]
 4 svc_recv+0xc8c [0xa00000022169f0ac]
 5 nfsd+0x1ec [0xa000000221d7504c]
 6 __svc_create_thread_tramp+0x30c [0xa0000002216976ac]
 7 kernel_thread_helper+0xcc [0xa00000010001290c]
 8 start_kernel_thread+0x1c [0xa0000001000094bc]

That doesn't leave a lot of threads to do real work, like waiting for
the filesystem.  And every now and again something goes awry in
IB land and each thread ends up waiting for 6 seconds or more.

Greg.
-- 
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
Apparently, I'm Bedevere.  Which MPHG character are you?
I don't speak for SGI.

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

next prev parent reply	other threads:[~2007-05-22 11:17 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-05-16 19:22 [RFC,PATCH 4/14] knfsd: has_wspace per transport Greg Banks
2007-05-16 21:10 ` J. Bruce Fields
2007-05-17  7:12   ` Greg Banks
2007-05-17 10:30     ` Neil Brown
2007-05-17 12:39       ` Talpey, Thomas
2007-05-18  0:30         ` Neil Brown
2007-05-18  4:05       ` Greg Banks
2007-05-18 13:33         ` Tom Tucker
2007-05-18 13:39           ` Tom Tucker
2007-05-22 11:16           ` Greg Banks [this message]
2007-05-22 17:34             ` Tom Tucker
2007-05-23  2:32               ` Greg Banks
2007-05-23  5:22                 ` Tom Tucker
2007-05-23  6:41                   ` Greg Banks
2007-05-23 13:36                     ` Chuck Lever
2007-05-23 14:39                       ` Greg Banks
2007-05-23 20:11                         ` Chuck Lever
2007-05-18 13:44         ` Talpey, Thomas
2007-05-18  6:21       ` Greg Banks
2007-05-18  6:38         ` Neil Brown

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20070522111653.GF1202@sgi.com \
    --to=gnb@sgi.com \
    --cc=Thomas.Talpey@netapp.com \
    --cc=bfields@fieldses.org \
    --cc=neilb@suse.de \
    --cc=nfs@lists.sourceforge.net \
    --cc=pleckie@melbourne.sgi.com \
    --cc=tom@opengridcomputing.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.