From: Tom Tucker <tom@opengridcomputing.com>
To: Greg Banks <gnb@sgi.com>
Cc: NeilBrown <neilb@suse.de>,
"J. Bruce Fields" <bfields@fieldses.org>,
"Talpey, Thomas" <Thomas.Talpey@netapp.com>,
Linux NFS Mailing List <nfs@lists.sourceforge.net>,
Peter Leckie <pleckie@melbourne.sgi.com>
Subject: Re: [RFC,PATCH 4/14] knfsd: has_wspace per transport
Date: Wed, 23 May 2007 00:22:44 -0500 [thread overview]
Message-ID: <C27939D4.30F92%tom@opengridcomputing.com> (raw)
In-Reply-To: <20070523023206.GA14076@sgi.com>
On 5/22/07 9:32 PM, "Greg Banks" <gnb@sgi.com> wrote:
> On Tue, May 22, 2007 at 12:34:23PM -0500, Tom Tucker wrote:
>> On Tue, 2007-05-22 at 21:16 +1000, Greg Banks wrote:
>>>
>>> The wspace management code in knfsd is designed to avoid having
>>> nfsds ever block in the network stack; I don't see how the NFS/RDMA
>>> code achieves that. Or is there something I've missed?
>>>
>>
>> You're dead on. We should implement wspace properly. I'll look into
>> fixing this.
>
> Great.
>
>> The bulk of the stats were exported in order to evaluate different I/O
>> models. For example, do we poll some number of times on an SQ stall? How
>> about an RQ stall, etc... I think many of these stats should be removed.
>> I'd appreciate feedback on which ones we should retain.
>
> Actually I quite like the stats, although the reap_[sr]q_at_* counters
> don't seem to be used. For rdma_stat_rq_prod, I would have had a
> counter increment where wc.status reports an error, thus futzing with
> the global cacheline only once in the normal success path.
>
> But I think it would be better to report them through the existing
> /proc/net/rpc/nfsd mechanism, i.e. add an RDMA-specific call to
> svc_seq_show() and add your new counters to struct svc_stat.
>
I need to do a little clean up (e.g. atomic_inc()) and then I'll add them as
suggested. If there are dissenters, please speak now...
>
>> What's happening here is that you're pounding the server with NFS WRITE
>> and recvfrom is dutifully queuing another thread to process the next RPC
>> in the RQ, which happens to require another RDMA READ and another wait
>> and so on. As you point out, this is not productive and quickly consumes
>> all your threads -- Ugh.
>
> Exactly.
>
>> A simple optimization to the current design for this case is that if the
>> RPC contains a read-list don't call svc_sock_enqueue until _after_ the
>> RDMA READ completes.
>
> I assume you meant "don't release the SK_BUSY bit", because
> svc_sock_enqueue will be called from the dto tasklet every time
> a new call arrives whether or not we've finished the RDMA READ.
>
Right. Good point. I might have spent a few hours debugging that one ;-)
>> The current thread will wait, but the remaining
>> threads (127 in your case) will be free to do other work for other mount
>> points.
>>
>> This only falls over in the limit of 128 mounts all pounding the server
>> with writes. Comments?
>
> A common deployment scenario I would expect to see would have at
> least twice as many clients as threads, and SGI have successfully
> tested scenarios with a 16:1 client:thread ratio. So I don't think
> we can solve the problem in any way which ties down a thread for
> up to 6 seconds per client, because there won't be enough threads
> to go around.
Agreed. But isn't that the 100 dead adapter scenario? Someone has to wait.
Who?
>
> The other case that needs to work is a single fat client trying to
> throw as many bits as possible down the wire. For this case your
> proposal would limit the server to effectively serving WRITE RPCs
> serially,
Not entirely. The approach would overlap the RPC fetch of the next WRITE
with the response to the previous.
>... with what I presume would be a catastrophic effect on
> write performance. What you want is for knfsd to be able to throw
> as many threads at the single fat client as possible.
>
> The knfsd design actually handles both these extreme cases quite
> well today with TCP sockets. The features which I think are key
> to providing this flexibility are:
>
> * a pool of threads whose size is only loosely tied to the number
> of clients
>
> * threads are not tied to clients, they compete for individual
> incoming RPCs
>
> * threads take care to avoid blocking on the network
>
> The problem we have achieving the last point for NFS/RDMA is that
> RDMA READs are fundamentally neither a per-svc_sock nor a per-thread
> concept. To achieve parallelism you want multiple sequences of RDMA
> READs proceeding per svc_sock. To avoid blocking threads you want a
> thread to be able to abandon a sequence of RDMA READs partway through
> and go back to it's main loop. Doing both implies a new object to hold
> the state for a single sequence of RDMA READs for a single WRITE RPC.
>
> So I think the "right" way to handle RDMA READs is to define a
> new struct, for argument's sake "struct svc_rdma_read", containing
> some large fraction of the auto variables currently in the function
> rdma_read_xdr() and some from it's caller. When an RPC with a read
> list arrives, allocate a struct svc_rdma_read and add it to a list
> of such structs on the corresponding struct svcxprt_rdma. Issue some
> READ WRs, until doing so would block. When the completions for issued
> WRs arrive, mark the struct svc_rdma_read, instead of waking the
> nfsd as you do now, set SK_DATA on the svcxprt_rdma and enqueue it.
> When an nfsd later calls svc_rdma_recvfrom(), first check whether any
> of the struct svc_rdma_reads in flight have progressed, and depending
> on whether they're complete either issue some more READ WRs or finalise
> the struct svc_rdma_read by setting up the thread's page array.
>
> In other words, invert the rdma_read_xdr() logic so nfsd return
> to the main loop instead of blocking.
>
> Unfortunately it's kind of a major change. Thoughts?
>
Ok, I love it and I hate it :-) This is the consolidated waiter strategy
that solves all the problems...except ...does this expose any read/write
ordering issues at the client? Couldn't the client issue a write followed by
a read and get the original data?
That's the hate it part. If we need to decide which requests are allowed to
proceed on a QP with outstanding READ WR, things get messy quick.
Is there no such requirement?
>>> And every now and again something goes awry in
>>> IB land and each thread ends up waiting for 6 seconds or more.
>>
>> Yes. Note that when this happens, the connection gets terminated.
>
> Indeed. Meanwhile, for the last 6 seconds all the nfsds are
> blocked waiting for the one client, and none of the other clients
> are getting any service.
We need to think about this. A dead adapter causes havoc.
>
> Greg.
-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
_______________________________________________
NFS maillist - NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs
next prev parent reply other threads:[~2007-05-23 5:17 UTC|newest]
Thread overview: 20+ messages / expand[flat|nested] mbox.gz Atom feed top
2007-05-16 19:22 [RFC,PATCH 4/14] knfsd: has_wspace per transport Greg Banks
2007-05-16 21:10 ` J. Bruce Fields
2007-05-17 7:12 ` Greg Banks
2007-05-17 10:30 ` Neil Brown
2007-05-17 12:39 ` Talpey, Thomas
2007-05-18 0:30 ` Neil Brown
2007-05-18 4:05 ` Greg Banks
2007-05-18 13:33 ` Tom Tucker
2007-05-18 13:39 ` Tom Tucker
2007-05-22 11:16 ` Greg Banks
2007-05-22 17:34 ` Tom Tucker
2007-05-23 2:32 ` Greg Banks
2007-05-23 5:22 ` Tom Tucker [this message]
2007-05-23 6:41 ` Greg Banks
2007-05-23 13:36 ` Chuck Lever
2007-05-23 14:39 ` Greg Banks
2007-05-23 20:11 ` Chuck Lever
2007-05-18 13:44 ` Talpey, Thomas
2007-05-18 6:21 ` Greg Banks
2007-05-18 6:38 ` Neil Brown
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=C27939D4.30F92%tom@opengridcomputing.com \
--to=tom@opengridcomputing.com \
--cc=Thomas.Talpey@netapp.com \
--cc=bfields@fieldses.org \
--cc=gnb@sgi.com \
--cc=neilb@suse.de \
--cc=nfs@lists.sourceforge.net \
--cc=pleckie@melbourne.sgi.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox