Linux NFS development
 help / color / mirror / Atom feed
From: "J. Bruce Fields" <bfields@fieldses.org>
To: Jeff Layton <jeff.layton@primarydata.com>
Cc: Trond Myklebust <trondmy@gmail.com>,
	Chris Worley <chris.worley@primarydata.com>,
	linux-nfs@vger.kernel.org, Ben Myers <bpm@sgi.com>
Subject: Re: [PATCH 3/4] sunrpc: convert to lockless lookup of queued server threads
Date: Mon, 8 Dec 2014 14:58:55 -0500	[thread overview]
Message-ID: <20141208195855.GC16612@fieldses.org> (raw)
In-Reply-To: <20141208145429.56234bf2@tlielax.poochiereds.net>

On Mon, Dec 08, 2014 at 02:54:29PM -0500, Jeff Layton wrote:
> On Mon, 8 Dec 2014 13:57:31 -0500
> "J. Bruce Fields" <bfields@fieldses.org> wrote:
> 
> > On Tue, Dec 02, 2014 at 11:50:24AM -0500, J. Bruce Fields wrote:
> > > On Tue, Dec 02, 2014 at 07:14:22AM -0500, Jeff Layton wrote:
> > > > On Tue, 2 Dec 2014 06:57:50 -0500
> > > > Jeff Layton <jeff.layton@primarydata.com> wrote:
> > > > 
> > > > > On Mon, 1 Dec 2014 19:38:19 -0500
> > > > > Trond Myklebust <trondmy@gmail.com> wrote:
> > > > > 
> > > > > > On Mon, Dec 1, 2014 at 6:47 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> > > > > > > I find it hard to think about how we expect this to affect performance.
> > > > > > > So it comes down to the observed results, I guess, but just trying to
> > > > > > > get an idea:
> > > > > > >
> > > > > > >         - this eliminates sp_lock.  I think the original idea here was
> > > > > > >           that if interrupts could be routed correctly then there
> > > > > > >           shouldn't normally be cross-cpu contention on this lock.  Do
> > > > > > >           we understand why that didn't pan out?  Is hardware capable of
> > > > > > >           doing this really rare, or is it just too hard to configure it
> > > > > > >           correctly?
> > > > > > 
> > > > > > One problem is that a 1MB incoming write will generate a lot of
> > > > > > interrupts. While that is not so noticeable on a 1GigE network, it is
> > > > > > on a 40GigE network. The other thing you should note is that this
> > > > > > workload was generated with ~100 clients pounding on that server, so
> > > > > > there are a fair amount of TCP connections to service in parallel.
> > > > > > Playing with the interrupt routing doesn't necessarily help you so
> > > > > > much when all those connections are hot.
> > > > > > 
> > > > 
> > > > In principle though, the percpu pool_mode should have alleviated the
> > > > contention on the sp_lock. When an interrupt comes in, the xprt gets
> > > > queued to its pool. If there is a pool for each cpu then there should
> > > > be no sp_lock contention. The pernode pool mode might also have
> > > > alleviated the lock contention to a lesser degree in a NUMA
> > > > configuration.
> > > > 
> > > > Do we understand why that didn't help?
> > > 
> > > Yes, the lots-of-interrupts-per-rpc problem strikes me as a separate if
> > > not entirely orthogonal problem.
> > > 
> > > (And I thought it should be addressable separately; Trond and I talked
> > > about this in Westford.  I think it currently wakes a thread to handle
> > > each individual tcp segment--but shouldn't it be able to do all the data
> > > copying in the interrupt and wait to wake up a thread until it's got the
> > > entire rpc?)
> > 
> > By the way, Jeff, isn't this part of what's complicating the workqueue
> > change?  That would seem simpler if we didn't need to queue work until
> > we had the full rpc.
> > 
> 
> No, I don't think that really adds much in the way of complexity there.
> 
> I have that set working. Most of what's holding me up from posting the
> next iteration of that set is performance. So far, my testing shows
> that the workqueue-based code is slightly slower. I've been trying to
> figure out why that is and whether I can do anything about it. Maybe
> I'll go ahead and post it as a second RFC set, until I can get to the
> bottom of the perf delta.
> 
> I have pondered doing what you're suggesting above though and it's not a
> trivial change.
> 
> The problem is that all of the buffers into which we do receives are
> associated with the svc_rqst (which we don't really have when the
> interrupt comes in), and not the svc_xprt (which we do have at that
> point).
> 
> So, you'd need to restructure the code to hang a receive buffer off
> of the svc_xprt.

Have you looked at svsk->sk_pages and svc_tcp_{save,restore}_pages?

--b.

> Once you receive an entire RPC, you'd then have to
> flip that buffer over to a svc_rqst, queue up the job and grab a new
> buffer for the xprt (maybe you could swap them?).
> 
> The problem is what to do if you don't have a buffer (or svc_rqst)
> available when an IRQ comes in. You can't allocate one from softirq
> context, so you'd need to offload that case to a workqueue or something
> anyway (which adds a bit of complexity as you'd then have to deal with
> two different receive paths).
> 
> I'm also not sure about RDMA. When you get an RPC, the server usually
> has to do an RDMA READ from the client to pull all of the data in. I
> don't think you want to do that from softirq context, so that would
> also need to be queued up somehow.
> 
> All of that said, it would probably reduce some context switching if
> we can make that work. Also, I suspect that doing that in the context
> of the workqueue-based code would probably be at least a little simpler.
> 
> -- 
> Jeff Layton <jlayton@primarydata.com>

  reply	other threads:[~2014-12-08 19:58 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-11-21 19:19 [PATCH 0/4] sunrpc: reduce pool->sp_lock contention when queueing a xprt for servicing Jeff Layton
2014-11-21 19:19 ` [PATCH 1/4] sunrpc: add a rcu_head to svc_rqst and use kfree_rcu to free it Jeff Layton
2014-12-01 22:44   ` J. Bruce Fields
2014-12-01 23:05     ` Jeff Layton
2014-12-01 23:36       ` Trond Myklebust
2014-12-02  0:29         ` Jeff Layton
2014-12-02  0:52           ` Trond Myklebust
2014-12-09 17:05             ` J. Bruce Fields
2014-11-21 19:19 ` [PATCH 2/4] sunrpc: fix potential races in pool_stats collection Jeff Layton
2014-11-21 19:19 ` [PATCH 3/4] sunrpc: convert to lockless lookup of queued server threads Jeff Layton
2014-12-01 23:47   ` J. Bruce Fields
2014-12-02  0:38     ` Trond Myklebust
2014-12-02 11:57       ` Jeff Layton
2014-12-02 12:14         ` Jeff Layton
2014-12-02 16:50           ` J. Bruce Fields
2014-12-02 18:53             ` Ben Myers
2014-12-09 17:04               ` J. Bruce Fields
2014-12-08 18:57             ` J. Bruce Fields
2014-12-08 19:54               ` Jeff Layton
2014-12-08 19:58                 ` J. Bruce Fields [this message]
2014-12-08 20:24                   ` Jeff Layton
2014-12-09 16:57           ` J. Bruce Fields
2014-11-21 19:19 ` [PATCH 4/4] sunrpc: add some tracepoints around enqueue and dequeue of svc_xprt Jeff Layton
2014-12-02 13:31   ` Jeff Layton
2014-12-09 16:36     ` J. Bruce Fields
2014-11-25 21:25 ` [PATCH 0/4] sunrpc: reduce pool->sp_lock contention when queueing a xprt for servicing Jeff Layton
2014-11-26  0:09   ` J. Bruce Fields
2014-11-26  0:38     ` Jeff Layton
2014-11-26  2:40       ` J. Bruce Fields
2014-11-26 11:12         ` Jeff Layton
2014-12-09 16:44 ` J. Bruce Fields

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20141208195855.GC16612@fieldses.org \
    --to=bfields@fieldses.org \
    --cc=bpm@sgi.com \
    --cc=chris.worley@primarydata.com \
    --cc=jeff.layton@primarydata.com \
    --cc=linux-nfs@vger.kernel.org \
    --cc=trondmy@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox